Should we pre-split the dataset into smaller shards when using distributed XGBoost?

Sorry to disturb. I'm a beginner with the distributed version of XGBoost. Recently, I've been planning to use distributed XGBoost to train a ranking task.
I found some docs about the data format for ranking here: https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#embedding-additional-information-inside-libsvm-file
But I'm confused about the details. Should I split the whole dataset into smaller pieces and place each piece on a different node, or should I place the whole dataset on every node?
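For reference, my understanding of that format is that each line is `label qid:<group> feature:value ...`, like the example in the doc:

```
1 qid:1 101:1.2 102:0.03
0 qid:1 1:2.1 10001:300 10002:400
0 qid:2 0:1.25 102:0.41 10005:2
1 qid:2 0:0.13 1:0.21
```

If pre-splitting is the right approach, I assume all lines sharing a qid would have to stay in the same shard, but I'm not sure about that either.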

Another question, not related to distributed XGBoost: when training a ranking task, how can I plot the pairwise ranking loss during training?
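Something like the following is what I have in mind, using the single-node Python API (a rough sketch with made-up file names; as far as I can tell, XGBoost only reports ranking metrics such as ndcg/map rather than the raw pairwise loss, so I track ndcg as a proxy):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Hypothetical file names; the qid embedded in the libsvm file provides
# the group information for ranking. Newer XGBoost releases may require
# appending "?format=libsvm" to the path.
dtrain = xgb.DMatrix("train.txt")
dvalid = xgb.DMatrix("valid.txt")

params = {
    "objective": "rank:pairwise",
    "eval_metric": "ndcg",  # ranking metric tracked each round
    "eta": 0.1,
}

evals_result = {}  # filled in by xgb.train with per-round metric values
xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    evals_result=evals_result,
)

plt.plot(evals_result["train"]["ndcg"], label="train")
plt.plot(evals_result["valid"]["ndcg"], label="valid")
plt.xlabel("boosting round")
plt.ylabel("ndcg")
plt.legend()
plt.show()
```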

How are you setting up distributed training? If you are using Spark or Dask, they offer their own way to split data among workers.
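For example, with the Dask interface you hand Dask collections to XGBoost and the partitioning is handled for you. A rough local-cluster sketch with random data, just for illustration:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2))

    # Dask partitions the arrays itself; no manual pre-splitting is needed.
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

    # Each worker only sees the partitions assigned to it.
    dtrain = dxgb.DaskDMatrix(client, X, y)
    output = dxgb.train(
        client,
        {"objective": "binary:logistic", "eval_metric": "logloss"},
        dtrain,
        num_boost_round=10,
    )
    booster = output["booster"]  # trained model, plus output["history"] for metrics
```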

Thanks for replying. We use the Kubeflow platform: https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html. The official operator's data-loading details are here: https://github.com/kubeflow/xgboost-operator/blob/56c2c07525a826c285cf0efed99e636edddb676f/config/samples/xgboost-dist/utils.py#L103.
Does that mean we can just use the API without worrying about the data-distribution logic behind it? And should our train/eval/test datasets follow this data format? https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#embedding-additional-information-inside-libsvm-file

Please file an issue at https://github.com/kubeflow/xgboost-operator. I have no idea how data should be formatted for Kubeflow.

Ok, thanks very much.