Should we pre-split the dataset into smaller shards when using distributed XGBoost?

Sorry to disturb. I'm a beginner with the distributed version of XGBoost. Recently, I've been planning to use distributed XGBoost to train a ranking task.
I found some docs about the data format for ranking here.
But I'm confused about the details. Should I split the whole dataset into smaller shards and place one shard on each node, or should I place the whole dataset on every node?

Another question, not related to distributed XGBoost: when I do a ranking task, how can I plot the pairwise ranking loss during training?

How are you setting up distributed training? If you are using Spark or Dask, they offer their own way to split data among workers.

Thanks for replying. We use the Kubeflow platform. The official operator's details are here.
Does that mean we can just use the API without worrying about the data-partitioning logic behind it? Should our train/eval/test datasets follow this data format?

Please file an issue at . I have no idea how data should be formatted for Kubeflow.

Ok, thanks very much.