Should we pre-split the dataset into smaller shards when using distributed XGBoost?

Sorry to disturb. I'm a beginner with the distributed version of XGBoost. Recently, I've been planning to use distributed XGBoost to train a ranking task.
I found some docs about the data format for ranking here: https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#embedding-additional-information-inside-libsvm-file
But I'm confused about the details. Should I split the whole dataset into smaller pieces and place each piece on a different node, or should I place the whole dataset on every node?
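For reference, my understanding of that format is that each line is `label qid:<group> feature:value ...`, like the example in the doc:

```
1 qid:1 101:1.2 102:0.03
0 qid:1 1:2.1 10001:300 10002:400
0 qid:2 0:1.25 102:0.41 10005:2
1 qid:2 0:0.13 1:0.21
```

If pre-splitting is the right approach, I assume all lines sharing a qid would have to stay in the same shard, but I'm not sure about that either.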

Another question, not related to distributed XGBoost: when training a ranking task, how can I plot the pairwise ranking loss during training?
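Something like the following is what I have in mind, using the single-node Python API (a rough sketch with made-up file names; as far as I can tell, XGBoost only reports ranking metrics such as ndcg/map rather than the raw pairwise loss, so I track ndcg as a proxy):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Hypothetical file names; the qid embedded in the libsvm file provides
# the group information for ranking. Newer XGBoost releases may require
# appending "?format=libsvm" to the path.
dtrain = xgb.DMatrix("train.txt")
dvalid = xgb.DMatrix("valid.txt")

params = {
    "objective": "rank:pairwise",
    "eval_metric": "ndcg",  # ranking metric tracked each round
    "eta": 0.1,
}

evals_result = {}  # filled in by xgb.train with per-round metric values
xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    evals_result=evals_result,
)

plt.plot(evals_result["train"]["ndcg"], label="train")
plt.plot(evals_result["valid"]["ndcg"], label="valid")
plt.xlabel("boosting round")
plt.ylabel("ndcg")
plt.legend()
plt.show()
```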

How are you setting up distributed training? If you are using Spark or Dask, they offer their own way to split data among workers.
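For example, with the Dask interface you hand Dask collections to XGBoost and the partitioning is handled for you. A rough local-cluster sketch with random data, just for illustration:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2))

    # Dask partitions the arrays itself; no manual pre-splitting is needed.
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

    # Each worker only sees the partitions assigned to it.
    dtrain = dxgb.DaskDMatrix(client, X, y)
    output = dxgb.train(
        client,
        {"objective": "binary:logistic", "eval_metric": "logloss"},
        dtrain,
        num_boost_round=10,
    )
    booster = output["booster"]  # trained model, plus output["history"] for metrics
```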

Thanks for replying. We use the Kubeflow platform: https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html. The official operator's data-loading details are here: https://github.com/kubeflow/xgboost-operator/blob/56c2c07525a826c285cf0efed99e636edddb676f/config/samples/xgboost-dist/utils.py#L103.
Does that mean we can just use the API without worrying about the data-distribution logic behind it? And should our train/eval/test datasets follow this data format? https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#embedding-additional-information-inside-libsvm-file

Please file an issue at https://github.com/kubeflow/xgboost-operator. I have no idea how data should be formatted for Kubeflow.

Ok, thanks very much.