Run xgboost on Multi Node Multi GPU

Hi community,

Are there any set of steps to run xgboost on multi GPU/ multi node multi GPU on CLI. I am able to compile xgboost with multi GPU support using NCCL. But for running, I am not exactly sure how to use Dask to setup a GPU cluster? If some documentation/help can be provided for it, it will be helpful.

1 Like

Here is a tutorial to start: https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7.

1 Like

Thanks for this. I understand that this is integrated with xgboost python library. What if I want to integrated Dask with the xgboost C++ binary, how can I go about doing that? (need something analogous to submitting job to dmlc-tracker and then calling dask cluster if the job is a GPU job)

Dask is a Python library, so it operates with Python programs. So Dask does not interact with the C++ library directly.

Dask launches workers running XGBoost Python, which then calls C API functions in libxgboost.so.

Ok. So I understand that for the latest version you have support for dask with xgboost python library. For earlier versions, how did you run xgboost model building on multi GPU? In the xgboost 0.82 documentation I see that there is support for multiple GPU run in the release notes:

New feature: Multi-Node, Multi-GPU training (#4095)

  • In Dask, users will be able to construct a collection of XGBoost processes over an inhomogeneous device cluster (i.e. workers with different number and/or kinds of GPUs).

But I am not able understand how in 0.82 can xgboost be used with dask.

@VigneshN1997 In 0.82, there was no convenient wrapper for provisioning worker processes with GPU allocation. You had to manually invoke the DMLC tracker script like this:


While 0.82 contained an initial support for multi-GPU training, it was quite difficult to set up in practice.

Starting with 1.0.0, XGBoost provides a seamless integration with Dask, so that it’s now easy to provision workers from the main process:

with LocalCUDACluster(n_workers=2) as cluster:
    with Client(cluster) as client:
        dask_df = dask.dataframe.read_csv(fname, header=None, names=colnames)
        X = dask_df[dask_df.columns.difference(['label'])]
        y = dask_df['label']
        dtrain = xgb.dask.DaskDMatrix(client, X, y)
        output = xgb.dask.train(client,
                            {'tree_method': 'gpu_hist'},
                            dtrain,
                            num_boost_round=100,
                            evals=[(dtrain, 'train')])

In addition, 0.82 used to support using multiple GPUs from a single process, whereas 1.0.0 drops that support: https://github.com/dmlc/xgboost/issues/4531. The rationale is that managing multiple GPUs from a single process introduces lots of complex code path, and enforcing 1:1 assignment between GPUs and processes greatly simplifies the code. Currently, if you have N GPU cards in your machine, you’d need to provision N worker processes. This is easy to do using the LocalCUDACluster abstraction.

Thank you. I will look into this and get back if I have any doubts.

Although most of nvidia’s rapids.ai packages are python wrappers, you may find some guidance there.