I’m trying to scale/accelerate the training of a multiclass model with the
`multi:softprob` objective across multiple GPUs. While this objective is listed as supported on GPU, it appears that the metric functions required for multiclass training (`merror`) are not.
The result seems to be that multiclass training on multiple GPUs is not actually supported. Is this correct? If so, is support coming soon?
Here are some details of what I’ve tried:
- I can train multiclass models on a single GPU of a multi-GPU machine.
- It is possible to run multi-GPU multiclass training with the `merror` metric; however, training is unstable and blows up after a few iterations. (It doesn't crash on the GPU: it quickly finishes all iterations, then crashes the next time I call `Booster.train` in the program, with this error: `Check failed: distribution_.IsEmpty() || distribution.IsEmpty()`.)
- I’ve tried removing all eval sets from training (training set only) in case it is a multi-prediction issue; it still blows up.
- Passing no explicit eval metrics, or a binary eval metric such as `error`, results in a failure like `Check failed: preds.size() == info.labels_.Size() (1500000 vs. 500000) label and prediction size not match, hint: use merror or mlogloss for multi-class classification`.
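For reference, the parameter set I'm passing looks roughly like this (a minimal sketch; the class count and data are placeholders, not my real setup, and multi-GPU selection here uses the 0.8x-era `n_gpus` parameter together with `gpu_hist`):

```python
# Sketch of the parameters used for the multi-GPU attempt (placeholder values).
params = {
    "objective": "multi:softprob",   # multiclass probability output
    "num_class": 3,                  # placeholder class count
    "tree_method": "gpu_hist",       # GPU histogram tree construction
    "n_gpus": -1,                    # 0.8x-era parameter: use all available GPUs
    "eval_metric": "merror",         # the metric that appears unsupported multi-GPU
}

# Training is then invoked as usual, e.g.:
# booster = xgb.train(params, dtrain, num_boost_round=100,
#                     evals=[(dtrain, "train")])
```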
XGB version is 0.81, compiled from source on Ubuntu 16.04 (in a container). CUDA and NCCL were installed via these deb packages from NVIDIA’s repository:
`cuda-toolkit-10-0 libnccl2=2.4.2-1+cuda10.0 libnccl-dev=2.4.2-1+cuda10.0`
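The build itself was a standard CMake GPU build, roughly like this (a sketch, assuming the usual 0.8x-era flags for enabling CUDA and NCCL support):

```shell
# Sketch of the from-source GPU build (flags per the 0.8x build docs).
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost && mkdir build && cd build
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j4
```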
Thanks for your help,