I’m trying to scale/accelerate the training of a multiclass model with the
`multi:softprob` objective across multiple GPUs. While this objective is listed as supported on GPU, it appears that the metric functions required for multiclass training (`merror`) are not.
The result seems to be that multiclass training on multiple GPUs is not actually supported. Is this correct? If so, is support coming soon?
Here are some details of what I’ve tried:
- I can train multiclass models on a single GPU of a multi-GPU machine.
- It is possible to run multi-GPU multiclass training with the `merror` metric; however, training is unstable and blows up after a few iterations. (It doesn't crash on the GPU: it quickly finishes all iterations, then crashes the next time I call `Booster.train` in the program, with this error: `Check failed: distribution_.IsEmpty() || distribution.IsEmpty()`.)
- I’ve tried removing all eval sets from training (training set only) in case it is a multi-prediction issue; it still blows up.
- Passing no explicit eval metrics, or a binary eval metric such as `error`, results in a failure like `Check failed: preds.size() == info.labels_.Size() (1500000 vs. 500000) label and prediction size not match, hint: use merror or mlogloss for multi-class classification`.
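For reference, the parameter set I'm passing looks roughly like this (a minimal sketch; the class count and data are placeholders, not my real setup, and multi-GPU selection here uses the 0.8x-era `n_gpus` parameter together with `gpu_hist`):

```python
# Sketch of the parameters used for the multi-GPU attempt (placeholder values).
params = {
    "objective": "multi:softprob",   # multiclass probability output
    "num_class": 3,                  # placeholder class count
    "tree_method": "gpu_hist",       # GPU histogram tree construction
    "n_gpus": -1,                    # 0.8x-era parameter: use all available GPUs
    "eval_metric": "merror",         # the metric that appears unsupported multi-GPU
}

# Training is then invoked as usual, e.g.:
# booster = xgb.train(params, dtrain, num_boost_round=100,
#                     evals=[(dtrain, "train")])
```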
XGB version is 0.81, compiled from source on Ubuntu 16.04 (in a container). CUDA and NCCL were installed via these deb packages from NVIDIA’s repository:
`cuda-toolkit-10-0 libnccl2=2.4.2-1+cuda10.0 libnccl-dev=2.4.2-1+cuda10.0`
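The build itself was a standard CMake GPU build, roughly like this (a sketch, assuming the usual 0.8x-era flags for enabling CUDA and NCCL support):

```shell
# Sketch of the from-source GPU build (flags per the 0.8x build docs).
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost && mkdir build && cd build
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j4
```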
Thanks for your help,