Metric functions for multiclass training on multiple GPUs


#1

Hi folks,

I’m trying to scale/accelerate the training of a multiclass model with the multi:softprob objective across multiple GPUs. While this objective is listed as supported on GPU, the metric functions required for multiclass training (mlogloss or merror) do not appear to be supported.

The result seems to be that multiclass training on multiple GPUs is not actually supported. Is this correct? If so, is support coming soon?

Here are some details of what I’ve tried:

  • I can train multiclass models on a single GPU of a multi-GPU machine.
  • Multi-GPU multiclass training with the mlogloss or merror metrics does run, but training is unstable and blows up after a few iterations. It doesn’t crash on the GPU: it quickly finishes all iterations, then crashes the next time I call Booster.train in the program, with this error: Check failed: distribution_.IsEmpty() || distribution.IsEmpty()
  • I’ve tried removing all evaluation sets (keeping the training set only) in case it is a multi-prediction issue. Training still blows up.
  • Passing no explicit eval metrics, or binary eval metrics such as error, results in a failure similar to: Check failed: preds.size() == info.labels_.Size() (1500000 vs. 500000) label and prediction size not match, hint: use merror or mlogloss for multi-class classification
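
For anyone puzzled by the size mismatch in that last error (1500000 vs. 500000): multi:softprob emits one probability per class per row, so with 3 classes the prediction vector is 3x the label vector, and a binary metric like error cannot consume it. Below is a minimal pure-Python sketch of what the multiclass metrics compute. It is an illustration only, not XGBoost's implementation; the helper names (reshape_softprob, merror, mlogloss as plain functions) and the toy data are made up for this post, and it assumes the flat predictions are laid out row by row.

```python
import math

def reshape_softprob(flat_preds, num_class):
    """Group a flat prediction vector into one probability row per sample.

    Assumes a row-major layout: the first num_class values belong to row 0.
    """
    if len(flat_preds) % num_class != 0:
        raise ValueError("prediction length is not a multiple of num_class")
    return [flat_preds[i:i + num_class]
            for i in range(0, len(flat_preds), num_class)]

def merror(probs, labels):
    """Fraction of rows whose highest-probability class differs from the label."""
    wrong = 0
    for row, y in zip(probs, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted != y:
            wrong += 1
    return wrong / len(labels)

def mlogloss(probs, labels, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for row, y in zip(probs, labels):
        p = min(max(row[y], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(labels)

# Toy data: 4 rows, 3 classes, so 12 flat predictions for 4 labels.
labels = [0, 2, 1, 1]
flat = [0.7, 0.2, 0.1,
        0.1, 0.3, 0.6,
        0.2, 0.5, 0.3,
        0.6, 0.3, 0.1]

probs = reshape_softprob(flat, num_class=3)
print(len(flat), "predictions vs.", len(labels), "labels")
print("merror:", merror(probs, labels))
print("mlogloss:", mlogloss(probs, labels))
```

A metric that expects one prediction per label would see 12 values against 4 labels here, which is the same kind of mismatch the check reports.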

The XGBoost version is 0.81, compiled from source on Ubuntu 16.04 (in a container). CUDA and NCCL were installed via these deb packages from NVIDIA’s repository: cuda-toolkit-10-0 libnccl2=2.4.2-1+cuda10.0 libnccl-dev=2.4.2-1+cuda10.0

Thanks for your help,
-Jeff


#2

@jiaming Do we actually support multi-class classification on multiple GPUs?