Severely decreased performance when multiple xgboost processes are running

jefshe · March 30, 2020, 4:15am

Hi all,

We’re currently running xgboost & openmp in a production environment via the R package. See: https://github.com/Displayr/flipMultivariates/blob/master/R/gradientboost.R#L147 for how we’re calling it.

We’ve noticed that if multiple xgboost processes are running at the same time, we get horrible runtime performance. I observe that the default xgboost behaviour is to spawn as many threads as there are cores on the machine, which causes a lot of contention. When running 4 xgboost processes on a 16-core machine, I get slowdowns of a factor of 30.
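
The contention described here is ordinary thread oversubscription, and the arithmetic can be sketched as follows. This is a minimal illustration, not xgboost code: the function names are hypothetical, and it assumes each process really does start one thread per core.

```cpp
#include <cassert>

// Hypothetical helpers (not part of xgboost or OpenMP) that work through
// the thread counts in the scenario described in this post.

// Total threads when every process starts `threads_per_process` threads.
int total_threads(int processes, int threads_per_process) {
  assert(processes > 0 && threads_per_process > 0);
  return processes * threads_per_process;
}

// Average number of threads competing for each core.
double threads_per_core(int processes, int threads_per_process, int cores) {
  assert(cores > 0);
  return static_cast<double>(total_threads(processes, threads_per_process)) / cores;
}
```

With 4 processes each starting 16 threads on 16 cores, `total_threads(4, 16)` is 64 and `threads_per_core(4, 16, 16)` is 4.0, so on average four threads compete for every core.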

Is there a way to customize this behaviour? I saw that OpenMP has an OMP_DYNAMIC flag, but it doesn’t seem to work.

Has anyone else encountered such problems?

jefshe · March 30, 2020, 4:29am

I observe lots of threads just busy waiting (threads calling do_spin()) around these openmp loops:

github.com

dmlc/xgboost/blob/master/src/tree/updater_colmaker.cc#L567


    fsplits.push_back(tree[nid].SplitIndex());
  }
}
std::sort(fsplits.begin(), fsplits.end());
fsplits.resize(std::unique(fsplits.begin(), fsplits.end()) - fsplits.begin());
for (const auto &batch : p_fmat->GetBatches<SortedCSCPage>()) {
  for (auto fid : fsplits) {
    auto col = batch[fid];
    const auto ndata = static_cast<bst_omp_uint>(col.size());
#pragma omp parallel for schedule(static)
    for (bst_omp_uint j = 0; j < ndata; ++j) {
      const bst_uint ridx = col[j].index;
      const int nid = this->DecodePosition(ridx);
      const bst_float fvalue = col[j].fvalue;
      // go back to parent, correct those who are not default
      if (!tree[nid].IsLeaf() && tree[nid].SplitIndex() == fid) {
        if (fvalue < tree[nid].SplitCond()) {
          this->SetEncodePosition(ridx, tree[nid].LeftChild());
        } else {
          this->SetEncodePosition(ridx, tree[nid].RightChild());
        }

github.com

dmlc/xgboost/blob/master/src/tree/updater_colmaker.cc#L521


                        const RegTree& tree) {
// set the positions in the nondefault
this->SetNonDefaultPosition(qexpand, p_fmat, tree);
// set rest of instances to default position
// set default direct nodes to default
// for leaf nodes that are not fresh, mark then to ~nid,
// so that they are ignored in future statistics collection
const auto ndata = static_cast<bst_omp_uint>(p_fmat->Info().num_row_);


#pragma omp parallel for schedule(static)
for (bst_omp_uint ridx = 0; ridx < ndata; ++ridx) {
  CHECK_LT(ridx, position_.size())
      << "ridx exceed bound " << "ridx="<<  ridx << " pos=" << position_.size();
  const int nid = this->DecodePosition(ridx);
  if (tree[nid].IsLeaf()) {
    // mark finish when it is not a fresh leaf
    if (tree[nid].RightChild() == -1) {
      position_[ridx] = ~nid;
    }
  } else {
    // push to default branch

jefshe · March 30, 2020, 10:35am

The majority of threads spend their time busy-waiting. Is there any way to cut down on that?
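
One knob that may be relevant is the standard OpenMP environment variable `OMP_WAIT_POLICY`: the value `PASSIVE` asks the runtime to let idle threads sleep rather than spin. The sketch below sets it from a launcher. The helper names are hypothetical, `setenv` assumes a POSIX system, and whether this actually reduces the time spent in `do_spin()` for this workload is untested.

```cpp
#include <cassert>
#include <cstdlib>  // getenv; setenv is POSIX, not standard C++
#include <string>

// Hypothetical helper: request a passive wait policy for OpenMP threads.
// Assumption: this runs in a launcher *before* the xgboost process starts,
// since OpenMP runtimes generally read their environment once at start-up.
void request_passive_wait_policy() {
  const int rc = setenv("OMP_WAIT_POLICY", "PASSIVE", /*overwrite=*/1);
  assert(rc == 0);  // setenv returns 0 on success
  (void)rc;         // avoid an unused-variable warning when NDEBUG is set
}

// Hypothetical helper: read the value back, or "(unset)" if it is absent.
std::string current_wait_policy() {
  const char* value = std::getenv("OMP_WAIT_POLICY");
  return value ? std::string(value) : std::string("(unset)");
}
```

Child processes started after `request_passive_wait_policy()` inherit the setting; changing it inside a process whose OpenMP runtime is already initialised is unlikely to have any effect.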

hcho3 · March 31, 2020, 9:29am

Try setting the environment variable OMP_NUM_THREADS to 1. This should force the OpenMP runtime to use a single thread per XGBoost process.
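
As a rough sketch of this suggestion, a launcher could cap the OpenMP thread count before starting each xgboost process. The helper names below are hypothetical and `setenv` assumes a POSIX system. Splitting the cores evenly between processes is an assumption added here as an alternative to a hard-coded 1; it is not part of the advice above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>  // getenv; setenv is POSIX, not standard C++
#include <string>

// Hypothetical helper: even share of cores per process, never less than 1.
int threads_per_process(int cores, int processes) {
  assert(cores > 0 && processes > 0);
  return std::max(1, cores / processes);
}

// Hypothetical helper: export OMP_NUM_THREADS so that child processes
// started afterwards inherit the cap. Returns true on success.
bool cap_omp_threads(int n) {
  assert(n > 0);
  const std::string value = std::to_string(n);
  return setenv("OMP_NUM_THREADS", value.c_str(), /*overwrite=*/1) == 0;
}
```

For the 16-core, 4-process case from the original post, `threads_per_process(16, 4)` gives 4, while `cap_omp_threads(1)` reproduces the single-thread setting suggested above.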