Training stuck for no apparent reason


I am trying to train an XGBRegressor (XGBoost's sklearn wrapper) on a given problem, with a custom objective function. Training starts correctly: the log shows information about the tree-growing procedure. However, it gets stuck after a certain number of rounds, and I cannot figure out why.

To be clear, here is an example of the log:

[11:23:26] DEBUG: /private/var/folders/p7/5z1vzvf979z_4229133pyj800000gn/T/pip-install-5csor64l/xgboost_f54291a5282e44bda2bc887c27db68bb/build/temp.macosx-10.14.6-arm64-3.8/xgboost/src/gbm/ Using tree method: 2

[11:23:26] INFO: /private/var/folders/p7/5z1vzvf979z_4229133pyj800000gn/T/pip-install-5csor64l/xgboost_f54291a5282e44bda2bc887c27db68bb/build/temp.macosx-10.14.6-arm64-3.8/xgboost/src/tree/ tree pruning end, 0 extra nodes, 0 pruned nodes, max_depth=0

[11:24:36] INFO: /private/var/folders/p7/5z1vzvf979z_4229133pyj800000gn/T/pip-install-5csor64l/xgboost_f54291a5282e44bda2bc887c27db68bb/build/temp.macosx-10.14.6-arm64-3.8/xgboost/src/tree/ tree pruning end, 0 extra nodes, 0 pruned nodes, max_depth=0

[12:01:21] INFO: /private/var/folders/p7/5z1vzvf979z_4229133pyj800000gn/T/pip-install-5csor64l/xgboost_f54291a5282e44bda2bc887c27db68bb/build/temp.macosx-10.14.6-arm64-3.8/xgboost/src/tree/ tree pruning end, 2 extra nodes, 0 pruned nodes, max_depth=1

and it gets stuck after this iteration. For debugging purposes, I have set max_depth=1 and n_estimators=100, and have left n_jobs unset, since I thought parallelism could be the cause of the problem. My dataset has 100k records of 18 features each, which does not seem dramatically large. I let it run for hours, but training just does not finish. I ran the same code on a cluster and it did finish there, although it then crashed with a pickle error I don't understand.

Any tip would be appreciated, thanks!

PS: to add some detail about the pickle error received on the cluster, it tells me

_pickle.PicklingError: Can't pickle <function norm_objective_wrapper.<locals>.norm_objective at 0x14661a0ce950>: it's not found as __main__.norm_objective_wrapper.<locals>.norm_objective

when I try to joblib.dump() my model. I guess it cannot have anything to do with multiprocessing (as was the case for other people), since I am not taking advantage of any parallelism (and am basically using a single CPU for debugging purposes).
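For what it's worth, the error seems to come from the objective being a function defined *inside* `norm_objective_wrapper` (a closure), which pickle cannot serialize by reference. A sketch of a workaround I am considering, assuming the wrapper only exists to capture some parameters (the `scale` attribute and the squared-error body here are hypothetical placeholders for my real setup):

```python
import pickle
import numpy as np

class NormObjective:
    """Picklable alternative to a closure returned by norm_objective_wrapper:
    store the wrapper's captured arguments as attributes on a module-level
    class instead of in a nested function's enclosing scope."""

    def __init__(self, scale=1.0):
        self.scale = scale  # placeholder for whatever the wrapper captured

    def __call__(self, y_true, y_pred):
        # Same (grad, hess) contract as a custom objective; squared error stand-in.
        grad = self.scale * (y_pred - y_true)
        hess = self.scale * np.ones_like(y_pred)
        return grad, hess

obj = NormObjective(scale=2.0)
# Unlike a nested function, an instance of a module-level class round-trips:
restored = pickle.loads(pickle.dumps(obj))
```

A callable instance like this can be passed wherever the closure was passed, and a model referencing it should then survive joblib.dump().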