NaN values with early stopping

I have a dataset with a small number of rows (~10) and a large number of features (~100). I use early stopping and CV and keep getting the following error. The dataset does not contain NaN values. I have tried tweaking many parameters, but I still keep getting this error. Any idea how to resolve this?

0%| | 0/100 [22:09<?, ?trial/s, best loss=?]/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/cmd.py", line 214, in onecmd
func = getattr(self, 'do_' + cmd)
AttributeError: 'Pdb' object has no attribute 'do_score'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/core.py", line 436, in inner_f
return f(**kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/sklearn.py", line 1187, in fit
callbacks=callbacks,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/training.py", line 197, in train
early_stopping_rounds=early_stopping_rounds)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/training.py", line 76, in _train_internal
bst = callbacks.before_training(bst)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/callback.py", line 376, in before_training
model = c.before_training(model=model)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/callback.py", line 515, in before_training
self.starting_round = model.num_boosted_rounds()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/core.py", line 2007, in num_boosted_rounds
_check_call(_LIB.XGBoosterBoostedRounds(self.handle, ctypes.byref(rounds)))
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/core.py", line 210, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [14:53:47] ../src/metric/metric.cc:49: Unknown metric function l
Stack trace:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x9133f) [0x7f6014a5f33f]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1fcd0f) [0x7f6014bcad0f]
[bt] (2) /home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1d2378) [0x7f6014ba0378]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(XGBoosterBoostedRounds+0x1a) [0x7f6014a4d39a]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f604d1f79dd]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f604d1f7067]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2f4) [0x7f604bc86794]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x10ff8) [0x7f604bc86ff8]
[bt] (8) python(_PyObject_FastCallKeywords+0x48b) [0x56210a01ea5b]

FitFailedWarning)

../src/metric/metric.cc:49: Unknown metric function l

Can you check the hyperparameters? It appears that you are trying to use a metric that does not exist in XGBoost.

For whatever reason, the errors don't make much sense, because the same code with the same hyperparameters works for many other datasets. My snippet looks like the one below. I do have a very small dataset with a lot of features, but I don't know why the scores would all be NaN. I've played with different parameters but end up with a similar failure.

import xgboost as xgb
from sklearn.model_selection import cross_val_score

# trainer_params = {'learning_rate': '0.040', 'n_estimators': 110, 'max_depth': 7, 'colsample_bytree': '0.700',  'subsample': '0.700', 'min_child_weight': '3.000', 'gamma': '3.000', 'reg_lambda': '10.000', 'reg_alpha': '4.000', 'tree_method': 'hist'}
# fit_params=  {'early_stopping_rounds': 1, 'eval_metric': 'logloss', 'verbose': 3, 'eval_set': [[array([[0., 0., 0., ..., 0., 0., 0.],...SNIPPED..
# eval_metric = "logloss"

# cross_val_scoring is defined elsewhere in my code
clf = xgb.XGBClassifier(nthread=1, use_label_encoder=False, **trainer_params)

# This fails
score = cross_val_score(clf, x_train, y_train, cv=2, verbose=3, scoring=cross_val_scoring, fit_params=fit_params)

# This works (using the same dataset for early stopping and training)
score = cross_val_score(clf, x_valid, y_valid, cv=2, verbose=3, scoring=cross_val_scoring, fit_params=fit_params)

print(score)

Here's a minimal repro with an extreme number of features; a sketch of it is below.
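(The original repro attachment isn't preserved here; the following is a sketch of what it plausibly looked like, assuming a tiny random dataset. The row count of 7 is deliberate: it equals len('logloss'), which appears to be what triggers the fit-param splitting discussed further down.)

import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(7, 1000)                    # 7 rows == len('logloss'), huge feature count
y = np.array([0, 1, 0, 1, 0, 1, 0])      # balanced labels so cv=2 can stratify

clf = xgb.XGBClassifier(nthread=1, use_label_encoder=False, tree_method='hist')
fit_params = {
    'early_stopping_rounds': 1,
    'eval_metric': 'logloss',
    'eval_set': [(X, y)],
}
# Fails with "Unknown metric function l" because scikit-learn slices the
# metric string per fold (see the explanation below).
print(cross_val_score(clf, X, y, cv=2, scoring='neg_log_loss', fit_params=fit_params))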

Appreciate any pointers!

Related: https://github.com/dmlc/xgboost/issues/6735

Scikit-learn somehow treats "mlogloss" as indexable data and splits it up.
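A quick way to see the splitting (this uses scikit-learn's private _safe_indexing helper, so it is version-dependent and only for illustration):

from sklearn.utils import _safe_indexing

# A string has __len__, so it passes scikit-learn's array-like check.
# When len('logloss') happens to equal the number of rows in X, the
# fit-param validation slices it per CV fold like per-sample data:
train_idx = [0, 2, 3, 5]
print(_safe_indexing('logloss', train_idx))  # ['l', 'g', 'l', 's']
# XGBoost then receives a list of single characters as eval_metric,
# hence "Unknown metric function l".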

Thanks for looking into it. Are there any best practices for dealing with a small number of training rows compared to the number of features? I wonder if there are any XGBoost-specific tricks to reduce model complexity.

You can look at the SHAP values or the global feature importance from the trained model, select the features that are important, and then train the model again with the unimportant features removed, or with column sampling and feature weights. I think there are many techniques and some literature around using tree models for feature selection. Feel free to post your discoveries.
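For example, here is a minimal sketch of importance-based selection (the random data, top_k, and hyperparameters are placeholders, not recommendations):

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(50, 100)                    # few rows, many features
y = rng.randint(0, 2, size=50)

# First pass: fit on everything to get global feature importances.
clf = xgb.XGBClassifier(nthread=1, use_label_encoder=False, tree_method='hist')
clf.fit(X, y)

# Keep the top_k most important features (top_k is an arbitrary choice).
top_k = 10
keep = np.argsort(clf.feature_importances_)[::-1][:top_k]

# Second pass: retrain on the reduced feature set, with column sampling
# as an extra regularizer.
clf_small = xgb.XGBClassifier(nthread=1, use_label_encoder=False,
                              colsample_bytree=0.7, tree_method='hist')
clf_small.fit(X[:, keep], y)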