Still getting unexplained NaN's, new replication code


#1

Hi again.

I have another SSCCE related to XGBoost. Below are two ways of obtaining it:

  1. Just browse to https://pastebin.com/fsKctYtb and follow the link there.

  2. Browse to https://pastebin.com/gkL2eKXF - to extract:

  • Download the data at the pastebin URL
  • echo >> 5526giJW.txt
  • uudecode 5526giJW.txt
  • tar tvJf RM-454-SSCCE-5.tar.xz
  • tar xvJf RM-454-SSCCE-5.tar.xz

BTW, I tried to post a link to my website with this stuff, but the forum told me I was a spammer and to go away :slight_smile:

Unfortunately, this SSCCE is not entirely deterministic. Most of the time it works fine, but some smallish percentage of the time it produces NaNs that prevent it from completing correctly, despite being given the same inputs each time at the CPython level.

Consequently, the SSCCE runs its replicate() function a number of times and keeps track of what percentage of the runs completed successfully. Spoiler: I don’t think it ever reaches 100% unless you request a pretty small number of runs. 30 or 100 runs are generally enough to produce some bad ones; the script is currently hardcoded to use 30.
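The run-and-count loop described above can be sketched as follows; `replicate` here is a stand-in for the SSCCE's actual replicate() function (I don't have its body in front of me), returning True when a run finishes without NaNs:

```python
def replicate() -> bool:
    """Placeholder for the SSCCE's replicate(): train and predict,
    returning True when no NaNs appeared."""
    return True  # the real function sometimes returns False


def success_rate(num_runs: int = 30) -> float:
    """Run replicate() num_runs times and return the percentage of
    runs that completed successfully."""
    good = sum(1 for _ in range(num_runs) if replicate())
    return 100.0 * good / num_runs


rate = success_rate(30)  # the script hardcodes 30 runs
```

With the real replicate(), this rate comes out below 100% whenever the NaN bug strikes.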

Any suggestions? Are we feeding it something bogus? And is there a way of holding the random number generator seeds constant at the NumPy and/or C++ level, like I’ve already done at the CPython level?
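For the NumPy level, seeding looks like this (a minimal sketch; on the XGBoost side, my understanding is that the `random_state` constructor argument, or the legacy `seed` alias, is the knob that feeds the C++ engine's RNG, but I'd welcome a correction):

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Seed the CPython and NumPy RNGs.

    For XGBoost itself, pass the seed to the estimator, e.g.
    XGBClassifier(random_state=seed); there is no separate
    documented knob for the C++ layer beyond that.
    """
    random.seed(seed)
    np.random.seed(seed)


# Demonstrate that the NumPy stream is reproducible after reseeding.
seed_everything(42)
first = np.random.rand(3)
seed_everything(42)
second = np.random.rand(3)
assert np.array_equal(first, second)  # same seed, same stream
```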

Thanks!

PS: It fails about 17% of the time.


#2

Is there some way I can run the same test without XGBoost’s async and/or parallel features? It might be informative to see whether that helps.


#3

@dan-stromberg FYI, I am currently working on the 1.0.0 release: https://github.com/dmlc/xgboost/issues/5253. I will see if I can reproduce the problem with the Release Candidate.


#4

@dan-stromberg I managed to reproduce the problem with the Release Candidate.

Curiously, the bug disappears when I add the training hyperparameter tree_method='approx'.


#5

Update: I found the root cause. XGBoost’s prediction function is producing a bunch of NaNs. Here is the line that produces the NaNs:

Curiously, this problem only occurs when the hyperparameter tree_method is set to either exact or hist.


#6

That’s great news :slight_smile:


#7

Finally: Tree 30 is a stump consisting of a single leaf node, and that leaf node produces a NaN score.

At the very least, we should insert NaN checks in a few places so that we can catch the problem earlier.

So instead of the cryptic error message

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

we should see the error message

Something went wrong with training. A leaf node (node id 0) of Tree 30 outputs NaN.
**********************************************************************************
Full model dump of Tree 30
----------------------------------------------------------------------------------
0:leaf=-nan
**********************************************************************************

And hopefully, we can figure out why leaf values are getting set to NaN.

I will prepare a pull request to add NaN checks.
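In the meantime, a user-side version of such a check can be run against the model's text dump; the scan below looks for NaN leaf values in the kind of dump text quoted above (with a real model, the `dumps` list would come from `clf.get_booster().get_dump()`):

```python
def find_nan_trees(dumps):
    """Return the indices of trees whose text dump contains a NaN
    leaf value (rendered as 'leaf=nan' or 'leaf=-nan')."""
    return [
        i
        for i, tree in enumerate(dumps)
        if "leaf=nan" in tree or "leaf=-nan" in tree
    ]


# Example using dump text shaped like the Tree 30 stump above:
# 30 healthy stumps followed by one NaN stump.
dumps = ["0:leaf=0.1\n"] * 30 + ["0:leaf=-nan\n"]
bad = find_nan_trees(dumps)  # → [30]
```

Catching the NaN right after training, rather than at prediction time, points directly at the offending tree.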


#8

https://github.com/dmlc/xgboost/pull/5258 kills two birds with one stone: a better error message about NaNs, and a fix for a particular degenerate case that led to the presence of NaNs. I appreciate you taking the time to report the bug.


#9

@dan-stromberg It appears that the bug is a little deeper than I thought. For now, you can work around the bug by setting a strictly non-zero value for the hyperparameter reg_lambda.


#10

Thanks!

(I was out sick yesterday)


#11

Actually, with:

estimator = XGBClassifier(
    base_score=0.5, booster='gbtree', colsample_bylevel=1,
    colsample_bynode=1, colsample_bytree=1, gamma=0,
    learning_rate=0.1, max_delta_step=0, max_depth=3,
    min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
    nthread=1, objective='binary:logistic', random_state=42,
    reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
    silent=True, subsample=1, verbosity=1,
)

(reg_lambda set to 1)
…I still get NaNs sometimes.

But setting tree_method='approx' appears to be an effective workaround: with it, I got 1000 good runs out of 1000.
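For anyone landing here later, the combination that worked for me can be summarized as a params dict (shown standalone so the snippet is self-contained; in practice these go straight into XGBClassifier(**params) alongside the other hyperparameters listed above):

```python
# Workaround parameters: tree_method='approx' avoided the NaN runs
# entirely in 1000/1000 trials; reg_lambda=1 keeps L2 regularization
# strictly non-zero per the earlier suggestion, though on its own it
# was not sufficient.
params = {
    "tree_method": "approx",
    "reg_lambda": 1,
    "random_state": 42,
    "n_estimators": 100,
    "max_depth": 3,
}
```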