Still getting unexplained NaN's, new replication code


#1

Hi again.

I have another SSCCE related to XGBoost. Below are two ways of obtaining it:

  1. Just browse to https://pastebin.com/fsKctYtb and follow the link there.

  2. Browse to https://pastebin.com/gkL2eKXF - to extract:

  • Download the data at the pastebin URL
  • echo >> 5526giJW.txt
  • uudecode 5526giJW.txt
  • tar tvJf RM-454-SSCCE-5.tar.xz
  • tar xvJf RM-454-SSCCE-5.tar.xz

BTW, I tried to post a link to my website with this stuff, but the forum told me I was a spammer and to go away :slight_smile:

Unfortunately, this SSCCE is not entirely deterministic. Most of the time it works fine, but some smallish percentage of the time it produces NaNs that prevent it from completing correctly, despite being given the same inputs each time at the CPython level.

Consequently, the SSCCE runs its replicate() function a number of times and keeps track of what percentage of the runs completed successfully. Spoiler: I don’t think it ever reaches 100% unless you request a pretty small number of runs. 30 or 100 runs are generally enough to produce some bad ones; the script is currently hardcoded to use 30.
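The run-and-count loop described above can be sketched as follows; `replicate` here is a stand-in for the SSCCE's actual replicate() function (I don't have its body in front of me), returning True when a run finishes without NaNs:

```python
def replicate() -> bool:
    """Placeholder for the SSCCE's replicate(): train and predict,
    returning True when no NaNs appeared."""
    return True  # the real function sometimes returns False


def success_rate(num_runs: int = 30) -> float:
    """Run replicate() num_runs times and return the percentage of
    runs that completed successfully."""
    good = sum(1 for _ in range(num_runs) if replicate())
    return 100.0 * good / num_runs


rate = success_rate(30)  # the script hardcodes 30 runs
```

With the real replicate(), this rate comes out below 100% whenever the NaN bug strikes.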

Any suggestions? Are we feeding it something bogus? And is there a way of holding the random number generator seeds constant at the NumPy and/or C++ level, like I’ve already done at the CPython level?
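For the NumPy level, seeding looks like this (a minimal sketch; on the XGBoost side, my understanding is that the `random_state` constructor argument, or the legacy `seed` alias, is the knob that feeds the C++ engine's RNG, but I'd welcome a correction):

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Seed the CPython and NumPy RNGs.

    For XGBoost itself, pass the seed to the estimator, e.g.
    XGBClassifier(random_state=seed); there is no separate
    documented knob for the C++ layer beyond that.
    """
    random.seed(seed)
    np.random.seed(seed)


# Demonstrate that the NumPy stream is reproducible after reseeding.
seed_everything(42)
first = np.random.rand(3)
seed_everything(42)
second = np.random.rand(3)
assert np.array_equal(first, second)  # same seed, same stream
```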

Thanks!

PS: It fails about 17% of the time.


#2

Is there some way I can run the same test without XGBoost’s async and/or parallel features? It might be informative to see whether that helps.


#3

@dan-stromberg FYI, I am currently working on the 1.0.0 release: https://github.com/dmlc/xgboost/issues/5253. I will see if I can reproduce the problem with the Release Candidate.


#4

@dan-stromberg I managed to reproduce the problem with the Release Candidate.

Curiously, the bug disappears when I add the training hyperparameter tree_method='approx'.


#5

Update: I found the root cause. XGBoost’s prediction function is producing a bunch of NaNs. Here is the line that produces the NaNs:

Curiously, this problem only occurs when the hyperparameter tree_method is set to either exact or hist.


#6

That’s great news :slight_smile:


#7

Finally: Tree 30 is a stump consisting of a single leaf node, and that leaf node produces a NaN score.

At the very least, we should insert NaN checks in a few places so that we can catch the problem earlier.

So instead of the cryptic error message

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

we should see the error message

Something went wrong with training. A leaf node (node id 0) of Tree 30 outputs NaN.
**********************************************************************************
Full model dump of Tree 30
----------------------------------------------------------------------------------
0:leaf=-nan
**********************************************************************************

And hopefully, we can figure out why leaf values are getting set to NaN.

I will prepare a pull request to add NaN checks.
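In the meantime, a user-side version of such a check can be run against the model's text dump; the scan below looks for NaN leaf values in the kind of dump text quoted above (with a real model, the `dumps` list would come from `clf.get_booster().get_dump()`):

```python
def find_nan_trees(dumps):
    """Return the indices of trees whose text dump contains a NaN
    leaf value (rendered as 'leaf=nan' or 'leaf=-nan')."""
    return [
        i
        for i, tree in enumerate(dumps)
        if "leaf=nan" in tree or "leaf=-nan" in tree
    ]


# Example using dump text shaped like the Tree 30 stump above:
# 30 healthy stumps followed by one NaN stump.
dumps = ["0:leaf=0.1\n"] * 30 + ["0:leaf=-nan\n"]
bad = find_nan_trees(dumps)  # → [30]
```

Catching the NaN right after training, rather than at prediction time, points directly at the offending tree.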


#8

https://github.com/dmlc/xgboost/pull/5258 kills two birds with one stone: a better error message about NaNs, and a fix for a particular degenerate case that led to the presence of NaNs. I appreciate you taking the time to report the bug.


#9

@dan-stromberg It appears that the bug is a little deeper than I thought. For now, you can work around the bug by setting a strictly non-zero value for the hyperparameter reg_lambda.


#10

Thanks!

(I was out sick yesterday)


#11

Actually, with:

estimator = XGBClassifier(
    base_score=0.5, booster='gbtree', colsample_bylevel=1,
    colsample_bynode=1, colsample_bytree=1, gamma=0,
    learning_rate=0.1, max_delta_step=0, max_depth=3,
    min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
    nthread=1, objective='binary:logistic', random_state=42,
    reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
    silent=True, subsample=1, verbosity=1,
)

(reg_lambda set to 1)
…I still get NaNs sometimes.

But setting tree_method='approx' appears to be an effective workaround: with it, I got 1000 good runs out of 1000.
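For anyone landing here later, the combination that worked for me can be summarized as a params dict (shown standalone so the snippet is self-contained; in practice these go straight into XGBClassifier(**params) alongside the other hyperparameters listed above):

```python
# Workaround parameters: tree_method='approx' avoided the NaN runs
# entirely in 1000/1000 trials; reg_lambda=1 keeps L2 regularization
# strictly non-zero per the earlier suggestion, though on its own it
# was not sufficient.
params = {
    "tree_method": "approx",
    "reg_lambda": 1,
    "random_state": 42,
    "n_estimators": 100,
    "max_depth": 3,
}
```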