I’m working on a large CPython project that mostly uses CPython 3.6 or CPython 3.7.
It also uses XGBoost 0.9 in combination with scikit-learn 0.21.3.
I’ve been getting (sometimes!) NaNs that result in a traceback from a “simple” pytest test that has a fairly deep call stack beneath it. The test feeds in random values.
I’ve put an SSCCE, sans input files (at least for now), at http://stromberg.dnsalias.org/svn/xgboost-predict-nans/trunk/ttt-sscce . I’m also pasting the same thing immediately below:
    #!/usr/bin/python3.6

    """
    An SSCCE for our NaN issue in XGBoost.

    This is with regard to Grokstream issue RM-454, the test_train transient error.

    This script fails every time, and completes quickly.

    The matter could easily be an input problem rather than an XGBoost bug.
    """

    import xgboost.sklearn
    import xgboost.core
    import numpy


    def main():
        """Replicate."""
        classifier = xgboost.sklearn.XGBClassifier()
        classifier.load_model('xgboost-sklearn-model-file')
        booster = classifier.get_booster()

        test_dmatrix = xgboost.core.DMatrix('test-dmatrix')

        class_probs = booster.predict(test_dmatrix, ntree_limit=0, validate_features=True)
        print(class_probs)
        # .all() on the array also works if predict() returns a 2-D array
        if numpy.isfinite(class_probs).all():
            print('Good, all values are finite.')
        else:
            raise SystemExit('Uh oh, one or more values are not finite.')


    main()
Is there any way of telling, without the inputs, why this is giving all NaNs?
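As a first diagnostic, it may help to see exactly how the predictions are non-finite: all NaN, only some NaN, or a mix with infinities. Here is a minimal sketch using only the standard library. `summarize_nonfinite` is a hypothetical helper name of my own, not part of XGBoost or NumPy.

```python
import math


def summarize_nonfinite(values):
    """Count finite, NaN, +inf and -inf entries in an iterable of numbers."""
    counts = {'finite': 0, 'nan': 0, 'pos_inf': 0, 'neg_inf': 0}
    for value in values:
        value = float(value)
        if math.isnan(value):
            counts['nan'] += 1
        elif math.isinf(value):
            # Distinguish the sign of the infinity.
            counts['pos_inf' if value > 0 else 'neg_inf'] += 1
        else:
            counts['finite'] += 1
    return counts
```

In the script above this could be called as `summarize_nonfinite(class_probs.ravel().tolist())`, which flattens the NumPy array into a plain list first.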
Hopefully on Tuesday my employer will be able to make a final decision on whether I can share the two input files here.
PS: I asked how to print a DMatrix in a separate thread, “Is there a way to print a DMatrix as ASCII or JSON?”
The question also arises: Is there a way of printing an XGBoost XGBModel from CPython as ASCII or JSON?
I’m hoping these two print operations will allow me (and possibly someone more familiar with the algorithms involved) to scan the two input files for bad values.
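For the model side, `Booster.get_dump()` returns one string per tree, and I believe it also accepts `dump_format='json'`. Below is a rough sketch of the kind of scan I have in mind. It assumes the text dump uses lines like `0:[f0<0.5] yes=1,no=2,missing=1` and `1:leaf=0.1`, and that a bad value would show up as `nan` or `inf` in that text. `find_nonfinite` is a hypothetical helper, not an XGBoost function.

```python
import math
import re

# Captures the number after "leaf=" and the threshold after "<" in "[feature<threshold]".
_VALUE_RE = re.compile(r'(?:leaf=|<)([^\],\s]+)')


def find_nonfinite(tree_dumps):
    """Return (tree_index, line) pairs whose leaf value or split threshold is not finite."""
    hits = []
    for tree_index, dump in enumerate(tree_dumps):
        for line in dump.splitlines():
            for token in _VALUE_RE.findall(line):
                try:
                    value = float(token)
                except ValueError:
                    continue  # token is not a number; skip it
                if not math.isfinite(value):
                    hits.append((tree_index, line.strip()))
                    break  # report each line at most once
    return hits
```

With the script above, this would be called as `find_nonfinite(classifier.get_booster().get_dump())`. An empty result would point away from the model file and toward the DMatrix, though only under the assumptions stated above.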