@hcho3 Ok.
By the way, I benchmarked prediction time for XGBClassifier and XGBRFClassifier against sklearn's RandomForestClassifier and ExtraTreesClassifier, and found that both xgboost methods take more time than both sklearn methods, even though the xgboost methods run on the GPU with a much smaller `max_depth`; all four methods use the same `n_estimators` value.
Here is the test code:
import xgboost as xgb
import numpy
from sklearn import ensemble
import time
# generate synthetic two-class data
num_samps = 2000 * 10
num_feats = 2000
num_samps_test = 4000 * 100
X_train = numpy.vstack( [numpy.random.rand( num_samps // 2, num_feats ),
                         numpy.random.rand( num_samps // 2, num_feats ) + 1] )
Y = numpy.array( [0] * (num_samps // 2) + [1] * (num_samps // 2) )
X_test = numpy.vstack( [numpy.random.rand( num_samps_test // 2, num_feats ),
                        numpy.random.rand( num_samps_test // 2, num_feats ) + 1] )
X_train = X_train.astype( numpy.float32 )
Y = Y.astype( numpy.float32 )
X_test = X_test.astype( numpy.float32 )
clf_rf = xgb.XGBRFClassifier( n_estimators = 100, max_depth = 10,
                              tree_method = 'gpu_hist', gpu_id = 0, max_bin = 16,
                              verbosity = 0 )  # also tried updater = 'grow_gpu'
print( 'training XGBRFClassifier...' )
clf_rf.fit(X_train, Y)
print( 'predicting...' )
t1 = time.time()
probs_pred = clf_rf.predict_proba( X_test )
t2 = time.time()
#print( probs_pred )
print( 'time cost: ', t2 - t1, ' secs' )
clf_rf_1 = ensemble.RandomForestClassifier( n_estimators = 100, max_depth = 100 )
print( 'training RandomForestClassifier...' )
clf_rf_1.fit(X_train, Y)
print( 'predicting...' )
t1 = time.time()
probs_pred_1 = clf_rf_1.predict_proba( X_test )
t2 = time.time()
print( 'time cost: ', t2 - t1, ' secs' )
clf_xgb = xgb.XGBClassifier( n_estimators = 100, max_depth = 10, tree_method = 'gpu_hist',
gpu_id = 0, max_bin = 16, objective = 'binary:logistic' )
print( 'training XGBClassifier...' )
clf_xgb.fit(X_train, Y)
print( 'predicting...' )
t1 = time.time()
probs_pred_2 = clf_xgb.predict_proba(X_test)
t2 = time.time()
print( 'time cost: ', t2 - t1, ' secs' )
clf_et = ensemble.ExtraTreesClassifier( n_estimators = 100, max_depth = 100 )
print( 'training ExtraTreesClassifier...' )
clf_et.fit(X_train, Y)
print( 'predicting...' )
t1 = time.time()
probs_pred_3 = clf_et.predict_proba( X_test )
t2 = time.time()
print( 'time cost: ', t2 - t1, ' secs' )
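Incidentally, the timing boilerplate repeated four times above could be factored into a small helper; a minimal sketch (`timed_predict_proba` is just an illustrative name, and it works with any classifier exposing `predict_proba`):

```python
import time

def timed_predict_proba(clf, X):
    """Run clf.predict_proba(X) once and return (probs, elapsed_seconds)."""
    t1 = time.time()
    probs = clf.predict_proba(X)
    t2 = time.time()
    return probs, t2 - t1

# usage with any fitted classifier, e.g.:
# probs, secs = timed_predict_proba(clf_rf, X_test)
# print('time cost: ', secs, ' secs')
```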
Here is the output on my machine:
training XGBRFClassifier...
predicting...
time cost: 6.31288981438 secs
training RandomForestClassifier...
predicting...
time cost: 3.91895699501 secs
training XGBClassifier...
Exception AttributeError: "'NoneType' object has no attribute 'XGBoosterFree'" in <bound method Booster.__del__ of <xgboost.core.Booster object at 0x7f6e66204fd0>> ignored
predicting...
time cost: 6.06001496315 secs
training ExtraTreesClassifier...
predicting...
time cost: 5.28508210182 secs
I'm wondering whether this result is valid, and whether there is a way to make the xgboost methods predict faster. Thanks!
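For reference, one knob I have not tried yet: some xgboost versions expose a `predictor` parameter (whether my installed version supports it is an assumption on my part), which should force prediction to run on the GPU rather than the CPU:

```python
# Assumption: the installed xgboost version supports the 'predictor' parameter;
# if so, 'gpu_predictor' makes predict()/predict_proba() run on the GPU.
clf_xgb = xgb.XGBClassifier( n_estimators = 100, max_depth = 10,
                             tree_method = 'gpu_hist', gpu_id = 0, max_bin = 16,
                             objective = 'binary:logistic',
                             predictor = 'gpu_predictor' )
```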