Model does not train on data in external memory

Hello,

I am running XGBoost v0.82 on a large dataset, so I would like to use the external memory functionality. I’ve converted my input data to LIBSVM format and loaded it into a DMatrix following the example here. However, the evaluation error printed at each round does not change at all, and the final ROC AUC is 0.5. When I train without the #dtrain.cache suffix, training works normally.

Any ideas what’s happening here? I have posted the relevant snippet of code below, but unfortunately cannot share the data.

Thanks,
Jennet

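import xgboost as xgb

# Hyperparameter values (max_depth, eta, ...) are set earlier in the script
params = {'objective': 'binary:logistic',
          'nthread': 4,
          'max_depth': max_depth,
          'min_child_weight': min_child_weight,
          'gamma': gamma,
          # scikit-learn wrapper name; xgb.train takes the round count as its third argument
          'n_estimators': num_rounds,
          'eta': eta,
          'subsample': subsample,
          'colsample_bytree': colsample_bytree}

# Matrices kept in EXTERNAL MEMORY (the "#<prefix>" suffix names the cache files)
dtrain = xgb.DMatrix("train.dat#dtrain.cache")
dtest = xgb.DMatrix("test.dat#dtest.cache")

# Fit the algorithm, reporting the metric on both sets after every round
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
myboost = xgb.train(params, dtrain, num_rounds, watchlist)

# Predict on the training and test sets
dtrain_predictions = myboost.predict(dtrain)
dtest_predictions = myboost.predict(dtest)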

One thing I’ve noticed is that cache files might not get cleaned up between runs.

Have you tried manually deleting those files (should have names like dtrain.cache.row.page and be under your home folder) and running again?
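If it helps, the cleanup can be scripted. This is only a rough sketch using the standard library: remove_stale_cache is a made-up helper name, and it assumes the cache files sit in a directory you pass in and are named with the cache prefix followed by a dot (as in dtrain.cache.row.page).

```python
import glob
import os


def remove_stale_cache(cache_prefix, directory="."):
    """Delete files named '<cache_prefix>.*' in `directory`; return the removed paths."""
    # Escape both parts so special characters in names are matched literally.
    pattern = os.path.join(glob.escape(directory), glob.escape(cache_prefix) + ".*")
    removed = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Calling remove_stale_cache("dtrain.cache") and remove_stale_cache("dtest.cache") before building the DMatrix objects would clear anything left over from an earlier run in the current directory.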

How big are the data? I ran a toy example with RCV1 and training happens fine. So it might be an edge case because of data size.

Hi there,

I realized that I forgot to supply the proper instance weights for my data in the external memory setup. Adding the files dtrain.dat.weight and dtest.dat.weight and then setting the parameter ‘scale_pos_weight’ seems to have solved the issue.
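For anyone hitting the same thing, here is a rough sketch of that setup, not my exact script. XGBoost's text-input convention reads instance weights from a file named after the data file with ".weight" appended, one weight per line in row order (so train.dat would pair with train.dat.weight). The helper names below are made up for illustration, the first token of each LIBSVM line is assumed to be a plain numeric label, and negatives/positives is just the commonly suggested starting value for scale_pos_weight.

```python
def read_libsvm_labels(path):
    """Return the label (first token) of every non-empty line in a LIBSVM file."""
    labels = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                labels.append(float(line.split()[0]))
    return labels


def write_weight_file(data_path, weights):
    """Write one weight per line to '<data_path>.weight' and return that path."""
    weight_path = data_path + ".weight"
    with open(weight_path, "w") as f:
        for w in weights:
            f.write(f"{w}\n")
    return weight_path


def balanced_scale_pos_weight(labels):
    """Ratio of negative to positive examples; labels > 0 count as positive."""
    positives = sum(1 for y in labels if y > 0)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples found")
    return negatives / positives
```

The weights would be written before constructing the DMatrix, and the ratio would go into the parameter dictionary, e.g. params['scale_pos_weight'] = balanced_scale_pos_weight(read_libsvm_labels("train.dat")).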

Sorry for the noise!