I am trying to implement the external-memory version of XGBoost in R.
I did manage to run the example from the third paragraph of this R-bloggers post:
In this example the train data is read as a file from disk directly in the xgb.DMatrix function like this:
dtrain <- xgb.DMatrix('agaricus.txt.train#train.cache')
The train data on disk is in LIBSVM format.
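For reference, each line of a LIBSVM file is a label followed by sparse `index:value` pairs (zero entries are omitted). The lines below are illustrative made-up values, not taken from the agaricus file:

```
1 3:1 10:1 21:0.5
0 3:1 11:1 27:2
```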
In my own set-up, the training data is a data frame in memory, processed like this:
# training fold: drop the fold column, one-hot encode the factor
trainval <- parts_inner %>% dplyr::filter(.folds != i) %>% dplyr::select(-.folds)
trainval_onehot <- model.matrix(~ maatfactor - 1, trainval)
trainval$maatfactor <- NULL
trainval <- data.matrix(trainval)
trainval <- cbind(trainval, trainval_onehot)

# matching labels
trainval_targets <- parts_inner %>% dplyr::filter(.folds != i) %>% dplyr::select(WinFlag)
trainval_targets <- data.matrix(trainval_targets)

xgb_trainval <- xgb.DMatrix(data = trainval, label = trainval_targets)
model_n <- xgb.train(
  data        = xgb_trainval,
  tree_method = "gpu_hist",
  booster     = "gbtree",
  objective   = "binary:logistic",
  eval_metric = "auc",
  watchlist   = list(train = xgb_trainval, val = xgb_val)
)
I am unsure how to convert my data into the right format. Moreover, my data contains a one-hot-encoded factor (named maatfactor).
My plan is to:
- merge trainval_targets with trainval
- convert it to LIBSVM with a converter
- save it to disk
- read it with:
xgb_trainval2 <- xgb.DMatrix('C:/my_directory/data_file#train.cache')
Then model it with:
model_n <- xgb.train(
  data        = xgb_trainval2,
  tree_method = "gpu_hist",
  booster     = "gbtree",
  objective   = "binary:logistic",
  eval_metric = "auc",
  watchlist   = ?
)
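For the "convert it to LIBSVM" step, one option is `write.matrix.csr()` from the `e1071` package, which writes SVMlight/LIBSVM text. This is an untested sketch; the file path is made up, and since the function takes the labels as a separate `y` argument, there is no need to merge `trainval_targets` into `trainval` first:

```r
# Sketch, assuming the SparseM and e1071 packages are installed;
# trainval and trainval_targets are the matrices built earlier.
library(SparseM)
library(e1071)

# convert the dense training matrix to a sparse CSR matrix
train_csr <- as.matrix.csr(trainval)

# write label + sparse features per line; fac = FALSE keeps the
# 0/1 labels numeric instead of recoding them as factor levels
write.matrix.csr(train_csr,
                 file = "C:/my_directory/train.libsvm",
                 y    = as.numeric(trainval_targets),
                 fac  = FALSE)

# read it back with a cache prefix for external-memory training
xgb_trainval2 <- xgb.DMatrix("C:/my_directory/train.libsvm#train.cache")
```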
I don’t know what I should use for watchlist.
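As far as I know, `watchlist` is just a named list of DMatrix objects, and nothing requires them to be in-memory. So one option (untested sketch; `val_file` is a hypothetical validation file written the same way as the training file) would be to give the validation fold its own file and cache prefix:

```r
# hypothetical validation DMatrix with its own cache prefix
xgb_val2 <- xgb.DMatrix("C:/my_directory/val_file#val.cache")

model_n <- xgb.train(
  data        = xgb_trainval2,
  tree_method = "gpu_hist",
  booster     = "gbtree",
  objective   = "binary:logistic",
  eval_metric = "auc",
  watchlist   = list(train = xgb_trainval2, val = xgb_val2)
)
```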
Would this be the right way to go?
Thanks a lot!