How to implement the external memory version of XGBoost in R


I am trying to implement the external memory version of XGBoost in R.

I succeeded in running the example from r-bloggers, in the third paragraph of this post:

In this example the training data is read directly from a file on disk by the xgb.DMatrix function, like this:
dtrain = xgb.DMatrix('agaricus.txt.train#train.cache')

The training data on disk is in LIBSVM format.

In my own setup the training data is a data frame in memory, which I process like this:

  trainval <- parts_inner %>%
    dplyr::filter(.folds != i)
  # (column selection step omitted here)
  trainval_onehot <- model.matrix(~ maatfactor - 1, trainval)
  trainval$maatfactor <- NULL
  trainval <- data.matrix(trainval)
  trainval <- cbind(trainval, trainval_onehot)
  trainval_targets <- parts_inner %>%
    dplyr::filter(.folds != i)
  # (selection of the target column omitted here)
  trainval_targets <- data.matrix(trainval_targets)
  xgb_trainval <- xgb.DMatrix(data = trainval, label = trainval_targets)
  # xgb_val is built from the held-out fold in the same way (not shown)
  model_n <- xgb.train(data = xgb_trainval,
                       tree_method = "gpu_hist",
                       booster = "gbtree",
                       objective = "binary:logistic",
                       eval_metric = "auc",
                       watchlist = list(train = xgb_trainval, val = xgb_val)
                       # remaining arguments (such as nrounds) not shown
  )

I am unsure how to convert my data to the right format.
In addition, my data contains a one-hot encoded factor (named maatfactor).

My plan is to:

  1. merge trainval_targets with trainval
  2. convert the result to LIBSVM format with a converter
  3. save it to disk
  4. read it back with:

xgb_trainval2 = xgb.DMatrix('C:/my_directory/data_file#train.cache')
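
For illustration, steps 1 to 3 could be sketched in base R roughly as follows. This is only a minimal sketch under several assumptions: to_libsvm_lines is a hypothetical helper written for this post (it is not part of xgboost), the toy matrix x and labels y stand in for trainval and trainval_targets, and feature indices are written zero-based (the index_base argument is there in case one-based indices are needed). Because the one-hot columns are plain 0/1 numeric columns after the cbind, they are written like any other feature. The final xgb.DMatrix call is left commented out because it needs the xgboost package; depending on the xgboost version, the file name may also need a format hint such as ?format=libsvm before the #train.cache suffix.

```r
# Hypothetical helper (not part of xgboost): format a dense numeric matrix and
# a label vector as LIBSVM lines of the form "label index:value index:value ...".
to_libsvm_lines <- function(x, y, index_base = 0L) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) == length(y))
  vapply(seq_len(nrow(x)), function(i) {
    row  <- x[i, ]
    keep <- which(!is.na(row) & row != 0)      # LIBSVM stores only non-zero entries
    idx  <- keep - 1L + index_base             # shift R's 1-based column positions
    pairs <- sprintf("%d:%s", idx, as.character(row[keep]))
    paste(c(y[i], pairs), collapse = " ")
  }, character(1))
}

# Toy data standing in for trainval (features) and trainval_targets (labels).
x <- matrix(c(1, 0, 2.5,
              0, 3, 0), nrow = 2, byrow = TRUE)
y <- c(1, 0)

lines <- to_libsvm_lines(x, y)
# lines[1] is "1 0:1 2:2.5" and lines[2] is "0 1:3"

# Step 3: save to disk (a temporary file here; any writable path would do).
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Step 4 (requires xgboost, so it is not run in this sketch):
# xgb_trainval2 <- xgb.DMatrix(paste0(path, "#train.cache"))
```

For a large data frame the same helper could be applied to chunks of rows and the resulting lines appended to one file, so that the full text representation never has to sit in memory at once.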

Then train the model with:

  model_n <- xgb.train(data = xgb_trainval2,
                       tree_method = "gpu_hist",
                       booster = "gbtree",
                       objective = "binary:logistic",
                       eval_metric = "auc",
                       watchlist = ?

I don’t know what I should use for watchlist.

Would this be the right way to go?

Thanks a lot!