External memory and hyperparameters

Hello,

My data set is too big to fit in RAM, so I have to use external memory. This implies I either have to use a DMatrix (which I currently do) or write my own data provider.

The problem is that I'd like to optimize hyperparameters, preferably using sklearn's API. Unfortunately, XGBClassifier expects X and y arrays, not a DMatrix.

Has anyone solved this issue?

Here's an MWE with a tiny, fake data set:

import numpy as np
import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':
    clf = xgb.XGBClassifier()
    clf = GridSearchCV(
        clf,
        {
            'n_estimators': [2, 3],
            'max_depth': [1, 2]
        },
        cv=2
    )

    x = np.concatenate(
        [np.random.uniform(-1, 1, 100).reshape(-1, 1), np.random.randint(0, 3, 100).reshape(-1, 1)],
        axis=1
    )
    y = np.random.randint(0, 2, 100)

    # NUMPY ARRAY
    model = clf.fit(x, y)

    test_x = np.array([[-.32, 2], [.44, 1], [.96, 2], [-.01, 1], [.45, 0]])
    test_y = np.array([0, 0, 1, 1, 1])
    test_predicted = model.predict(test_x)

    cm = confusion_matrix(test_y, test_predicted)

    # DMATRIX
    dmatrix = xgb.DMatrix(x, label=y)
    # clf.???

I don’t think it’s currently possible using the sklearn API.

Could you use the xgboost.cv method? Link to docs: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training
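
For example, a rough sketch reusing the dmatrix from your MWE (the parameters and metric are arbitrary):

import xgboost as xgb

# xgb.cv consumes a DMatrix directly, so an external-memory-backed one works too.
params = {'max_depth': 2, 'objective': 'binary:logistic'}
results = xgb.cv(params, dmatrix, num_boost_round=3, nfold=2, metrics='logloss')
print(results)  # per-round mean/std of train and test log-loss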

I thought about using sklearn.model_selection.RandomizedSearchCV or sklearn.model_selection.GridSearchCV, passing xgb.XGBClassifier as the model, and then fitting it with an xgb.DMatrix, which I'm afraid is impossible. At least that was the approach I found intuitive.

Do you suggest just calling the xgboost.cv method multiple times with different model parameters?
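
Something like this hypothetical loop, with dtrain standing in for my external-memory DMatrix:

from itertools import product

import xgboost as xgb

best_score, best_params = float('inf'), None
for max_depth, n_rounds in product([1, 2], [2, 3]):
    params = {'max_depth': max_depth, 'objective': 'binary:logistic'}
    res = xgb.cv(params, dtrain, num_boost_round=n_rounds, nfold=2, metrics='logloss')
    score = res['test-logloss-mean'].iloc[-1]  # mean test log-loss after the last round
    if score < best_score:
        best_score, best_params = score, (max_depth, n_rounds)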

Hmm, good point. On its own, xgb.cv still won't do what you want.

What if you wrap the XGBoost learner and override the fit function? Something like this (untested code):

import xgboost as xgb
from sklearn.base import BaseEstimator

class XGBSKWrapper(BaseEstimator):
    def __init__(self, max_depth=3, n_estimators=10):
        # Hyperparameters live as attributes so the search can clone the wrapper.
        self.max_depth = max_depth
        self.n_estimators = n_estimators
        self.booster = None

    def fit(self, X, y):
        dtrain = xgb.DMatrix(X, label=y)
        params = {'max_depth': self.max_depth, 'objective': 'binary:logistic'}
        self.booster = xgb.train(params, dtrain, num_boost_round=self.n_estimators)
        return self

    def score(self, X, y):
        # Accuracy of thresholded probability predictions.
        preds = self.booster.predict(xgb.DMatrix(X))
        return float(((preds > 0.5) == y).mean())
Then you pass that object to the CV function. AFAIK the CV only needs the estimator to provide fit and score, plus get_params/set_params so it can be cloned; inheriting from sklearn.base.BaseEstimator covers those.
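
Hypothetical usage, reusing x and y from your MWE:

from sklearn.model_selection import GridSearchCV

# The wrapper exposes its hyperparameters through __init__, so the search
# can clone it and vary them per candidate.
search = GridSearchCV(
    XGBSKWrapper(),
    {'max_depth': [1, 2], 'n_estimators': [2, 3]},
    cv=2
)
search.fit(x, y)
print(search.best_params_)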

See sklearn's docs as well: User Guide and GridSearchCV

Edit: this will require loading X and y into memory, so I don't know if it will help your case. Maybe numpy.memmap could help, but my guess is something would break.
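
The memmap idea would look roughly like this (file names, dtype, and shape are made up; whether building the DMatrix then copies everything into RAM anyway is exactly the part I'd expect to break):

import numpy as np

# Memory-map large arrays from disk instead of loading them into RAM.
X = np.memmap('features.dat', dtype=np.float32, mode='r', shape=(10000000, 100))
y = np.memmap('labels.dat', dtype=np.float32, mode='r', shape=(10000000,))

# In principle these could go straight into the wrapper's fit(X, y).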

That wouldn't work for me, unfortunately. Thanks for your effort, though.

I'm going to use hyperopt along with xgb.cv.
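
Roughly along these lines (an untested sketch; the libsvm path with the '#dtrain.cache' suffix is a placeholder for my external-memory data, and the search space and budget are made up):

import xgboost as xgb
from hyperopt import fmin, hp, tpe

# External-memory DMatrix: the '#dtrain.cache' suffix tells XGBoost to page
# the data through an on-disk cache instead of holding it all in RAM.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')

def objective(params):
    # hyperopt samples floats, so cast the integer parameters back.
    xgb_params = {
        'max_depth': int(params['max_depth']),
        'eta': params['eta'],
        'objective': 'binary:logistic'
    }
    res = xgb.cv(xgb_params, dtrain, num_boost_round=50, nfold=3, metrics='logloss')
    return res['test-logloss-mean'].iloc[-1]  # loss for hyperopt to minimize

space = {
    'max_depth': hp.quniform('max_depth', 1, 8, 1),
    'eta': hp.loguniform('eta', -5, 0)
}

best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)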