Hello,

My data set is too big to fit in RAM, so I have to use XGBoost's external-memory mode. This means I either have to use a DMatrix (which I do currently) or write my own data provider.

The problem is that I'd like to optimize hyperparameters, preferably using sklearn's API. Unfortunately, XGBClassifier expects X and y arrays, not a DMatrix.

Has anyone solved this issue?

Here's an MWE with a tiny, fake data set:

```
import numpy as np
import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':
    clf = xgb.XGBClassifier()
    clf = GridSearchCV(
        clf,
        {
            'n_estimators': [2, 3],
            'max_depth': [1, 2]
        },
        cv=2
    )

    # Two features: one continuous, one integer-valued in {0, 1, 2}.
    # (np.random.random_integers is deprecated; np.random.randint's upper
    # bound is exclusive, hence the 3 and 2 below.)
    x = np.concatenate(
        [np.random.uniform(-1, 1, 100).reshape(-1, 1),
         np.random.randint(0, 3, 100).reshape(-1, 1)],
        axis=1
    )
    y = np.random.randint(0, 2, 100)

    # NUMPY ARRAY: this works fine with the sklearn API.
    model = clf.fit(x, y)
    test_x = np.array([[-.32, 2], [.44, 1], [.96, 2], [-.01, 1], [.45, 0]])
    test_y = np.array([0, 0, 1, 1, 1])
    test_predicted = model.predict(test_x)
    cm = confusion_matrix(test_y, test_predicted)

    # DMATRIX: how do I feed this into the GridSearchCV above?
    dmatrix = xgb.DMatrix(x, label=y)
    # clf.???
```