I encountered a problem when training an xgboost model incrementally. If the incremental dataset contains samples from every class, incremental training works fine. However, if the dataset is missing samples from even one class, the training fails.
Minimal reproducible example (python, xgboost version=1.4.2):
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split as ttsplit
import xgboost as xgb

X = load_iris()['data']
y = load_iris()['target']

# split data into training and testing sets,
# then split the training set in half for the base model and the incremental model
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(
    X_train, y_train, test_size=0.5, random_state=0)

clf = xgb.XGBClassifier(use_label_encoder=False)
clf.fit(X_train_1, y_train_1)

# artificially remove one class from the labels to showcase the behavior
y_train_2[y_train_2 == 0] = 1

clf2 = xgb.XGBClassifier(use_label_encoder=False)
clf2.fit(X_train_2, y_train_2, xgb_model=clf)
```
ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class - 1].
This error is expected when training a fresh model from scratch, but requiring the incremental dataset to contain samples from every class is inconvenient.
I have read that, at least in the past, `.fit()` silently ignored the `xgb_model` argument (https://github.com/dmlc/xgboost/issues/3297#issuecomment-423054809). I have also read posts suggesting that xgboost's "incremental learning" is not incremental learning in the usual sense (https://github.com/dmlc/xgboost/issues/3055).
- What’s the verdict: can I incrementally train a model on a newly acquired, smaller dataset (the goal being that the model is slightly adjusted and gives better results after incremental training)?
- How can I avoid the error when my incremental dataset does not contain samples from every category?
- Is there documentation on incremental learning with the different xgboost APIs?