XGBoost regression gives very different predictions with little extra training data

I am running an XGBRegressor model to predict TV viewership based on the past 3-4 years of viewing behavior.

I run the model every day - so every day I add training examples as new data comes in and then retrain the model.

I have noticed that the predictions change quite significantly day over day when I am predicting on the same dataset (sometimes as high as +/- 20-30%). This seems a bit odd since I am only adding one day of data (so the equivalent of changing ~0.1% of the entire training set).

I understand trees are local models and inherently unstable, but is there any way to make the XGBoost regression model more stable/robust to small changes in the training data?

Do you retrain your model on (old data) + (new data)?

Check if your model is overfitting. One good way is to run K-fold cross validation or Leave-One-Out cross validation.
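As a minimal sketch of that check, here is a K-fold comparison of training score versus cross-validated score. The data is synthetic, and scikit-learn's GradientBoostingRegressor stands in for XGBRegressor (they expose the same fit/predict interface, so you can swap in your own model and data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: 500 rows, 5 features, mostly-linear target with small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(n_estimators=100, random_state=0)
cv_scores = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
train_score = model.fit(X, y).score(X, y)

# A large gap between the training score and the CV mean suggests overfitting.
print(f"train R^2: {train_score:.3f}, "
      f"CV R^2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

For time-ordered data like daily viewership, `sklearn.model_selection.TimeSeriesSplit` is usually a better `cv` choice than shuffled K-fold, since it never trains on the future.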

Hey thanks for the answer, the training schema is approximately as follows:

Model 1 - trained on 1000 days of data (around 60k datapoints)
Model 2 - trained on 1001 days of data (1000 original days + 1 new day)

Model 1 and 2 have the same params.

But the MAPE-style percentage difference between the predictions of model 1 and model 2 (on the same dataset) can be as high as 30%.

I tuned the parameters using time-based splits, so I don’t think the model overfits too much. I did try decreasing the number of trees, decreasing the learning rate, and increasing the subsampling rate, but these actually led to an increase in the percentage difference, probably because the models did not converge.
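The day-over-day comparison described above can be sketched as follows: train one model on the "old" rows, a second on old + one extra day, and measure the per-row percentage difference on a fixed holdout set. Everything here is synthetic and scikit-learn's GradientBoostingRegressor stands in for XGBRegressor; it is a template for the check, not the original pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
n_old, n_new = 1000, 60          # hypothetical: 1000 days of rows + one new day
X = rng.normal(size=(n_old + n_new, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(scale=0.2, size=n_old + n_new)
X_holdout = rng.normal(size=(200, 4))  # fixed set we predict on both days

# Identical hyperparameters for both models, as in the setup above.
params = dict(n_estimators=200, learning_rate=0.1, random_state=0)
m1 = GradientBoostingRegressor(**params).fit(X[:n_old], y[:n_old])
m2 = GradientBoostingRegressor(**params).fit(X, y)  # old data + one new day

p1, p2 = m1.predict(X_holdout), m2.predict(X_holdout)
# MAPE-style gap between the two models' predictions, row by row.
pct_diff = np.abs(p2 - p1) / np.maximum(np.abs(p1), 1e-9)
print(f"mean diff: {pct_diff.mean():.1%}, max diff: {pct_diff.max():.1%}")
```

Looking at the maximum as well as the mean is useful: a small mean gap can hide a handful of rows that swing by 20-30%.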

Does your data have a “time” component?

The theory behind XGBoost assumes the training data are i.i.d. (independent and identically distributed). If your data are not i.i.d. (say, because of autocorrelation), XGBoost may not produce optimal results. You should inspect your data and check whether this assumption holds.

According to https://www.quora.com/What-will-happen-for-the-machine-learning-algorithms-if-the-data-samples-are-not-i-i-d:

for non-independent samples scenario, most predictive models would have high variance … the performance of predictive model would highly depend on the training data chosen, such that the performance would change significantly on changing the training data.

Also see this helpful discussion: https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/38352