XGBoost time-series predictions question


#1

I have created a model in Python, but I don’t understand how to use it for predictions. For example, FB Prophet lets you set the number of steps to predict. Could you please tell me what code I should run to predict 5 steps ahead with XGBoost?

I have built and evaluated a model; I just need to understand how to use it.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
plt.style.use('fivethirtyeight')

# Raw string so the backslashes in the Windows path are not read as escape sequences
dfs = pd.read_csv(r'F:\TDG\Analysts\Ops Analyst\Files\885 OCtober 2016+ Daily.csv', index_col=[0], parse_dates=[0])

split_date = '1/1/2018'
dfs_train = dfs.loc[dfs.index <= split_date].copy()
dfs_test = dfs.loc[dfs.index > split_date].copy()

_ = (
    dfs_test
    .rename(columns={'y': 'TEST SET'})
    .join(dfs_train.rename(columns={'y': 'TRAINING SET'}), how='outer')
    .plot(figsize=(15, 5), title='data', style='.')
)

def create_features(df, label=None):
    """
    Creates time series features from datetime index
    """
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)

    X = df[['hour', 'dayofweek', 'quarter', 'month', 'year',
            'dayofyear', 'dayofmonth', 'weekofyear']]
    if label:
        y = df[label]
        return X, y
    return X

X_train, y_train = create_features(dfs_train, label='y')
X_test, y_test = create_features(dfs_test, label='y')

reg = xgb.XGBRegressor(n_estimators=1000)
# Note: recent xgboost releases expect early_stopping_rounds in XGBRegressor(...) rather than in fit()
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        early_stopping_rounds=50,
        verbose=False)  # Change verbose to True if you want to see it train

_ = plot_importance(reg, height=0.9)

Forecast on Test Set

dfs_test['y_Prediction'] = reg.predict(X_test)
dfs_all = pd.concat([dfs_test, dfs_train], sort=False)

_ = dfs_all[['y', 'y_Prediction']].plot(figsize=(15, 5))

mean_squared_error(y_true=dfs_test['y'],
                   y_pred=dfs_test['y_Prediction'])

mean_absolute_error(y_true=dfs_test['y'],
                    y_pred=dfs_test['y_Prediction'])

def mean_absolute_percentage_error(y_true, y_pred):
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mean_absolute_percentage_error(y_true=dfs_test['y'],
                               y_pred=dfs_test['y_Prediction'])


#2

XGBoost assumes the data are i.i.d., so I’m not sure XGBoost is the right choice for time-series data (where the features are time-dependent).


#3

I think this is probably obvious to many who are better (and more recently) trained.

But XGBoost can work well, I think, if you make the data stationary. Specifically, I have had good success when I de-trend and difference the data. Taking the natural log of my data also seems to help.

Even after doing this, using subsample and colsample_bylevel helps tremendously with the i.i.d. problem. My understanding is that this can bring the individual trees closer to i.i.d., as it does with a Random Forest. Of course, the ordering of the data is lost by the time you have done all of this.

You may lose all (or most) of the time-series information that you are looking for. In which case, you may need a different method as suggested above. But this can be very effective for some things, IMHO.
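
A minimal sketch of the log-and-difference idea described above, together with the inverse step needed to turn predicted differences back into actual values. The helper names `to_stationary` and `from_stationary` are made up for illustration, the toy series is invented, and the target is assumed to be strictly positive; any separate de-trending step is not shown.

```python
import numpy as np
import pandas as pd

def to_stationary(y: pd.Series) -> pd.Series:
    """Log-transform, then take first differences (values must be > 0)."""
    return np.log(y).diff().dropna()

def from_stationary(diffs, last_observed: float) -> np.ndarray:
    """Invert to_stationary: accumulate log-differences from the last known level."""
    steps = np.asarray(diffs, dtype=float)
    return last_observed * np.exp(np.cumsum(steps))

# Toy series (hypothetical values)
y = pd.Series([100.0, 110.0, 121.0, 133.1])
d = to_stationary(y)                        # what a model would be trained on
restored = from_stationary(d, y.iloc[0])    # recovers y[1:] from the differences
```

At forecast time, `diffs` would be the model's predicted log-differences and `last_observed` the final value of the training series.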


#4

Kaggle provides multiple examples (and many winners used XGBoost for time series). For example, http://www.georgeburry.com/time-series-xgb/
or
https://mlwhiz.com/blog/2017/12/26/win_a_data_science_competition/

The code that I provided here works. It has produced better results than LSTM in my case (according to MAPE). My issue is actually using it to predict future values.


#5

I am having some problems with producing actual values too. I have found this works well as far as getting an ordering of the predictions (I just sort the predictions).

I thought I was doing well as far as predicting actual values (and maybe I am, if the rmse or mae are any indication). But on this blog I discovered ‘base_score’ and corrected it to the median value of my data (I use the mae metric to control outliers).
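
For readers who have not used it, this is roughly what setting `base_score` to the median looks like. It is only a sketch: `median_base_score_params` is a hypothetical helper, the sample values are invented, and passing `eval_metric` to the constructor is assumed to be supported by the installed xgboost version.

```python
import numpy as np

def median_base_score_params(y_train) -> dict:
    """Keyword arguments that start boosting from the median of the target."""
    return {
        "base_score": float(np.median(y_train)),  # initial prediction for every row
        "eval_metric": "mae",                     # metric mentioned in the post
    }

params = median_base_score_params([5, 7, 9, 13, 15])

# Intended use (assumes xgboost is installed and imported as xgb):
#   reg = xgb.XGBRegressor(n_estimators=1000, **params)
```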

My ordering remains very accurate (for the type of data) but the predictions (for the extremes or tail-data at least) are all off now.

In summary, I wish I could be more helpful. And I can only predict as far as I have differenced with my data. I cannot see how I could predict any further with the way I do it. But the i.i.d. problem may be adequately managed.

-Jim


#6

Although the practice is not statistically sound, you will often find that models like XGBoost work well despite the violation of the i.i.d. assumption.

One trick I’ve used in the past is to include past values in the prediction, i.e. use the lagged dependent variable as a feature in the data.

If I grokked the blog post you linked correctly, to predict 5 steps ahead as you suggest you can use a similar technique, i.e. for every data point you would create as many copies of it as the number of steps you want to predict ahead, and include the relevant lagged values as features.

For example, say this was your dataset:

idx, F1,  F2, F3,  y
1,   34, 100,  0,  5
2,   44,  54,  1,  7
3,   24,  77,  1,  9
4,   34,  64,  1, 13
5,   33,  44,  0, 15

Say we wanted to predict 2 steps ahead for data point 3.

Then we would transform the dataset to create copies of the data point such that the y value corresponds to the step ahead we are trying to predict.

Data point 3 was 3, 24, 77, 1 : 9. We would then create two copies of it, where the y value would become first y_4 (one step ahead), then y_5 (two steps ahead). This is essentially “lagging” the y value by two steps.

idx, F1, F2, F3,  y
3_1, 24, 77,  1, 13 <- Predict one step ahead
3_2, 24, 77,  1, 15 <- Predict two steps ahead

Now, this has the obvious issue that the dependent variable is different for the same feature values. What we need to do then is to include the lagged values of the dependent variable as features as well, to ensure that its past values also affect future ones.

idx, F1, F2, F3, F4(y_i-2), F5(y_i-1),  y
3_1, 24, 77,  1,         7,         9, 13 <- Predict one step ahead, based on the last two values
3_2, 24, 77,  1,         9,        13, 15 <- Predict two steps ahead, based on the last two values

I have probably made an off-by-one error here, but you get the gist of it. We now have our dataset as it would be in a typical supervised learning problem, where we have “enforced” dependence between consecutive data points by including the y values of past data points in the features of existing data points.

Still, before trying complicated models like XGBoost that really were not built for this purpose, you should first teach yourself the basics of forecasting, understand the issues with non-i.i.d. data, and try out simpler solutions like linear regression for forecasting, ARIMA models etc.
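
The transformation in the worked example above can be sketched in pandas as follows. This is one possible reading of the tables, not a definitive implementation: `make_horizon_rows` and the `step` column are names introduced here for illustration, and the lag columns hold the two y values that immediately precede the target, as in the last table.

```python
import pandas as pd

# The toy dataset from the example
df = pd.DataFrame(
    {"F1": [34, 44, 24, 34, 33],
     "F2": [100, 54, 77, 64, 44],
     "F3": [0, 1, 1, 1, 0],
     "y": [5, 7, 9, 13, 15]},
    index=pd.Index([1, 2, 3, 4, 5], name="idx"),
)

def make_horizon_rows(df: pd.DataFrame, horizon: int) -> pd.DataFrame:
    """One copy of every row per step ahead, with the two y values before the target as lags."""
    copies = []
    for h in range(1, horizon + 1):
        part = df[["F1", "F2", "F3"]].copy()
        part["step"] = h
        part["F4"] = df["y"].shift(-(h - 2))  # y two steps before the target
        part["F5"] = df["y"].shift(-(h - 1))  # y one step before the target
        part["y"] = df["y"].shift(-h)         # the target, h steps ahead
        copies.append(part)
    # Rows whose lags or target fall outside the data are dropped
    out = pd.concat(copies).dropna().reset_index()
    out = out.sort_values(["idx", "step"], ignore_index=True)
    return out.astype({"F4": int, "F5": int, "y": int})

rows = make_horizon_rows(df, horizon=2)
print(rows[rows["idx"] == 3])  # the two copies of data point 3
```

Note that for steps beyond the first, the lag columns contain values that are not yet observed at time idx, so at prediction time they would have to be filled in with earlier predictions.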

The best introductory text on forecasting that I have found is Forecasting: Principles and Practice, which is available for free, includes R code examples and a purpose-built package, assumes little prior knowledge and is focused on getting results out rather than the whole theory of time series analysis.

Hope this helps!


#7

@thvasilo I’ve seen a few users in the past asking the same question. Thanks a lot for the detailed explanation.


#8

Theodore, thank you for your very helpful discussion of the practical uses and limitations of XGBoost.

On a theoretical basis I wonder if this is necessarily true: “Although the practice is not statistically sound……(due to the) violation of the i.i.d. assumption.”

I am not sure this has been proven yet. Dr. Cho is correct that the i.i.d. assumption is sufficient. But has it been proven to be necessary?

Specifically, I wonder if it may also be sufficient to assume there is adequate mixing (which implies the process is ergodic)?

I wonder if this is analogous to the central limit theorem. It is sufficient to assume i.i.d. to prove the central limit theorem. But one can find multiple proofs of the central limit theorem using the assumption of adequate mixing and ergodicity.

I do not ask this to be a contrarian or to be purely theoretical. In fact, I would be happy with any further discussion of the practical limitations for things that happen to have a time-stamp on them but are made stationary.

But I would also be happy to know of any definite theoretical information (e.g., proofs) that suggests the i.i.d. assumption is necessary and the presence of adequate mixing is never sufficient (or may be sufficient).

Thank you.
-Jim


#9

That’s a good point that I don’t think has seen much research. AFAIK all theoretical analyses of gradient boosting optimization assume the data are i.i.d.; I’m not aware of any relaxations of that assumption.

If you want a good analysis of the gradient boosting optimization process, including convergence proofs, I’d recommend Optimization by gradient boosting by Biau and Cadet. You will see there in Section 2.1 that their whole analysis relies on the i.i.d. data assumption.


#10

Thank you for the reference!

In addition to addressing my theoretical question it addresses some of my practical considerations regarding validation and early stopping.

They found that with adequate regularization early stopping may not be necessary.

I have found that with a large enough min_child_weight early stopping may be less necessary. I am still looking into this. And unlike the authors, I am not prepared to let the program run indefinitely without concerns about overfitting.

But definitely helpful all around.

Thanks.

-Jim


#11

Thank you for the detailed response.

I understand how to implement ARIMA, NNs or FB Prophet for time-series forecasting. I want to try additional solutions.

As for XGBoost, the code that I have provided works fine, but I don’t understand how to do a “predict next X days with this model” action. Its scores made me want to try it, but I don’t understand how.

All Kaggle/github examples that I could find stop at calculating scores for the model, but they don’t provide details about actually using these models for future predictions.
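
Since the model in #1 uses only calendar features derived from the date index, one way to "predict the next X days" is to build those same features for future dates and pass them to `reg.predict`. A minimal sketch, assuming daily data and the feature order used in `create_features`; `make_future_features` is a hypothetical helper, not part of XGBoost or pandas, and the start date is invented.

```python
import pandas as pd

# Same column order as the training features in #1
FEATURES = ["hour", "dayofweek", "quarter", "month", "year",
            "dayofyear", "dayofmonth", "weekofyear"]

def make_future_features(last_date, periods: int, freq: str = "D") -> pd.DataFrame:
    """Calendar features for the `periods` timestamps that follow `last_date`."""
    # Generate one extra timestamp, then drop last_date itself
    idx = pd.date_range(start=last_date, periods=periods + 1, freq=freq)[1:]
    X = pd.DataFrame(index=idx)
    X["hour"] = idx.hour
    X["dayofweek"] = idx.dayofweek
    X["quarter"] = idx.quarter
    X["month"] = idx.month
    X["year"] = idx.year
    X["dayofyear"] = idx.dayofyear
    X["dayofmonth"] = idx.day
    X["weekofyear"] = idx.isocalendar().week.astype(int)
    return X[FEATURES]

future = make_future_features("2018-12-31", periods=5)

# Intended use with the fitted model from #1:
#   future = make_future_features(dfs.index.max(), periods=5)
#   future["y_Prediction"] = reg.predict(future[FEATURES])
```

This only works because none of the features depend on past values of y; once lagged values are used as features, the future rows cannot be built up front.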


#12

@skybullet for fully unlabeled data you might need to do sequential prediction, i.e. use the prediction made for day t+1 as a feature for day t+2, and the prediction made for day t+2 as a feature for day t+3.
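
A rough sketch of that sequential scheme, under stated assumptions: the model was trained on plain arrays whose columns are the last `n_lags` values of y, ordered oldest first, and it exposes a scikit-learn style `predict(X)`. `recursive_forecast` is a hypothetical helper written for this thread, not an XGBoost function.

```python
import numpy as np

def recursive_forecast(model, history, n_steps: int, n_lags: int) -> np.ndarray:
    """Predict n_steps ahead, feeding each prediction back in as the newest lag.

    `model` is any fitted regressor trained on rows of the form
    [y_{t-n_lags}, ..., y_{t-1}] -> y_t.
    """
    window = list(np.asarray(history, dtype=float)[-n_lags:])
    if len(window) < n_lags:
        raise ValueError("history must contain at least n_lags values")
    preds = []
    for _ in range(n_steps):
        x = np.array(window[-n_lags:]).reshape(1, -1)  # one row, n_lags columns
        y_hat = float(model.predict(x)[0])
        preds.append(y_hat)
        window.append(y_hat)  # the prediction becomes a feature for the next step
    return np.array(preds)

# Intended use (assumes reg was fitted on lag features built the same way):
#   next_5 = recursive_forecast(reg, dfs["y"], n_steps=5, n_lags=2)
```

Errors compound with this approach, because each step is predicted from earlier predictions rather than from observed values.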


#13

I totally agree with you. However, I don’t know how to properly code it. I could not find any example.