XGBoost stability and data leak

I am working with XGBoost on a time-series prediction problem involving stocks. The challenges I face are:

  • Data leak: Neighboring data points can look very similar, so there is some leakage between the training and validation sets
  • Stability: If I use a large number of trees (around 1,000), the bias/variance dilemma shows up (i.e. train and validation performance look great, but test performance is poor)

What I have done so far is reduce the number of trees (~100) and avoid leakage during hyperparameter evaluation by using purged cross-validation with an embargo, which doesn't leak data (reference: https://blog.quantinsti.com/cross-validation-embargo-purging-combinatorial/). A rough sketch of the purging idea is below. Still, there is some run-to-run randomness, because I use all of the data for my final run.
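To make the purging idea concrete, here is a very simplified sketch of what I mean by purged splits with an embargo (the actual combinatorial scheme in the linked article is more involved; the `purge` and `embargo` window sizes here are placeholders, not my real settings):

```python
import numpy as np

def purged_splits(n_samples, n_splits=5, purge=10, embargo=10):
    """Yield (train_idx, test_idx) pairs where training samples within
    `purge` points before the test block and `embargo` points after it
    are dropped, to reduce leakage from neighboring, look-alike rows."""
    indices = np.arange(n_samples)
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        test_start = k * fold_size
        test_end = min(test_start + fold_size, n_samples)
        test_idx = indices[test_start:test_end]
        # Keep only training samples far enough from the test block on either side.
        train_mask = (indices < test_start - purge) | (indices >= test_end + embargo)
        yield indices[train_mask], test_idx
```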

I would appreciate any advice on this. My current thinking is to generate three runs and average the predictions (I have yet to see whether this improves stability). A related question: is there a way to control the random sampling of the underlying algorithm? Perhaps I could prevent some of the leakage if I could. Thanks everyone!
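For context, this is roughly what I mean by averaging over several runs while fixing the sampling randomness (a sketch using the scikit-learn wrapper `xgboost.XGBRegressor`; the `subsample`/`colsample_bytree` values are placeholders, and `random_state` is what seeds the row/column subsampling):

```python
import numpy as np
import xgboost as xgb

def averaged_prediction(X_train, y_train, X_test, seeds=(0, 1, 2)):
    """Fit one model per seed and average the predictions across runs."""
    preds = []
    for seed in seeds:
        model = xgb.XGBRegressor(
            n_estimators=100,      # ~100 trees, as mentioned above
            subsample=0.8,         # placeholder; row subsampling per tree
            colsample_bytree=0.8,  # placeholder; feature subsampling per tree
            random_state=seed,     # controls the sampling RNG, making each run reproducible
        )
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)
```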
