How to reduce model randomness / stabilize the model?

When the data has only minor changes (or no changes at all), how can I make the model produce predictions without much difference? In other words, how can I keep the predictions stable?

Thanks!


For example, when the same data is used to train different models, I would like the predictions of those different models to give the same results.
Thanks.

Try setting reg_lambda to a high value. This should decrease the magnitude of the leaf-node outputs, making changes less dramatic.
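For instance (a minimal sketch on synthetic stand-in data; the parameter values are illustrative, not tuned):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))                    # stand-in data
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=500)

# reg_lambda is the L2 penalty on leaf weights; it shrinks leaf outputs
# toward zero, so predictions move less when inputs change slightly.
model = xgb.XGBRegressor(
    n_estimators=200,
    reg_lambda=10.0,   # default is 1.0; noticeably larger here
    random_state=0,
)
model.fit(X_train, y_train)
```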

Also, you can generate synthetic data where each row is a small perturbation of a row in the original dataset (the label would stay the same).
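A rough sketch of that augmentation idea (the helper name `augment_with_noise` and the noise scale are my own choices):

```python
import numpy as np

def augment_with_noise(X, y, n_copies=3, scale=0.01, seed=0):
    """Append noisy copies of each row, keeping the labels unchanged."""
    rng = np.random.default_rng(seed)
    feature_std = X.std(axis=0)
    Xs, ys = [X], [y]
    for _ in range(n_copies):
        # Perturb each feature by a small fraction of its own spread.
        noise = rng.normal(0.0, scale, size=X.shape) * feature_std
        Xs.append(X + noise)
        ys.append(y)
    return np.vstack(Xs), np.concatenate(ys)
```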

I am now checking:

  1. For the same data, whether parameters such as seed, subsample, etc. have any effect.
  2. For small changes in the data, how much the stability is affected.

Someone said that for 1., the seed may have an effect.
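One way to check item 1 is to train the same configuration under several seeds and measure how far the predictions drift (a sketch on stand-in data; with subsample=1.0 and all colsample_* left at 1.0, the seed should have no effect on tree construction):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # stand-in data
y = X[:, 0] + rng.normal(scale=0.1, size=500)

preds = []
for seed in range(5):
    model = xgb.XGBRegressor(
        n_estimators=100,
        subsample=0.8,        # with sampling on, the seed matters
        colsample_bytree=0.8,
        random_state=seed,
    )
    model.fit(X, y)
    preds.append(model.predict(X))

# Per-row spread across seeds; 0 would mean the seed has no effect.
print("max std across seeds:", np.array(preds).std(axis=0).max())
```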

But I will also try your suggestion to increase reg_lambda. What do you mean by a large reg_lambda? I am currently setting it to 0.8. If I remember correctly, in previous experiments a small reg_lambda gave better prediction results and a larger reg_lambda reduced the performance.

Yes, in order to get stability, I may need to lose some performance.
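Something like the following rough sketch (stand-in data; the reg_lambda values are arbitrary) could quantify that trade-off between accuracy and seed-to-seed stability:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=600)
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

for lam in [0.8, 2.0, 10.0, 50.0]:
    seed_preds = []
    for seed in (0, 1, 2):
        model = xgb.XGBRegressor(
            n_estimators=100, subsample=0.8,
            reg_lambda=lam, random_state=seed,
        )
        model.fit(X_tr, y_tr)
        seed_preds.append(model.predict(X_val))
    seed_preds = np.array(seed_preds)
    # Accuracy of the seed-averaged prediction vs. spread across seeds.
    rmse = np.sqrt(np.mean((seed_preds.mean(axis=0) - y_val) ** 2))
    spread = seed_preds.std(axis=0).mean()
    print(f"reg_lambda={lam}: rmse={rmse:.4f}, seed spread={spread:.4f}")
```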

I will check.

Any comments are welcome.
Thanks!

I know this is an old post, but I too hit issues around model stability (I do randomize seeds) and have lots of features. It's problematic to build reproducible numbers. I tried adding a large number of trees (thousands) and it still didn't help. Any luck with your experiments?

The model should be bit-by-bit reproducible given the same environment (GPU model, number of CPU threads). Some scenarios might have exceptions; for instance, if you are using distributed training, the data partitioning from the framework (like Dask or Spark) might not be deterministic.
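For example, a quick check (a sketch; the data and parameters are stand-ins) that two runs with identical settings produce identical predictions:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

params = {"max_depth": 6, "eta": 0.1, "seed": 7, "nthread": 4}
dtrain = xgb.DMatrix(X, label=y)

# Train twice with identical settings in the same environment.
booster_a = xgb.train(params, dtrain, num_boost_round=50)
booster_b = xgb.train(params, dtrain, num_boost_round=50)

print("identical predictions:",
      np.array_equal(booster_a.predict(dtrain), booster_b.predict(dtrain)))
```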

Feel free to open an issue if you have a sample that generates non-deterministic models.

Maybe the right question to ask is when to use subsampling and when not to; the answer to that question might help you avoid using subsampling when it should not be used for the data at hand.
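For example, if stability matters more than the regularizing effect of sampling, one option (an illustration, not a general rule) is to switch sampling off entirely:

```python
import xgboost as xgb

# With all row/column sampling disabled, every tree sees the full data,
# so the seed no longer influences tree construction.
model = xgb.XGBRegressor(
    n_estimators=200,
    subsample=1.0,           # no row subsampling (the default)
    colsample_bytree=1.0,    # no per-tree column sampling
    colsample_bylevel=1.0,
    colsample_bynode=1.0,
)
```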