Fixing the feature used for the first split

mhl2008 · October 3, 2021, 7:43pm

I am looking for a way to control which feature is used for the first split. In other words, I want to force the GBM model to split first on a particular feature. Any idea if there is an easy way to do this?

Some context for a possible use case: suppose we have two populations, A and B, that are different enough from each other. We can build two separate models on these populations, but then we need to maintain two models. Alternatively, we can create an indicator for the populations, feed the data for both to our GBM, and let the model decide how to split and distinguish between them with the help of the indicator value we created. This way we only need to maintain one model. However, we see that the model performance in the latter case (one model) is not as good as in the former (two models). One way to solve this issue is to fix the first split by forcing the model to split on the indicator column first and then let GBM decide the next splits.

jiamingy · October 21, 2021, 6:10am

I don’t think that’s possible in the short term. Maybe one can invent some sort of split constraint, like the monotonic constraints or feature interaction constraints. Feel free to explore a general solution and write about it.
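
Until something like that exists, one possible workaround is to do the routing outside the trees. If every tree were forced to split first on the indicator, the two branches would never share rows, so the result should behave roughly like two sub-models. The sketch below wraps one sub-model per indicator value behind a single fit/predict object, so there is still only one artifact to maintain. This is not an XGBoost feature: `RoutedModel`, `make_model`, and `indicator_col` are hypothetical names made up for illustration, and `MeanModel` is only a stand-in so the example runs without extra dependencies.

```python
import numpy as np


class MeanModel:
    """Stand-in regressor (hypothetical): predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


class RoutedModel:
    """Hypothetical wrapper: one sub-model per value of the indicator column.

    Assumes make_model() returns an object whose fit(X, y) returns the
    fitted model and which has predict(X), as scikit-learn style
    estimators do.
    """

    def __init__(self, make_model, indicator_col):
        self.make_model = make_model
        self.indicator_col = indicator_col
        self.models_ = {}

    def _split(self, X):
        X = np.asarray(X, dtype=float)
        groups = X[:, self.indicator_col]
        # The sub-models never see the indicator; it is constant within a group.
        feats = np.delete(X, self.indicator_col, axis=1)
        return groups, feats

    def fit(self, X, y):
        groups, feats = self._split(X)
        y = np.asarray(y, dtype=float)
        for g in np.unique(groups):
            mask = groups == g
            self.models_[g] = self.make_model().fit(feats[mask], y[mask])
        return self

    def predict(self, X):
        groups, feats = self._split(X)
        # Rows whose indicator value was not seen during fit stay NaN.
        out = np.full(len(groups), np.nan)
        for g, model in self.models_.items():
            mask = groups == g
            if mask.any():
                out[mask] = model.predict(feats[mask])
        return out
```

To try it, build it with `RoutedModel(MeanModel, indicator_col=0)`, then call `fit(X, y)` and `predict(X)` on an array whose first column is the A/B indicator. Passing `xgboost.XGBRegressor` as `make_model` should work the same way, since it follows the scikit-learn fit/predict interface. The trade-off is that the sub-models share nothing, which is exactly what the two-model setup does.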