How to tell XGBoost model that 2 features are related and should not be interpreted stand-alone?

helpme1 · October 2, 2019, 4:16am

XGBoost applies boosting to decision trees. When I look at the decision-making logic of a decision tree, I notice that each split is based on one feature at a time. In real life, some features are related to each other.

Currently, when I feed data to the model, I simply feed all the features to it without telling the model how certain features are related to each other.

Let me describe a hypothetical example to be clearer. Suppose I have 2 features - gender and length of hair. In this hypothetical problem, I know from my domain knowledge that if gender is female, length of hair matters in determining the outcome. If gender is male, length of hair is irrelevant. How do I tell the machine learning model this valuable piece of information so that the model can learn better?

I am using XGBoost on Python 3.7.

hcho3 · October 2, 2019, 6:31am

You’d want the ability to customize feature selection at each node, and XGBoost does not currently offer that. In your example, if a test node uses gender as its splitting feature, then only the left child node (Female?=Yes) should consider hair length; the right child node (Female?=No) should ignore it.

See https://github.com/dmlc/xgboost/issues/4230, a proposal to enable a user-defined split evaluator.

The closest feature we have is feature interaction constraints, which restrict which features may be split on together along the same path in a tree. With them, you can forbid the gender feature from interacting with hair length.
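
A rough sketch of how such a constraint might be set up, not an official recipe. The column names, their order, and the extra `age` feature are made up for illustration; `interaction_constraints` is the XGBoost parameter, which takes groups of feature indices, and the helper `to_index_groups` is hypothetical.

```python
def to_index_groups(feature_names, name_groups):
    """Translate groups of feature names into groups of column indices."""
    position = {name: i for i, name in enumerate(feature_names)}
    return [[position[name] for name in group] for group in name_groups]


# Hypothetical column order of the training matrix.
feature_names = ["gender", "hair_length", "age"]

# Features may interact only with features in the same group, so putting
# gender and hair_length in separate groups forbids them from interacting.
name_groups = [["gender", "age"], ["hair_length"]]

constraints = to_index_groups(feature_names, name_groups)
print(constraints)  # [[0, 2], [1]]

# Assumed usage with the scikit-learn wrapper (requires xgboost installed):
# model = xgboost.XGBClassifier(interaction_constraints=constraints)
```

The groups are kept disjoint on purpose, since overlapping groups can let features interact indirectly through the shared feature.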

brebbles · June 23, 2020, 10:54am

Using your example: would it work to engineer a new column that holds the hair length if gender == female, and 0 (or some value outside the range) otherwise? Then, if the tree first splits on male/female, could it go on to split on female hair length within the female branch?
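
To make the idea concrete, here is a minimal sketch in plain Python. The column name `female_hair_length`, the sample rows, and the sentinel value of -1.0 are all assumptions for illustration; whether the model actually learns better this way is exactly the open question.

```python
# Hypothetical sentinel: any value outside the real hair-length range.
SENTINEL = -1.0


def female_hair_length(gender, hair_length, sentinel=SENTINEL):
    """Return hair_length for female rows and the sentinel for all others."""
    return hair_length if gender == "female" else sentinel


# Made-up sample rows standing in for the real training data.
rows = [
    {"gender": "female", "hair_length": 30.0},
    {"gender": "male", "hair_length": 5.0},
    {"gender": "female", "hair_length": 12.5},
    {"gender": "male", "hair_length": 40.0},
]

# Add the engineered column to every row.
for row in rows:
    row["female_hair_length"] = female_hair_length(row["gender"], row["hair_length"])

engineered = [row["female_hair_length"] for row in rows]
print(engineered)  # [30.0, -1.0, 12.5, -1.0]
```

Presumably the raw hair length column would then be left out of the training data, so the model only sees hair length through the gated column.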