How do I tell an XGBoost model that two features are related and should not be interpreted on their own?


#1

XGBoost applies boosting to decision trees. When I look at the decision-making logic of decision trees, I notice that each split is based on one feature at a time. In real life, some features are related to each other.

Currently, when I feed data to the model, I simply feed all the features to it without telling the model how certain features are related to each other.

Let me describe a hypothetical example to make this clearer. Suppose I have two features: gender and hair length. In this hypothetical problem, I know from domain knowledge that if gender is female, hair length matters in determining the outcome. If gender is male, hair length is irrelevant. How do I give the machine learning model this valuable piece of information so that it can learn better?

I am using XGBoost with Python 3.7.


#2

You'd want the ability to customize feature selection per node, and XGBoost does not offer that yet. In your example, if a test node uses gender as its splitting feature, then only the left child node (Female?=Yes) should consider hair length; the right child node (Female?=No) should ignore it.
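To make the desired tree shape concrete, here is a hand-written sketch of the rule described above. It is not XGBoost code; the function name, the threshold value, and the leaf labels are all made up for illustration.

```python
def desired_rule(is_female, hair_length_cm, threshold_cm=20.0):
    """Hand-written stand-in for the tree shape described above.

    The threshold and the leaf labels are hypothetical placeholders.
    """
    if is_female:
        # Left child (Female? = Yes): hair length may be used for a further split.
        if hair_length_cm >= threshold_cm:
            return "leaf_female_long"
        return "leaf_female_short"
    # Right child (Female? = No): hair length is never consulted.
    return "leaf_male"
```

The point is that the second split only exists on one side of the gender split, which is the part XGBoost gives you no direct way to enforce.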

See https://github.com/dmlc/xgboost/issues/4230, a proposal to enable user-defined split evaluators.

The closest feature we have is feature interaction constraints, which control which features are allowed to appear together on the same tree path. For example, you can forbid the gender feature from interacting with hair length.
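A minimal sketch of how interaction constraints could be set up through the `interaction_constraints` training parameter, assuming the data has exactly two columns in the order gender (index 0) and hair length (index 1). The helper names `build_params` and `train`, the column order, and the other parameter values are assumptions made for this example, not something from your setup.

```python
import json

# Assumed column order for this example: 0 = gender, 1 = hair_length.
FEATURES = ["gender", "hair_length"]


def build_params(allow_interaction):
    """Build XGBoost training parameters with an interaction constraint.

    Each inner list is a group of feature indices that may appear
    together on the same path from the root to a leaf.
    """
    if allow_interaction:
        groups = [[0, 1]]    # gender and hair length may interact
    else:
        groups = [[0], [1]]  # each feature alone: no interaction allowed
    return {
        "objective": "binary:logistic",  # assumed binary outcome
        "tree_method": "hist",
        "max_depth": 3,
        # Passed as a JSON-style string of nested lists.
        "interaction_constraints": json.dumps(groups),
    }


def train(X, y, allow_interaction=True, num_boost_round=50):
    """Train a booster on a two-column feature matrix X and labels y."""
    import xgboost as xgb  # imported here so build_params works without xgboost

    dtrain = xgb.DMatrix(X, label=y, feature_names=FEATURES)
    params = build_params(allow_interaction)
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)
```

Note that the constraint is symmetric: it says whether two features may share a path at all, not that hair length may only be used under Female?=Yes.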