We have an XOR dataset, which means any single split gets zero gain. Is it possible to overfit this dataset with XGBoost?
No, since XGBoost uses a greedy approach when choosing splits. With XOR data on two features, XGBoost considers only one feature per split, and no single-feature split yields a positive gain, so the tree never gets started.
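To make the zero-gain claim concrete, here is a minimal sketch that evaluates the split gain from the XGBoost tutorial, 1/2 [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma, by hand on the four-row XOR truth table. It does not call XGBoost itself. The squared-error loss, the starting prediction of 0.5, and the helper name `split_gain` are assumptions made for illustration.

```python
# XOR truth table: label = x1 XOR x2
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]


def split_gain(left_mask, grad, hess, reg_lambda=1.0, gamma=0.0):
    """Gain of a split that sends rows with left_mask[i] == True to the left child."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gl = sum(g for g, m in zip(grad, left_mask) if m)
    hl = sum(h for h, m in zip(hess, left_mask) if m)
    gr = sum(grad) - gl
    hr = sum(hess) - hl
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma


# Assumption: squared-error loss with a starting prediction of 0.5,
# so the gradient is (prediction - label) and the hessian is 1.
base = 0.5
grad = [base - t for t in y]
hess = [1.0] * len(y)

# Try the only meaningful threshold (0.5) on each feature in turn.
gains = {
    name: split_gain([row[j] < 0.5 for row in X], grad, hess)
    for j, name in enumerate(["x1", "x2"])
}
print(gains)
```

In both cases each child holds one positive and one negative example, so the gradient sums on both sides cancel and the gain is zero.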
Thanks for the quick reply. Here is my issue. Sometimes a feature is not informative for a single split because splitting on it yields no information gain, yet splitting on it would make later splits easier. For example, suppose we have a dataset with house price as the label and features such as city name, distance to the city center, and location. City name is not informative for a single split if the average house price is the same across cities. However, once the data is split by city, the location split becomes much easier, because different locations within a given city are informative. Of course, the easy way is to combine city name and location into a single feature. But if we have thousands of features, the feature dimension will explode. Any suggestions for this problem would be appreciated!
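A toy illustration of the house-price case, with made-up numbers chosen only to reproduce the situation described: the average price is identical across cities and across locations, but the combined city-and-location feature separates the prices. The rows, the prices, and the helper `group_means` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up (city, location, price) rows; not real data.
rows = [
    ("A", "north", 300), ("A", "south", 100),
    ("B", "north", 100), ("B", "south", 300),
]


def group_means(key):
    """Average price per group, where key(city, location) names the group."""
    groups = defaultdict(list)
    for city, location, price in rows:
        groups[key(city, location)].append(price)
    return {name: mean(prices) for name, prices in groups.items()}


by_city = group_means(lambda city, location: city)
by_location = group_means(lambda city, location: location)
# Combined feature: one category per (city, location) pair.
by_cross = group_means(lambda city, location: f"{city}_{location}")

print(by_city)
print(by_location)
print(by_cross)
```

The combined feature has one category per city and location pair, so its cardinality grows with the product of the two, which is the dimension blow-up mentioned above.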
One more thing: we could also split the dataset into smaller datasets, one per city. But this would sharply reduce the size of each dataset.
Unfortunately, this is a limitation of XGBoost. You will want to do feature engineering to prevent XOR-type data from appearing, that is, to avoid the situation where every candidate split produces zero gain.
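As a rough sketch of what such feature engineering could look like on XOR data, the snippet below adds a product feature `x1 * x2` and evaluates the standard XGBoost split-gain formula by hand, with lambda = 1 and gamma = 0. The product feature is just one possible choice, and the squared-error loss, the starting prediction of 0.5, and the helper `split_gain` are assumptions for illustration.

```python
# XOR truth table: label = x1 XOR x2
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

# Hypothetical engineered feature: append the product x1 * x2 as a third column.
X_aug = [(x1, x2, x1 * x2) for x1, x2 in X]


def split_gain(left_mask, grad, hess, reg_lambda=1.0):
    """Split gain with gamma = 0; rows with left_mask[i] == True go left."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gl = sum(g for g, m in zip(grad, left_mask) if m)
    hl = sum(h for h, m in zip(hess, left_mask) if m)
    gr = sum(grad) - gl
    hr = sum(hess) - hl
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr))


# Assumption: squared-error loss, starting prediction 0.5.
grad = [0.5 - t for t in y]
hess = [1.0] * len(y)

gain_x1 = split_gain([row[0] < 0.5 for row in X_aug], grad, hess)
gain_product = split_gain([row[2] < 0.5 for row in X_aug], grad, hess)
print(gain_x1, gain_product)
```

Splitting on the product isolates the (1, 1) row, so the gradient sums in the two children no longer cancel and the greedy search has a positive-gain split to start from.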
Ok, got it. Thanks a lot:)