Feature weights and sampling

Hi guys!

I don’t understand how feature weights work in the following edge case:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

X = pd.DataFrame(np.random.choice([True, False], size=(10, 3)), columns=list('ABC'))
y = pd.DataFrame(np.random.choice([True, False], size=(10, 1)), columns=list('L'))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=123)

dtrain = xgb.DMatrix(X_train, y_train)
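# only feature C gets a non-zero sampling weight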
fw = [0, 0, 1]
dtrain.set_info(feature_weights=fw)

params = {"verbosity": 1, "device": "cpu", "objective": "binary:hinge", "tree_method": "hist",
          "colsample_bytree": .7, "seed": 123,
          "eta": .3, "max_depth": 6}

clf = xgb.train(params=params, dtrain=dtrain, num_boost_round=20)

print(clf.get_score(importance_type="weight"))

In this example, I am sampling 70% of the features (colsample_bytree=.7) with feature weights [0, 0, 1].
The output is {'A': 8.0, 'B': 4.0, 'C': 4.0}, but I was expecting only feature C to be used.

How are the weights used in a case like this, where most of them are zero? Are they ignored?

Hi, it’s a bit of a trade-off: given 3 features and a 0.7 sampling rate, floor(3 * 0.7) = 2, so the sampler has to return at least two features per tree, even though only one of them has a non-zero weight. In your example, the sampling rate would effectively have to be limited to roughly 0.3 (one feature out of three) for the zero weights to be fully respected. Checking up front whether the weights are consistent with the sampling rate would be more involved.
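
A quick way to sanity-check this (a rough sketch, reusing the toy setup from the question and assuming the behaviour described above) is to lower colsample_bytree to 1/3, so that only floor(3 * 1/3) = 1 feature is drawn per tree. The zero-weight features should then never be selected, and only C should appear in the importance scores:

import numpy as np
import pandas as pd
import xgboost as xgb

# Same toy data as above: 3 boolean features, only C has a non-zero weight.
X = pd.DataFrame(np.random.choice([True, False], size=(10, 3)), columns=list('ABC'))
y = np.random.choice([True, False], size=10)

dtrain = xgb.DMatrix(X, y)
dtrain.set_info(feature_weights=[0, 0, 1])

# floor(3 * 1/3) = 1 feature per tree, so the weighted sampler can honour the zeros.
params = {"objective": "binary:hinge", "tree_method": "hist",
          "colsample_bytree": 1 / 3, "seed": 123, "eta": .3, "max_depth": 6}

clf = xgb.train(params=params, dtrain=dtrain, num_boost_round=20)
print(clf.get_score(importance_type="weight"))  # expected to list only 'C'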