Retrieving and modifying XGBoost weights


#1

I am using the xgboost library to train a binary classifier. I would like to prevent data leakage from the trained model by adding noise to its weights (i.e. the values at the leaf nodes of the trees in the ensemble). To do that, I need to retrieve the weights of each tree and modify them.

I can see the weights by using dump_model or trees_to_dataframe on the Booster object, which I define as

model = xgb.Booster(params, [dtrain])

The latter method returns a pandas DataFrame:

   Tree  Node    ID                          Feature  Split   Yes    No Missing        Gain     Cover
0      0     0   0-0                           tenure   17.0   0-1   0-2     0-1  671.161072  1595.500
1      0     1   0-1      InternetService_Fiber optic    1.0   0-3   0-4     0-3  343.489227   621.125
2      0     2   0-2      InternetService_Fiber optic    1.0   0-5   0-6     0-5  293.603149   974.375
3      0     3   0-3                           tenure    4.0   0-7   0-8     0-7   95.604340   333.750
4      0     4   0-4                     TotalCharges  120.0   0-9  0-10     0-9   27.897919   287.375
5      0     5   0-5                Contract_Two year    1.0  0-11  0-12    0-11   32.057739   512.625
6      0     6   0-6                           tenure   60.0  0-13  0-14    0-13  120.693176   461.750
7      0     7   0-7  TechSupport_No internet service    1.0  0-15  0-16    0-15   37.326447   149.750
8      0     8   0-8  TechSupport_No internet service    1.0  0-17  0-18    0-17   34.968536   184.000
9      0     9   0-9                  TechSupport_Yes    1.0  0-19  0-20    0-19    0.766754    65.500
10     0    10  0-10                MultipleLines_Yes    1.0  0-21  0-22    0-21   19.335510   221.875
11     0    11  0-11                 PhoneService_Yes    1.0  0-23  0-24    0-23   19.035950   281.125
12     0    12  0-12                             Leaf    NaN   NaN   NaN     NaN   -0.191398   231.500
13     0    13  0-13   PaymentMethod_Electronic check    1.0  0-25  0-26    0-25   43.379410   320.875
14     0    14  0-14                Contract_Two year    1.0  0-27  0-28    0-27   13.401367   140.875
15     0    15  0-15                             Leaf    NaN   NaN   NaN     NaN    0.050262    94.500
16     0    16  0-16                             Leaf    NaN   NaN   NaN     NaN   -0.052444    55.250
17     0    17  0-17                             Leaf    NaN   NaN   NaN     NaN   -0.058929   111.000
18     0    18  0-18                             Leaf    NaN   NaN   NaN     NaN   -0.148649    73.000
19     0    19  0-19                             Leaf    NaN   NaN   NaN     NaN    0.161464    63.875

Leaf values are stored in the Gain column (leaf nodes are the rows with the value Leaf in the Feature column). I could therefore add noise to those rows of the Gain column, but I do not know how to convert the pandas DataFrame back into a Booster object/XGBoost model. How should I go about this? Or is there a better way to retrieve and modify the leaf values of an XGBoost model?


#2

I couldn’t find a way to do this; maybe someone else has some input.

It is possible to export the model to text, and there is ongoing work on a JSON model format that should solve this type of issue.
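
As a rough sketch of what editing leaf values could look like once a model is available as JSON: the snippet below assumes the layout that, as far as I know, recent XGBoost releases write with save_model("model.json"), where each tree stores parallel arrays left_children and split_conditions, a node is a leaf when its left child is -1, and the leaf value sits in split_conditions at that index. Those field names should be checked against the installed version. The helper add_noise_to_leaves and the toy model are made up for illustration, so the example runs without xgboost.

```python
import random


def add_noise_to_leaves(model, scale, rng):
    """Add Gaussian noise to every leaf value of a JSON-style model dict, in place.

    Assumes the (unverified) layout described above; returns the number of
    leaves that were modified.
    """
    trees = model["learner"]["gradient_booster"]["model"]["trees"]
    n_changed = 0
    for tree in trees:
        values = tree["split_conditions"]
        for node, left_child in enumerate(tree["left_children"]):
            if left_child == -1:  # assumed marker for a leaf node
                values[node] = float(values[node]) + rng.gauss(0.0, scale)
                n_changed += 1
    return n_changed


# Hypothetical stand-in for a loaded model file: one tree with a root split
# (node 0) and two leaves (nodes 1 and 2).
toy_model = {"learner": {"gradient_booster": {"model": {"trees": [
    {
        "left_children": [1, -1, -1],
        "right_children": [2, -1, -1],
        "split_conditions": [17.0, -0.19, 0.05],
    },
]}}}}

changed = add_noise_to_leaves(toy_model, scale=0.01, rng=random.Random(0))

# With a real model the round trip would look roughly like this (not run here):
#   booster.save_model("model.json")
#   with open("model.json") as f: model = json.load(f)
#   add_noise_to_leaves(model, scale=0.01, rng=random.Random())
#   with open("noisy.json", "w") as f: json.dump(model, f)
#   noisy = xgb.Booster(model_file="noisy.json")
```

Only the leaf entries are touched; the split thresholds stored in the same array for internal nodes stay as they were, so the tree structure is unchanged.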