Can I manually add a tree to a fitted xgb model? [python]

Hi,

I trained an XGBoost ranker in Python, and I want to manually insert a tree into it. Is that possible?

thank you

You can save the model as JSON and modify it.

Thanks for the answer!
But can I then load it back and run predictions with ‘predict’ in order to check its behavior? How do I load it?

You can use save_model and load_model with a .json file extension. See the tutorial in the docs for a detailed explanation. As long as your modification complies with the model schema, XGBoost won’t know about it.
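
For example, something along these lines (a minimal sketch, assuming a fitted Booster bst and a DMatrix dtest to check predictions with):

import json
import xgboost as xgb

# Save in the JSON format (selected by the .json file extension).
bst.save_model('model.json')

# Edit the model as plain JSON.
with open('model.json') as f:
    model = json.load(f)
# ... modify model['learner']['gradient_booster']['model']['trees'] here ...
with open('model.json', 'w') as f:
    json.dump(model, f)

# Load the modified model back and check its behavior.
bst2 = xgb.Booster()
bst2.load_model('model.json')
preds = bst2.predict(dtest)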

But when I use save_model the JSON is not readable, and the readable option (i.e. Booster.dump_model) can’t be loaded back … is there any tool to work with it? How can I read it, or how can I get a readable version that I can load back?

What do you mean by JSON not being readable? Can you be more specific?

Yeah, when I save it with save_model I get some encoded content which I don’t know how to read (or, more specifically, how to add trees to):

7724 bf00 0000 0000 0000 003b 4b62 4540

whereas with Booster.dump_model I get something I can understand:

{ "nodeid": 0, "depth": 0, "split": "rel_log_rated_orders_by_stemmed_alphabetized_search_query", "split_condition": 0.822961092, "yes": 1, "no": 2, "missing": 1, "children": [

Did you specify the JSON extension when calling save_model? It looks like you got the binary format, not the JSON format.

bst.save_model('model.json')

Also make sure you are using XGBoost version 1.0.0 or later.
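
You can check your installed version with:

import xgboost
print(xgboost.__version__)  # should be 1.0.0 or later for JSON serialization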

Thank you very much! Updating the version worked!

Follow-up questions - I see that each tree (in bst_json['learner']['gradient_booster']['model']['trees']) is described by a dictionary:

  1. Does “default_left” describe the direction in which missing values go?
  2. If I understand correctly, the “split_conditions” entries correspond to the scores the model gives the leaves? However, when I actually perform the prediction I get these values multiplied by a constant. For example, for a model with only one tree I see:
    final_score = bst_json['learner']['learner_model_param']['base_score'] + constant * (the split_conditions value of the leaf my sample fell in)
    How is this constant calculated?

Edit: after some trial and error I realized the leaf value is not in base_weights but rather in the split_conditions list. I edited the message accordingly.
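
Something like this (a rough sketch, assuming a one-tree model saved as 'model.json') shows where the leaf values live:

import json

with open('model.json') as f:
    model = json.load(f)

tree = model['learner']['gradient_booster']['model']['trees'][0]

# In the JSON schema, left_children[i] == -1 marks node i as a leaf,
# and its value sits at split_conditions[i].
for i, left in enumerate(tree['left_children']):
    if left == -1:
        print(i, tree['split_conditions'][i])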

  1. Yes
  2. There should be no such constant. If you are using reg:squarederror, the final score should be base_score + split_condition. If the objective is binary:logistic, the final score is sigmoid(base_score + split_condition).
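
To make the arithmetic concrete, a minimal sketch (the leaf value here is hypothetical; in practice take the split_conditions entry of the leaf your sample lands in):

import json
import math

with open('model.json') as f:
    model = json.load(f)

base_score = float(model['learner']['learner_model_param']['base_score'])
leaf_value = 0.1234  # hypothetical leaf value from split_conditions

# reg:squarederror: the final score is the raw sum
print(base_score + leaf_value)

# binary:logistic: the same raw sum passed through the sigmoid
print(1.0 / (1.0 + math.exp(-(base_score + leaf_value))))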

You are right, I was using binary:logistic!
Thanks a lot!