Any way to see which features were selected for each tree, level, split by colsample_by*?


#1

I already posted this on GitHub, but thought it might be more appropriate here, since it is not an issue but rather a very important feature that would greatly improve the ability to debug. If it is not currently implemented, I was wondering whether a parameter record_selections could be added to the train function so that when one does booster.dump, one also sees all the features selected, along these lines:

booster[0]: (features selected:[feature1,feature3,feature8…])

0:[feature1<0] yes=1,no=2,missing=1,gain=20.1,cover=10 (features selected:[feature1,feature8,…])

1:[feature3<1] yes=3,no=4,missing=3,gain=5,cover=9 (features selected:[feature3,feature11,...])

#2

Do you want to collect all features used in each subtree? Here is a short script that achieves it, using JSON dump and a little bit of Python:

import xgboost
import json

def annotate_tree(node):
    # Post-order traversal: annotate both children first, then combine their
    # feature lists with this node's own split feature.
    if 'children' in node:
        annotate_tree(node['children'][0])
        annotate_tree(node['children'][1])
        node['features_used'] = node['children'][0]['features_used'] + node['children'][1]['features_used'] + [node['split']]
    else:
        # Leaves do not split on any feature.
        node['features_used'] = []

def print_tree(node, depth=0):
    # Print the tree in a text-dump-like layout, indented by depth.
    indent = '  ' * depth
    if 'children' in node:
        print(f'{indent}{node["nodeid"]}:[f{node["split"]}<{node["split_condition"]}] yes={node["yes"]}, no={node["no"]}, missing={node["missing"]}, features used in this subtree: {node["features_used"]}')
        print_tree(node['children'][0], depth=depth + 1)
        print_tree(node['children'][1], depth=depth + 1)
    else:
        print(f'{indent}{node["nodeid"]}:leaf={node["leaf"]}')

# Load a previously saved model.
bst = xgboost.Booster(model_file='xgb.model')

# get_dump returns one JSON string per tree.
for tree_id, tree_dump in enumerate(bst.get_dump(dump_format='json')):
    print(f'booster[{tree_id}]:')
    tree = json.loads(tree_dump)
    annotate_tree(tree)
    print_tree(tree)

Example output:

booster[0]:
0:[f29<-9.53674316e-07] yes=1, no=2, missing=1, features used in this subtree: [56, 109, 29]
  1:[f56<-9.53674316e-07] yes=3, no=4, missing=3, features used in this subtree: [56]
    3:leaf=-0.856615365
    4:leaf=0.853982329
  2:[f109<-9.53674316e-07] yes=5, no=6, missing=5, features used in this subtree: [109]
    5:leaf=0.971056461
    6:leaf=-0.963636339
booster[1]:
0:[f29<-9.53674316e-07] yes=1, no=2, missing=1, features used in this subtree: [56, 109, 29]
  1:[f56<-9.53674316e-07] yes=3, no=4, missing=3, features used in this subtree: [56]
    3:leaf=0.856615365
    4:leaf=-0.853982329
  2:[f109<-9.53674316e-07] yes=5, no=6, missing=5, features used in this subtree: [109]
    5:leaf=-0.971056461
    6:leaf=0.963636339
booster[2]:
0:[f60<-9.53674316e-07] yes=1, no=2, missing=1, features used in this subtree: [29, 60]
  1:[f29<-9.53674316e-07] yes=3, no=4, missing=3, features used in this subtree: [29]
    3:leaf=-0.393318802
    4:leaf=0.485989004
  2:leaf=3.19529176
booster[3]:
0:[f60<-9.53674316e-07] yes=1, no=2, missing=1, features used in this subtree: [29, 60]
  1:[f29<-9.53674316e-07] yes=3, no=4, missing=3, features used in this subtree: [29]
    3:leaf=0.393318832
    4:leaf=-0.485989004
  2:leaf=-3.19529128

I do not think it is necessary to add a record_selections parameter, since the information you want can be obtained by processing the JSON dump.


#3

Thank you Philip, I have used your code for many years, including a version of the snippet above. Let me explain in a bit more detail why it is very important to have this:

Measuring how good a feature is by either gain or total_gain has issues: gain ignores how many times the feature is useful, and total_gain ignores whether the feature was excluded by colsample_by*. So, to get a less random estimate of a feature's usefulness, I need to know which features were available as candidates. For simplicity, let’s say only colsample_bytree is used and it is 0.5, with 100 features used in training. If feature1 was not selected for the first tree, and its total_gain (or gain) is not very high for that reason, it should not be penalized; for that purpose I need to know the 50 features used for that tree.

By the same logic, I need it for each tree, level, and split.
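
A rough sketch of the adjustment I have in mind, assuming the per-tree candidate sets were available (they are not exposed today, which is the point of this request). The names candidates_per_tree, gain_per_tree and gain_per_opportunity are hypothetical, and the numbers are made up for illustration:

```python
from collections import defaultdict

def gain_per_opportunity(candidates_per_tree, gain_per_tree):
    """Divide each feature's total gain by the number of trees in which
    column sampling made it available as a candidate."""
    total_gain = defaultdict(float)
    opportunities = defaultdict(int)
    for candidates, gains in zip(candidates_per_tree, gain_per_tree):
        # Count every tree where the feature could have been used.
        for feature in candidates:
            opportunities[feature] += 1
        # Accumulate the gain the feature actually produced in this tree.
        for feature, gain in gains.items():
            total_gain[feature] += gain
    return {f: total_gain[f] / n for f, n in opportunities.items()}

# Hypothetical inputs: three trees, each with its sampled candidate set
# and the gain contributed by the features that ended up in splits.
candidates_per_tree = [
    {"f0", "f1"},
    {"f0", "f2"},
    {"f1", "f2"},
]
gain_per_tree = [
    {"f0": 20.0},
    {"f0": 10.0, "f2": 5.0},
    {"f1": 12.0},
]

scores = gain_per_opportunity(candidates_per_tree, gain_per_tree)
for feature in sorted(scores):
    print(feature, scores[feature])
```

With this kind of normalization, a feature is only compared against the trees where it had a chance to be picked, rather than against all trees.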


#4

@h17 In that case, you actually want to see the space of all candidate features chosen by column sampling? So let’s say we have 100 features in the data matrix. If colsample_bytree is set to 0.5, then the first tree will have a list of 50 candidate features, is that right? In that case, it’s not achievable by parsing the JSON dump.
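
To illustrate the limitation, here is a small sketch. The dump records only the features that won a split, so a candidate that was sampled but never chosen leaves no trace in it. The JSON string below is hand-written in the general shape of one JSON tree dump (it is not real model output), and collect_split_features is a hypothetical helper name:

```python
import json

def collect_split_features(node):
    """Return the set of features used for splits in this subtree."""
    if 'children' not in node:
        return set()  # leaves carry no split feature
    used = {node['split']}
    for child in node['children']:
        used |= collect_split_features(child)
    return used

# Hand-written example tree, assumed to mirror the JSON dump layout.
tree_dump = '''
{"nodeid": 0, "split": "f29", "split_condition": 0.5, "yes": 1, "no": 2, "missing": 1,
 "children": [
   {"nodeid": 1, "split": "f56", "split_condition": 0.5, "yes": 3, "no": 4, "missing": 3,
    "children": [{"nodeid": 3, "leaf": -0.85}, {"nodeid": 4, "leaf": 0.85}]},
   {"nodeid": 2, "leaf": 0.97}
 ]}
'''

used = collect_split_features(json.loads(tree_dump))
# Only features that appear in a split are recoverable; the other
# candidates sampled for this tree cannot be inferred from the dump.
print(sorted(used))
```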