Saving instance counts in XGBoost model without breaking backward compatibility


#1

Many users have asked about saving instance counts in XGBoost models (e.g. #3419). Since XGBoost uses binary dumps for model exchange, adding extra fields to TreeModel or RTreeNodeStat would break backward compatibility (i.e. users of older XGBoost versions won’t be able to open model files produced by newer versions of XGBoost). For details, see this line and this line.

Let’s come up with a way to save instance counts while preserving backward compatibility.


TreeParam: is it safe to add new member variables?
#2

Here is my proposal: Use leaf_vector_ to store extra information.

  • Set reserved[0]=0x49436e74 (hex representing ‘ICnt’) in the TreeParam struct, to indicate that the first entry of each leaf_vector_ represents the instance count.
  • Change the type of leaf_vector_ from std::vector<bst_float> to std::vector<LeafVec> where LeafVec is a union defined as
union LeafVec {
  bst_float float_val;
  int32_t int_val;
};
  • Now save instance count to leaf_vector_[0].int_val in each node.
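The three steps above can be sketched roughly as follows. This is only an illustration, not the actual patch: TreeParamSketch is a hypothetical stand-in holding just the two TreeParam fields the proposal relies on, the helper names (HasInstanceCount, SetInstanceCount, GetInstanceCount) are made up, and the flat layout with size_leaf_vector entries per node is an assumption about how leaf_vector_ is indexed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using bst_float = float;  // stand-in for XGBoost's bst_float typedef

// Union from the proposal: each slot holds either a float or a 32-bit integer.
union LeafVec {
  bst_float float_val;
  std::int32_t int_val;
};

// Bytes 0x49 'I', 0x43 'C', 0x6e 'n', 0x74 't'.
constexpr int kInstanceCountTag = 0x49436e74;

// Hypothetical stand-in: only the TreeParam fields used by the proposal.
struct TreeParamSketch {
  int size_leaf_vector = 0;
  int reserved[31] = {};
};

// True if the model declares that the first leaf-vector entry is a count.
inline bool HasInstanceCount(const TreeParamSketch& param) {
  return param.size_leaf_vector >= 1 && param.reserved[0] == kInstanceCountTag;
}

// Index of the first leaf-vector entry of node `nid`
// (assumes a flat array with size_leaf_vector entries per node).
inline std::size_t FirstSlot(const TreeParamSketch& param, std::size_t nid) {
  return nid * static_cast<std::size_t>(param.size_leaf_vector);
}

inline void SetInstanceCount(const TreeParamSketch& param,
                             std::vector<LeafVec>* leaf_vector,
                             std::size_t nid, std::int32_t count) {
  assert(HasInstanceCount(param));
  const std::size_t idx = FirstSlot(param, nid);
  assert(idx < leaf_vector->size());
  (*leaf_vector)[idx].int_val = count;  // written and read as int_val only
}

inline std::int32_t GetInstanceCount(const TreeParamSketch& param,
                                     const std::vector<LeafVec>& leaf_vector,
                                     std::size_t nid) {
  assert(HasInstanceCount(param));
  const std::size_t idx = FirstSlot(param, nid);
  assert(idx < leaf_vector.size());
  return leaf_vector[idx].int_val;
}
```

A reader that does not find the tag in reserved[0] would simply keep treating every entry as a float, which is the point of tagging the model rather than changing the layout.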

#3

Yes, I think this is the fastest way. Does this change the size of it?


#4

Yes, but this is okay, since existing XGBoost versions already expect to see the fields TreeParam::reserved, TreeParam::size_leaf_vector, and TreeModel::leaf_vector_. Backward compatibility would be broken only if we added a new field that is unknown to existing XGBoost versions.
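To make the compatibility argument a bit more concrete: what grows is the number of entries stored per node, while each entry should keep the width of a bst_float. A minimal compile-time check, assuming bst_float is a 32-bit float as on mainstream platforms, could look like the sketch below. The union is repeated from the proposal so the snippet compiles on its own.

```cpp
#include <cstdint>
#include <type_traits>

using bst_float = float;  // stand-in for XGBoost's bst_float typedef

union LeafVec {
  bst_float float_val;
  std::int32_t int_val;
};

// Each entry must occupy exactly the bytes a bst_float did, so that a reader
// expecting std::vector<bst_float> still consumes the same number of bytes.
static_assert(sizeof(LeafVec) == sizeof(bst_float),
              "LeafVec must be the same size as bst_float");
static_assert(alignof(LeafVec) == alignof(bst_float),
              "LeafVec must have the same alignment as bst_float");
// Raw binary dumps copy the entries byte for byte.
static_assert(std::is_trivially_copyable<LeafVec>::value,
              "LeafVec must be safe to copy as raw bytes");
```

An older reader would still interpret the count's bit pattern as a float, so this only guarantees that the file parses, not that the value is meaningful to it.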


#5

OK, sounds perfect. We can only get the leaf nodes this way, but this is currently the best proposal!


#6

I do not know whether we could mark the field as transient?


#7

We can set leaf_vector_[0].int_val=-1.