Does "refresh" option update split thresholds of tree nodes?

ichi1294 · October 30, 2020, 2:19am

Hi,
I am using XGBoost for transfer learning. I expect both the leaf node values and the split thresholds of the non-leaf nodes to be adapted to a new set of data.

Specifically, I set the ‘process_type’ parameter to ‘update’ and ‘updater’ to ‘refresh’. From the manual I know that the leaf node update is controlled by ‘refresh_leaf’. I am just wondering: after setting ‘updater’=‘refresh’, will the split thresholds of the non-leaf nodes be retrained/updated as well? The manual says that the “node stats” will be updated, but I guess that refers not to the split thresholds but to statistics such as cover or gain?

It would be helpful if someone could let me know whether my understanding is right. And if the split thresholds are not updated, is there a reason for not doing so?

Thanks a lot!

hcho3 · October 30, 2020, 3:56am

No, split thresholds will not be updated. The reason is that XGBoost is a batch algorithm: it requires the entire training data to be present in order to determine the best set of thresholds and features in non-leaf nodes. By the time the model receives a new batch of data, the old batch is no longer in memory, so it is not possible to find new split thresholds that would account for both old and new data.

Zhang-Liao · December 25, 2020, 7:03am

@hcho3
Hello, sorry to trouble you. After reading your discussion, I am still a bit confused.

For a tree (A (B C D)), A is a non-leaf node, while B, C, and D are leaves.
Will “process_type: update” change the number of children of A?

Thanks in advance!

hcho3 · December 25, 2020, 7:30am

No modification is made to the tree structure. The ‘refresh’ updater only modifies the node statistics and, when ‘refresh_leaf’ is enabled, the output values of leaf nodes.