Decision tree in first order boosting and second order boosting

sergun · April 22, 2019, 4:56am

Hi!

Two theoretical questions to our comminity :

Question 1
I see two formulas used in context of Newton optimization method:

x=x_0 - f’(x_0)/f’’(x_0)
f(x)~f(x_0)+f’(x_0)(x-x_0)+1/2f’’(x-x_0)^2

From formula 2 (See original XGBoost article) we know that on each iteration of boosting RMSE with weights h_i and targets -g_i/h_i is optimized.
But direct application of formula 1 to idea of boosting just says “train weak regression model for target -g_i/h_i” which corresponds to usual RMSE optimization (not weighted!).

Are formula 1 and 2 (what is used in 2nd order boosting and XGBoost) different things?

Question 2
Is training tree on each iteration in 2nd order boosting (forget about regularization and other usuful features) the same as training regression tree in usual gradient boosting with samples x_i with weights h_i and targets -g_i/h_i?

It is easy to see that leaf weights are calculated this way - they are weighted means of values belong to each leaf.
But it seems to me that Gain in 2nd order boosting is not variance based as in normal regression decision tree.
Am I right? If yes, which tree construction strategy is better?

Thanks a lot!
Sergey.

hcho3 · April 22, 2019, 6:12am

Q1. The second-order optimization used in XGBoost is not really Newton optimization. Rather, the second-order expansion (formula 2) is a good surrogate of the real loss. Let me explain in more deails.

Let l(y,yhat) represent the loss function representing the “distance” between the true label y and the predicted label yhat. Then the value of total training loss at t-th iteration is as follows:

where f_t(x_i) is the t-th decision tree to be fit and yhat_i^(t-1) represent the sum of predictions given by the first (t-1) trees. It is infeasible to directly optimize the training loss, so we optimize the surrogate function instead:

where g_i and h_i are first and second partial derivatives of the loss function in the second argument:

As long as the loss function l(y,yhat) is convex and twice differentiable, optimizing the surrogate function also optimizes the total training loss. See https://arxiv.org/abs/1402.4419.

Q2. XGBoost allows for a variety of choice of the loss function l(y,yhat), as long as it is convex and twice differentiable.

sergun · April 22, 2019, 12:55pm

Thank a lot @hcho3 !!!

Q1 is clear now.

But Q2 still actual
From the article we know that on each iteration XGBoost optimizes weighted RMSE of -g_i/h_i with weights h_i.

My question is as follows. What if instead of using XGBoost in tree construction I will use DecisionTreeRegressor from sklearn with weights h_i? Will I get the same tree structure and leaf weights? (let’s forget about regularization consider pure idea of boosting)

As to weights the logic in XGBoost and DecisionTreeRegressor are the same (weighted mean):
-\frac{\sum g_i}{\sum h_i} (sorry I am new member and can not insert more than one picture

But probably tree structure will be constructed in different ways in XGBoost and DecisionTreeRegressor.
XGBoost uses:

in Gain.

And Gain in DecisionTreeRegressor is based on variance.

I tried to proof that these two gains aree the same but got they differ.

Any comments about these two methods of tree construction? (XGBoost splitting criteria vs variance)
Which of them is faster/provides better tree structure?

Sergey.

hcho3 · April 22, 2019, 2:14pm

I don’t follow you here. XGBoost optimizes
13%20AM
How is this an RMSE expression?

The answer is no. While both XGBoost and DecisionTreeRegressor produce CART (Classification and Regression Tree), they differ in how each split is chosen. According to https://scikit-learn.org/stable/modules/tree.html#mathematical-formulation, DecisionTreeRegressor uses the impurity criteria as follows:

On the other hand, XGBoost does not use impurity. XGBoost uses the first and second gradients as splitting criteria, namely g_i and h_i. (See my previous post for the definition of g_i and h_i):

The benefit of XGBoost is two-folds:

XGBoost is able to handle missing values natively, whereas scikit-learn requires you to impute (fill in) missing values
XGBoost supports a variety of tasks in addition to regression and classification, such as learning-to-rank, Poisson regression, and Cox survival analysis.

hcho3 · April 22, 2019, 2:17pm

@sergun You are not the first person to ask about the theory behind XGBoost. Would you find it useful if I put up my master’s thesis, which has the detailed pseudocode and derivation of XGBoost theory? Something like

sergun · April 22, 2019, 3:40pm

Hi @hcho3!

Thank you for reply!

But I’m still not happy with this question

When you need to optimize something like with a tree:

… the first idea appears in mind is just to train normal regression tree (like in sklearn) with weighted samples for targets -g_i/h_i.

Following this way you will use exctly the same leaf weights (you can check this) but other splitting criteria (not the same criteria as in XGBoost)

hcho3 · April 22, 2019, 4:03pm

Where does this expression come from? Is that equivalent to
13%20AM
? Note that your expression contains f_t(x_i), whereas my expression does not.

Here are the steps we use to derive the formula from the surrogate function:

hcho3 · April 22, 2019, 4:04pm

Also keep in mind that 1) scikit-learn does not have a provision for missing values and 2) XGBoost supports learning-to-rank and other tasks that scikit-learn does not support.

hcho3 · April 22, 2019, 4:10pm

I don’t think this is true, since XGBoost and DecisionTreeRegressor use different splitting criteria (see my previous post).

sergun · April 22, 2019, 5:52pm

@hcho3 please have a look at this:

On the left side is leaf weight formula from XGBoost, on the right - weighted mean of (-g_i/h_i).
So if tree structure is fixed decision tree in XGBoost and regression tree trained on weighted samples for targets (-g_i/h_i) have equal weights.

hcho3 · April 22, 2019, 7:21pm

@sergun Thanks for clarification. It is indeed possible to express the leaf weight as a weighted mean.

This is also true.

The question still remains, however, whether scikit-learn and XGBoost would use the same criteria for choosing among the split candidates (feature_id, threshold combination).

You may indeed be correct that XGBoost is equivalent to scikit-learn, when the loss function l(y,yhat) is set to L2 (l(y,yhat) = 0.5*(y-yhat)^2). My point still stands that XGBoost supports many loss functions other than L2.

Denisevi4 · May 14, 2019, 2:31pm

@hcho3, Philip I’d love to read your master thesis. Could you post it? Thanks!

hcho3 · September 27, 2021, 1:00am

@Denisevi4 Here it is: https://drive.google.com/file/d/0B0c0MbnP6Nn-eUNRRkVOOGpkbFk/view?usp=sharing&resourcekey=0-nVw3WhovKW5FPvPM5GPHfg.