Scale_pos_weight for regression?


#1

For unbalanced classification problems, one can typically set scale_pos_weight to the ratio of negative to positive instances. For regression problems, what is the recommended approach for setting scale_pos_weight?

Take housing price prediction as an example. The prices are usually skewed to one side. What is the recommendation for handling this type of problem? Thanks.


#2

For housing price prediction, you should transform the target (house price) using the logarithm function. That is, your tree ensemble model should predict log(house price).
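
As a rough sketch of what this could look like with the scikit-learn style XGBoost wrapper: the helper names (to_log_target, from_log_target, fit_and_predict) and the hyperparameter values below are made up for illustration, not recommended settings.

```python
import numpy as np


def to_log_target(prices):
    """Map prices to log(price), the scale the model is trained on."""
    return np.log(np.asarray(prices, dtype=float))


def from_log_target(log_preds):
    """Invert log(price) so predictions are back in price units."""
    return np.exp(np.asarray(log_preds, dtype=float))


def fit_and_predict(X_train, prices_train, X_test):
    # xgboost is imported here so the two helpers above work without it.
    import xgboost as xgb

    # Placeholder hyperparameters (assumption), tune them for your data.
    model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_train, to_log_target(prices_train))
    # The model predicts log(price); convert back before reporting errors.
    return from_log_target(model.predict(X_test))
```

Whether this helps depends on the metric: compare errors on the same scale (both in price units, or both in log units) before deciding.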


#3

I did try transforming the house prices to log scale. But it did not make any difference in terms of prediction accuracy.


#4

Can you describe this in more detail?


#5

Most house price data have a positively skewed distribution. It is also called a right-skewed distribution because the right tail extends further out than that of a bell curve.

On a probability density function (PDF) plot, normally distributed data would have the same mean, median, and mode. The mode is the value that occurs most often in the collection; on the PDF plot, it is the X value that corresponds to the highest Y value.

For a data set with a positive skew/right skew, you would have Mode < Median < Mean. I am wondering if XGBoost has any recommendation for this type of data.
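
A tiny check of that ordering, using only the standard library. The prices are made-up numbers (in thousands) chosen to have a long right tail, not real data.

```python
import statistics

# Made-up prices (in thousands) with a long right tail; illustrative only.
prices = [100, 120, 120, 120, 150, 180, 200, 250, 400, 900]

mode = statistics.mode(prices)      # most frequent value
median = statistics.median(prices)  # middle of the sorted values
mean = statistics.mean(prices)      # pulled upward by the expensive houses

print(mode, median, mean)
```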

In my experience, XGBoost does not benefit from data normalization or standardization. Applying a log transform to the data does not help either. Is this generally true?


#6

Have you looked at https://www.kaggle.com/erikbruin/house-prices-lasso-xgboost-and-a-detailed-eda/notebook? It claims that taking logarithm improves accuracy for right-skewed data.


#7

I took a quick look at the notebook. The author acknowledged that log-transformation would help linear algorithms but not tree-type algorithms. Please do a text search on that notebook for the keyword “Stephen Chu”, who asked him that question, to see the author's response. In fact, a number of posts indicate the same thing, i.e., that log-transformation does not help tree-type algorithms.


#8

Got it. Another suggestion I’d make is to over-sample expensive houses (since they are rare) by assigning higher instance weights.
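
A minimal sketch of one way to do this, assuming the scikit-learn style wrapper and its sample_weight argument to fit(). The helper names, the 90th-percentile cutoff, and the weight of 3.0 are arbitrary choices for illustration.

```python
import numpy as np


def price_weights(prices, quantile=0.9, boost=3.0):
    """Weight 1.0 for ordinary rows, `boost` for rows above the price quantile."""
    prices = np.asarray(prices, dtype=float)
    threshold = np.quantile(prices, quantile)
    return np.where(prices > threshold, boost, 1.0)


def fit_weighted(X, prices):
    # xgboost is imported here so price_weights works without it.
    import xgboost as xgb

    # Placeholder hyperparameters (assumption), tune them for your data.
    model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
    # Rows with larger weights contribute more to the training loss.
    model.fit(X, prices, sample_weight=price_weights(prices))
    return model
```

The cutoff and the weight are worth tuning on a validation set; a large weight can trade accuracy on ordinary houses for accuracy on expensive ones.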


#9

Thanks. Is it safe to conclude that scale_pos_weight does not help in the regression model? In the past, I changed scale_pos_weight for the classification model and it did help a lot.


#10

Yes, scale_pos_weight does not apply to regression tasks at all, since it assumes the existence of positive and negative classes. You can try to mimic its effect using instance weights.


#11

I think the log-transform recommendation can work well. I use this transformation routinely for skewed data (with outliers) and it seems to improve my results.

I am posting now because I became interested in RMSLE, only to discover that it is not supported in the latest release. But am I effectively using RMSLE when I routinely transform my labels by taking log(1 + L), where L is my untransformed label? I assume my prediction then yields log(1 + P), where P is the untransformed prediction.

If I use the RMSE metric, and XGBoost has no problem using this label to predict log(1 + P), then I am effectively using the RMSLE metric, aren’t I?
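
For reference, RMSLE is usually defined as sqrt(mean((log(1 + P) - log(1 + L))^2)), which is exactly RMSE computed on the log(1 + x) transformed values. A small numeric check of that identity, using made-up labels and predictions and hand-written rmse/rmsle helpers (not XGBoost functions):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x)."""
    n = len(y_true)
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )


# Made-up untransformed labels (L) and predictions (P); illustrative only.
labels = [100.0, 250.0, 900.0]
preds = [110.0, 240.0, 600.0]

# What the model sees and outputs when trained on log(1 + L).
log_labels = [math.log1p(v) for v in labels]
log_preds = [math.log1p(v) for v in preds]

# RMSE on the transformed scale equals RMSLE on the original scale.
assert math.isclose(rmse(log_labels, log_preds), rmsle(labels, preds))
```

The identity only holds if the RMSE is measured on the transformed scale; converting predictions back to price units first and then taking RMSE gives a different number.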

Any corrections of errors in my thinking would be greatly appreciated. Any deeper understanding of the benefits of RMSLE would also be appreciated. For example, I would be interested in any experience or theoretical understanding of how effective this metric is for dealing with outliers, and perhaps in how it compares to MAE in that respect.

Anyway, the advice seems to work well for me.

-Jim