XGBoost4J (Spark) with Weighted Loss Column


#1

Hi,

I’m trying my luck with the XGBoostRegressor / XGBoostClassifier objects in Spark, which take the “weight_col” parameter into account.
If I understand correctly, this value (which is not explained in the official parameter documentation) gives more weight to errors on rows with a large value in this column than to errors on rows with a small value, i.e. each row’s error is weighted by this column.

  1. Did I get it right? If so, would you mind if I add a paragraph (in a new PR) to the documentation, so that it is documented in XGBoost4J as well?
    If I didn’t get it right, could someone explain exactly how it works? (https://github.com/dmlc/xgboost/issues/3258 describes how to implement it, but not how it is used behind the scenes.)

  2. Is there a way to confirm that this really works (other than just calling xgboostRegressor.getWeightCol)? I did two runs, one with this flag and one without, and didn’t see any great effect on the error. How should I debug this myself?

Thank you in advance!
Daniel


#2

In the latest XGBoost, there is no parameter called “weight_col”. Instead, you should use setWeightCol(). I do agree that the XGBoost4J-Spark tutorial should include this API. For now, take a look at this snippet:
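
The snippet itself is not preserved in this thread. As a minimal sketch of the setter-based approach, assuming XGBoost4J-Spark is on the classpath and a training DataFrame (hypothetically named `trainDf`) that has a `features` vector column, a numeric `label` column, and the numeric `imps` column mentioned in this thread, it might look like this:

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressionModel, XGBoostRegressor}
import org.apache.spark.sql.DataFrame

object WeightColSketch {
  // `trainDf` is hypothetical: it is assumed to contain a "features" vector column,
  // a numeric "label" column, and a numeric, non-negative "imps" column.
  def train(trainDf: DataFrame): XGBoostRegressionModel = {
    val regressor = new XGBoostRegressor(Map[String, Any](
      "objective" -> "reg:logistic",
      "num_round" -> 100,
      "num_workers" -> 10
    ))
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setWeightCol("imps") // per-row instance weight, read from the DataFrame

    regressor.fit(trainDf)
  }
}
```

The column names and parameter values here are illustrative only; the weight column is passed by name, so no UDF is needed if the column already exists in the DataFrame.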


#3

Hi @hcho3, thank you very much for replying!
Two questions on the above:

  1. I made a minor change to my XGBoost object and added this feature as a flag, see below:

 val xgboostRegressor = new XGBoostRegressor(Map[String, Any](
  "num_round" -> 100,
  "num_workers" -> 10,  // num of instances * num of cores is the max.
  "objective" -> "reg:logistic",
  "eta" -> 0.1,
  "missing" -> -99.0, // missing - represents the value for missing values (NULL in my case)
  "gamma" -> 0.5,
  "max_depth" -> 6, 
  "early_stopping_rounds" -> 9,
  "seed" -> 1234,
  "lambda" -> 0.4,
  "alpha" -> 0.3,
  "colsample_bytree" -> 0.6,
  "subsample" -> 0.2,
  "weight_col" -> "imps"
  ))

Then, when I called the weightCol getter, it did return “imps”:
xgboostRegressor.getWeightCol -> output is “imps”.

Does this mean that setting it this way works? Or should I use the setter, setWeightCol, and pass the specific column I want from my existing DataFrame (i.e. without using a UDF)?

  2. Could you please share a few words about how this weight feature works? Then I’ll be able to write a description here, and we can take it to the official documentation once you’ve reviewed it, of course.

Looking forward!

Thank you Philip,
Daniel


#4

@hcho3 following up here :) Thanks!


#5

As I mentioned, there is no parameter called weight_col. The correct way is to use setWeightCol. Are you using the latest XGBoost version?


#6

Yes, 0.9.
Also, my follow-up was more about how the feature works. Could you add some info here?
Then I’ll be able to write something for the official documentation.

Thank you!


#7

Hi @hcho3, trying my luck again, and sorry to bother you. Could you kindly share a few words about how this feature works? Thank you!


#8

The weight feature lets you assign more significance to some data points relative to other data points when computing the objective function. This is useful when your dataset is imbalanced, e.g. when the positive class is only 5% of the training data.
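
To make this concrete, here is a minimal sketch in plain Scala (not XGBoost code) of a weighted squared-error objective, assuming each data point's loss term is simply multiplied by its weight. The object name, function name, and numbers are hypothetical and chosen only for illustration.

```scala
// Hypothetical helper for illustration; not part of the XGBoost API.
object WeightedLossSketch {
  /** Weighted mean squared error: each squared residual is scaled by its instance weight. */
  def weightedMse(labels: Seq[Double], preds: Seq[Double], weights: Seq[Double]): Double = {
    require(labels.size == preds.size && preds.size == weights.size,
      "labels, preds and weights must have equal length")
    require(weights.sum > 0.0, "weights must have a positive sum")
    val weightedSum = labels.indices.map { i =>
      val residual = preds(i) - labels(i)
      weights(i) * residual * residual
    }.sum
    weightedSum / weights.sum
  }

  def main(args: Array[String]): Unit = {
    // One rare positive example that the model predicts badly, three easy negatives.
    val labels = Seq(1.0, 0.0, 0.0, 0.0)
    val preds = Seq(0.2, 0.1, 0.1, 0.1)
    val uniform = Seq(1.0, 1.0, 1.0, 1.0)
    val upweighted = Seq(19.0, 1.0, 1.0, 1.0) // made-up weight for the rare positive

    println(s"uniform weights:    ${weightedMse(labels, preds, uniform)}")
    println(s"upweighted positive: ${weightedMse(labels, preds, upweighted)}")
  }
}
```

With the positive example upweighted, its large residual dominates the objective, so a model trained against this loss is pushed harder to fit that example.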


#9

Thank you @hcho3! One last question: if I’m predicting a value between 0 and 1 (not a classic classification), and some records have a value of 1000 in the weight column while others have, say, 1-2, and I would like to give the records with 1000 more weight, that is another good use case, right?


#10

Yes, you may have other reasons to assign bigger weights to some data points.
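
As a rough illustration of that use case, the sketch below (plain Scala, hypothetical names, made-up numbers) computes the constant prediction that minimizes weighted squared error, which is the weighted mean of the labels. It is only meant to show the direction of the effect, not how XGBoost builds its trees.

```scala
// Hypothetical helper for illustration; not part of the XGBoost API.
object WeightedMeanSketch {
  /** Constant c minimizing sum_i w_i * (c - y_i)^2, i.e. the weighted mean of the labels. */
  def bestConstant(labels: Seq[Double], weights: Seq[Double]): Double = {
    require(labels.size == weights.size, "labels and weights must have equal length")
    require(weights.forall(_ >= 0.0) && weights.sum > 0.0,
      "weights must be non-negative with a positive sum")
    labels.zip(weights).map { case (y, w) => y * w }.sum / weights.sum
  }

  def main(args: Array[String]): Unit = {
    val labels = Seq(0.9, 0.1, 0.2)      // targets between 0 and 1
    val weights = Seq(1000.0, 1.0, 2.0)  // one heavy record, two light ones
    val unweighted = bestConstant(labels, Seq.fill(labels.size)(1.0))
    val weighted = bestConstant(labels, weights)
    println(s"unweighted = $unweighted, weighted = $weighted")
  }
}
```

The weighted result sits much closer to the label of the record with weight 1000, which is the behavior asked about in the question above.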


#11

Thank you very much, @hcho3, for the detailed answers. Much appreciated!