Margin issue in distributed xgboost

Hi!

I’m trying to use a margin (initial prediction) with distributed training on xgboost4j-spark 0.72 in Scala and I’m running into two problems:

  1. An XGBoostEstimator trained on a dataset with margins does not use them when predicting.
  2. The values in the summary (XGBoostTrainingSummary) do not look right.

Here is the code I’m using:

import spark.implicits._
import ml.dmlc.xgboost4j.scala.spark.{XGBoost, XGBoostModel, XGBoostClassificationModel, XGBoostEstimator}
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions._

// Load the previously trained common model and the full training data
val xgbTop30 = XGBoostModel.load("<path>")
val top_30_full = spark.read.format("libsvm").load("<path>")

// UDFs: extract the second (positive-class) element from a vector column, and compute binary logloss
val secondValue = udf((v: DenseVector) => v.values(1))
val logloss = udf((label: Double, prediction: Double) => -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction)))

xgbTop30.transform(top_30_full).select(mean(logloss($"label", secondValue($"probabilities")))).show()

// Make transform() output the raw margin column and drop the prediction column
xgbTop30.asInstanceOf[XGBoostClassificationModel].setOutputMargin(true)
xgbTop30.asInstanceOf[XGBoostClassificationModel].setPredictionCol("")

val params = Map(
     "colsample_bytree" -> 0.95,
     "eta" -> 0.175,
     "gamma" -> 0.1,
     "max_depth" -> 2,
     "subsample" -> 0.95,
     "objective" -> "binary:logistic",
     "baseMarginCol" -> "margin",
     "num_round" -> 10,
     "tree_method" -> "exact",
     "useExternalMemory" -> true,
     "eval_metric" -> "logloss")
     
// Replace the vector-valued margin column with its positive-class component, to be used as the base margin
val train = xgbTop30.transform(top_30_full).withColumn("margin", secondValue($"margin"))

val estimator = new XGBoostEstimator(params)
val model = estimator.fit(train)

model.transform(train).select(mean(logloss($"label", secondValue($"probabilities")))).show()

model.asInstanceOf[XGBoostClassificationModel].setOutputMargin(true)
model.asInstanceOf[XGBoostClassificationModel].setPredictionCol("")

model.transform(train.withColumn("margin_base", $"margin"))
  .withColumn("total_margin", secondValue($"margin") + $"margin_base")
  .withColumn("prob", lit(1.0) / (lit(1.0) + exp(-$"total_margin")))
  .select(mean(logloss($"label", $"prob")))
  .show()

println(model.summary)

Output:

import spark.implicits._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostModel, XGBoostClassificationModel}
xgbTop30: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_4109c93129ec
top_30_full: org.apache.spark.sql.DataFrame = [label: double, features: vector]
secondValue: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7)))
logloss: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(DoubleType, DoubleType)))
+-----------------------------------+
|avg(UDF(label, UDF(probabilities)))|
+-----------------------------------+
|                 0.1540564135375269|
+-----------------------------------+

res862: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_4109c93129ec
res863: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_4109c93129ec
params: scala.collection.immutable.Map[String,Any] = Map(useExternalMemory -> true, subsample -> 0.95, max_depth -> 2, objective -> binary:logistic, eval_metric -> logloss, baseMarginCol -> margin, num_round -> 10, tree_method -> exact, eta -> 0.175, colsample_bytree -> 0.95, gamma -> 0.1)
train: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 1 more field]
estimator: ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator = XGBoostEstimator_5ed520eb30b5
model: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_eec7a64f881f
+-----------------------------------+
|avg(UDF(label, UDF(probabilities)))|
+-----------------------------------+
|                0.39942377874998813|
+-----------------------------------+

res865: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_eec7a64f881f
res866: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_eec7a64f881f
+---------------------+
|avg(UDF(label, prob))|
+---------------------+
|  0.13646968577919244|
+---------------------+

XGBoostTrainingSummary(trainObjectiveHistory=List(0.167606, 0.163014, 0.15941, 0.15659, 0.154381, 0.152585, 0.15122, 0.150104, 0.149162, 0.148387), testObjectiveHistory=None)

I’m porting this workflow over from a local Python training notebook, and the distributed results do not match it.

With the local Python version I get more reasonable results:

Initial logloss: 0.15405644

[0]	train-logloss:0.148704
[1]	train-logloss:0.144599
[2]	train-logloss:0.141473
[3]	train-logloss:0.139101
[4]	train-logloss:0.137302
[5]	train-logloss:0.135948
[6]	train-logloss:0.134918
[7]	train-logloss:0.134139
[8]	train-logloss:0.133527
[9]	train-logloss:0.133065

Notice that the logloss on the transformed dataset does not take the margins into account, and the results in XGBoostTrainingSummary are too high to be real.


Unfortunately, I can’t share the dataset.

Hello @Clamoris, could you describe what you are trying to do here? Are you transforming your labels through a logistic transform?

Did you verify that the transformation gives the correct results on both the distributed and the local side, i.e. that the training data are equivalent before training XGBoost?

Hi @thvasilo. I’m solving a binary classification problem on a dataset gathered from thousands of agents. I’ve already trained one common XGBoost model for all agents on it (xgbTop30 in the example above), and now I’m trying to train one model per agent using the initial prediction (margin) from the common model. Each per-agent model will be trained on the subset of events from one agent only. I expect to achieve better prediction quality on new data by combining the resulting models. The same concept works quite well with locally trained models, but it is hard to automate, so I’m looking into using distributed training.

The example code is here to illustrate the problem: the predictions of the second model are calculated without using the initial prediction of the first model, and the model.summary results do not make sense to me. By combining the margins from both models and applying a logistic transform to the sum I get more realistic results, but they do not match model.summary at all.
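To be explicit about what I mean by combining (a small sketch with made-up numbers, not real output): the second model is trained with the first model’s margin as its base margin, so the final probability should be the sigmoid of the summed margins.

// Hypothetical numbers, only to illustrate the identity I rely on
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

val commonMargin = 1.2    // example margin from the common model (xgbTop30)
val agentMargin  = -0.3   // example margin from the per-agent model
val combinedProb = sigmoid(commonMargin + agentMargin)  // what transform() should effectively return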

I’ve compared the results of the first, common model (xgbTop30) with the results of the same model on the same dataset computed locally (in both Python and Scala): both the predictions and the margins are identical. The logloss on the predictions of the local and the distributed model is also the same.

Hope you can help me.
Thanks.

@thvasilo I’ve made a small example on an open dataset to replicate the first issue:

dataset - https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt

import spark.implicits._
import ml.dmlc.xgboost4j.scala.spark.{XGBoost, XGBoostModel, XGBoostClassificationModel, XGBoostEstimator}
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions._

val params = Map(
     "max_depth" -> 2,
     "objective" -> "binary:logistic",
     "eval_metric" -> "logloss")
     
val sample_libsvm_data = spark.read.format("libsvm").load("/user/degunov/sample_libsvm_data.txt")

// Train the main model: 5 rounds on 1 worker
val main_model = XGBoost.trainWithDataFrame(sample_libsvm_data, params, 5, 1)

val secondValue = udf((v: DenseVector) => v.values(1))
val logloss = udf((label: Double, prediction: Double) => -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction)))

val manual_main_logloss = main_model.transform(sample_libsvm_data).select(mean(logloss($"label", secondValue($"probabilities")))).first()(0)
println(s"Main model manually calculated logloss: $manual_main_logloss")

val summary_main_logloss = main_model.summary.trainObjectiveHistory.last
println(s"Main model summary logloss: $summary_main_logloss")

main_model.asInstanceOf[XGBoostClassificationModel].setOutputMargin(true)
main_model.asInstanceOf[XGBoostClassificationModel].setPredictionCol("")

val auxiliary_params = Map(
     "max_depth" -> 2,
     "objective" -> "binary:logistic",
     "baseMarginCol" -> "margin",
     "num_round" -> 10,
     "eval_metric" -> "logloss")
 
val auxiliary_train = main_model.transform(sample_libsvm_data).withColumn("margin", secondValue($"margin"))

val estimator = new XGBoostEstimator(auxiliary_params)
val auxiliary_model = estimator.fit(auxiliary_train)

val manual_auxiliary_logloss = auxiliary_model.transform(auxiliary_train).select(mean(logloss($"label", secondValue($"probabilities")))).first()(0)
println(s"Auxiliary model manually calculated logloss: $manual_auxiliary_logloss")

val summary_auxiliary_logloss = auxiliary_model.summary.trainObjectiveHistory.last
println(s"Auxiliary model summary logloss: $summary_auxiliary_logloss")

auxiliary_model.asInstanceOf[XGBoostClassificationModel].setOutputMargin(true)
auxiliary_model.asInstanceOf[XGBoostClassificationModel].setPredictionCol("")

val logloss_from_margins = auxiliary_model.transform(auxiliary_train.withColumn("margin_base", $"margin"))
  .withColumn("total_margin", secondValue($"margin") + $"margin_base")
  .withColumn("prob", lit(1.0) / (lit(1.0) + exp(-$"total_margin")))
  .select(mean(logloss($"label", $"prob")))
  .first()(0)

println(s"Auxiliary model logloss calculated from margins: $logloss_from_margins")

Output:

Main model manually calculated logloss: 0.14111486039930413
Main model summary logloss: 0.141115
Auxiliary model manually calculated logloss: 0.13337850311155489
Auxiliary model summary logloss: 0.022973
Auxiliary model logloss calculated from margins: 0.022972981802751626

After some debugging I’ve determined that the second issue (the summary logloss being inconsistent with the actual logloss) depends heavily on dataset size and partition count. On small datasets both metrics are almost always equal, but on my datasets I’ve seen more than 10% deviation. Also, in such cases, the logloss on the first 1-2 rounds in the summary is worse than that of the main model, which is highly suspicious.
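The check I ran was roughly the following; it reuses the definitions from the sample above and trainWithDataFrame as in the first snippet, so treat it as a sketch rather than a full reproducer:

// Retrain the auxiliary model with different worker counts (and hence partitionings)
// and compare the summary logloss against a manually computed one.
for (nWorkers <- Seq(1, 2, 4)) {
  val m = XGBoost.trainWithDataFrame(auxiliary_train, auxiliary_params, 10, nWorkers)
  val summaryLogloss = m.summary.trainObjectiveHistory.last
  val manualLogloss = m.transform(auxiliary_train)
    .select(mean(logloss($"label", secondValue($"probabilities"))))
    .first()(0)
  println(s"nWorkers=$nWorkers summary=$summaryLogloss manual=$manualLogloss")
}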

Here is a small sample from my dataset, if you’d like to test it yourself: https://www.dropbox.com/s/rr2gesfq9m2dnmc/xgboost_margin_issue_dataset.libsvm?dl=0

Hi, have you solved the issue that the second model (trained with margins) does not use them for predicting?

Hi @gexu. No, I did not. I plan to revisit this issue in a couple of days though.

@Clamoris Have you figured out the issue yet?

I faced a similar issue while trying to use the groupCol param for training an L2R (learning-to-rank) model. In my case I figured out that I needed to pass the name of the param as group_col instead of groupCol: the library internally converts the underscore name to camel case.

I strongly believe this is the issue in your case as well. BTW, I am using version 0.90 of XGBoost.
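If the same naming convention applies to the margin parameter (I have only checked it for the group column, so this is an assumption), the params map from the examples above would become something like:

// Assumption: base_margin_col follows the same underscore-to-camelCase conversion
// that I observed for group_col -> groupCol.
val auxiliary_params = Map(
     "max_depth" -> 2,
     "objective" -> "binary:logistic",
     "base_margin_col" -> "margin",  // instead of "baseMarginCol"
     "num_round" -> 10,
     "eval_metric" -> "logloss")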