Hi!
I’m trying to use base margin (initial prediction) with distributed training in xgboost4j-spark 0.72 (Scala) and am running into two problems:
- An XGBoostEstimator trained on a dataset with margins does not use them when predicting.
- The values in the summary (XGBoostTrainingSummary) do not look right.
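For context, my understanding is that under binary:logistic the base margin should be added to the raw score of the newly boosted trees before the sigmoid is applied. A pure-Scala sketch of that arithmetic (all numbers hypothetical, no Spark involved):

```scala
// Pure Scala (no Spark): how a base margin is expected to combine with
// the booster's raw score under binary:logistic.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

val baseMargin = 0.8  // hypothetical raw score of the previous model
val treeMargin = -0.3 // hypothetical raw score of the new trees

// The predicted probability should come from the SUM of the two margins,
// not from the tree margin alone.
val prob = sigmoid(baseMargin + treeMargin)
```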
Here is the code I’m using:
import spark.implicits._
import org.apache.spark.sql.functions._ // needed for udf, mean, lit and exp below
import ml.dmlc.xgboost4j.scala.spark.{XGBoost, XGBoostModel, XGBoostClassificationModel, XGBoostEstimator}
import org.apache.spark.ml.linalg.DenseVector
var xgbTop30 = XGBoostModel.load("<path>")
val top_30_full = spark.read.format("libsvm").load("<path>")
val secondValue = udf((v: DenseVector) => v.values(1))
val logloss = udf((label: Double, prediction: Double) => -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction)))
xgbTop30.transform(top_30_full).select(mean(logloss($"label", secondValue($"probabilities")))).show()
xgbTop30.asInstanceOf[XGBoostClassificationModel].setOutputMargin(true)
xgbTop30.asInstanceOf[XGBoostClassificationModel].setPredictionCol("")
val params = Map(
  "colsample_bytree" -> 0.95,
  "eta" -> 0.175,
  "gamma" -> 0.1,
  "max_depth" -> 2,
  "subsample" -> 0.95,
  "objective" -> "binary:logistic",
  "baseMarginCol" -> "margin",
  "num_round" -> 10,
  "tree_method" -> "exact",
  "useExternalMemory" -> true,
  "eval_metric" -> "logloss")
val train = xgbTop30.transform(top_30_full).withColumn("margin", secondValue($"margin"))
val estimator = new XGBoostEstimator(params)
val model = estimator.fit(train)
model.transform(train).select(mean(logloss($"label", secondValue($"probabilities")))).show()
model.asInstanceOf[XGBoostClassificationModel].setOutputMargin(true)
model.asInstanceOf[XGBoostClassificationModel].setPredictionCol("")
model.transform(train.withColumn("margin_base", $"margin"))
  .withColumn("total_margin", secondValue($"margin") + $"margin_base")
  .withColumn("prob", lit(1.0) / (lit(1.0) + exp(-$"total_margin")))
  .select(mean(logloss($"label", $"prob")))
  .show()
println(model.summary)
Output:
import spark.implicits._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostModel, XGBoostClassificationModel}
xgbTop30: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_4109c93129ec
top_30_full: org.apache.spark.sql.DataFrame = [label: double, features: vector]
secondValue: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7)))
logloss: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(DoubleType, DoubleType)))
+-----------------------------------+
|avg(UDF(label, UDF(probabilities)))|
+-----------------------------------+
| 0.1540564135375269|
+-----------------------------------+
res862: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_4109c93129ec
res863: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_4109c93129ec
params: scala.collection.immutable.Map[String,Any] = Map(useExternalMemory -> true, subsample -> 0.95, max_depth -> 2, objective -> binary:logistic, eval_metric -> logloss, baseMarginCol -> margin, num_round -> 10, tree_method -> exact, eta -> 0.175, colsample_bytree -> 0.95, gamma -> 0.1)
train: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 1 more field]
estimator: ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator = XGBoostEstimator_5ed520eb30b5
model: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_eec7a64f881f
+-----------------------------------+
|avg(UDF(label, UDF(probabilities)))|
+-----------------------------------+
| 0.39942377874998813|
+-----------------------------------+
res865: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_eec7a64f881f
res866: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_eec7a64f881f
+---------------------+
|avg(UDF(label, prob))|
+---------------------+
| 0.13646968577919244|
+---------------------+
XGBoostTrainingSummary(trainObjectiveHistory=List(0.167606, 0.163014, 0.15941, 0.15659, 0.154381, 0.152585, 0.15122, 0.150104, 0.149162, 0.148387), testObjectiveHistory=None)
I’m porting this workflow from a local Python training notebook, and the distributed results do not match it.
The local Python version gives more reasonable results:
Initial logloss: 0.15405644
[0] train-logloss:0.148704
[1] train-logloss:0.144599
[2] train-logloss:0.141473
[3] train-logloss:0.139101
[4] train-logloss:0.137302
[5] train-logloss:0.135948
[6] train-logloss:0.134918
[7] train-logloss:0.134139
[8] train-logloss:0.133527
[9] train-logloss:0.133065
Notice that the logloss on the transformed dataset does not use the margins, and the results in XGBoostTrainingSummary are too high to be real.
Unfortunately, I can’t share the dataset.
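For reference, the manual check in the last transform above reduces to this standalone computation (hypothetical values, no Spark needed), which is how I expect the pieces to fit together:

```scala
// Standalone version of the manual margin check: rebuild the probability
// from base margin + new tree margin, then score it with binary logloss.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
def logloss(label: Double, p: Double): Double =
  -(label * math.log(p) + (1 - label) * math.log(1 - p))

val label = 1.0
val marginBase = 1.2 // hypothetical: second component of the old model's margin
val treeMargin = 0.4 // hypothetical: raw score of the newly trained trees

val prob = sigmoid(marginBase + treeMargin)
val ll = logloss(label, prob)
```

If the estimator actually consumed the base margin column, the new trees would only need to learn a correction on top of it, so the logloss computed this way should sit below the initial 0.154 figure rather than above it.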