Hi Philip, sure! Please see below.
First, a few notes:
I am using Databricks (which ships with XGBoost4J 0.9, Spark 2.4.3, and Scala 2.11).
To reproduce, one can use something like the following: a DataFrame with one or more categorical/string columns, passed through a FeatureHasher -
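For instance, a toy dataset shaped like that (the column names and values here are hypothetical placeholders, just to make the repro self-contained) -

```scala
// Plain-Scala stand-in for the input data; on Databricks this would become
// the DataFrame via spark.createDataFrame(rows).toDF(columns: _*).
val rows = Seq(
  ("a", "x", 1.0),
  ("b", "y", 2.0),
  ("a", "z", 3.0)
)
val columns = Seq("string_col1", "string_col2", "label")
```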
import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
.setInputCols("string_col1", "string_col2")
.setCategoricalCols(Array("string_col1", "string_col2"))
.setNumFeatures(100)
.setOutputCol("hashed_features")
val hashedDf = hasher.transform(myDataFrame)
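For intuition, FeatureHasher maps each categorical value to one of `numFeatures` buckets via hashing. A rough pure-Scala sketch of the idea (Spark uses MurmurHash3 internally too, but its exact seed and string encoding may differ, so these indices won't match Spark's) -

```scala
import scala.util.hashing.MurmurHash3

// Map a categorical value such as "string_col1=someValue" to a
// non-negative bucket index in [0, numFeatures).
def bucket(value: String, numFeatures: Int): Int = {
  val h = MurmurHash3.stringHash(value)
  ((h % numFeatures) + numFeatures) % numFeatures
}

val idx = bucket("string_col1=someValue", 100)
```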
Once you have a DataFrame that includes the “label” column and is ready for training, one can use a VectorAssembler, as below:
// Vector representation of the relevant columns: take all relevant columns (dropping irrelevant ones) and combine them into a single "features" column, which the XGBoost model requires.
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
.setInputCols(relevantModelCols)
.setOutputCol("features")
.setHandleInvalid("keep")
And then, an XGBoost model -
// Third pipeline phase
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

val xgboostRegressor = new XGBoostRegressor(Map[String, Any](
"num_round" -> 100,
"num_workers" -> 10, // num of instances * num of cores is the max.
"objective" -> "reg:squarederror",
"eta" -> 0.1,
"missing" -> -99.0, // the value that represents missing values (NULL in my case)
"gamma" -> 0.5,
"max_depth" -> 6,
"early_stopping_rounds" -> 9,
"seed" -> 1234,
"lambda" -> 0.4,
"alpha" -> 0.3,
"colsample_bytree" -> 0.6,
"subsample" -> 0.2
))
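One note on the `missing` parameter: setting it to -99.0 assumes NULLs in the training data were already replaced by that same sentinel (in Spark, e.g. via `hashedDf.na.fill(-99.0)`). A minimal pure-Scala sketch of the per-value logic, with hypothetical names -

```scala
// Replace an absent value with the sentinel that the regressor's
// "missing" parameter is configured to recognize (assumption here:
// -99.0, matching the Map passed to XGBoostRegressor above).
def fillMissing(value: Option[Double], sentinel: Double = -99.0): Double =
  value.getOrElse(sentinel)

val filled = Seq(Some(1.5), None, Some(3.0)).map(v => fillMissing(v))
```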
Finally, define a pipeline:
import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline()
.setStages(Array(assembler,
xgboostRegressor))
And then, train it:
// val trainedModel = pipeline.fit(train_to_test) - this succeeded
val trainedModel = pipeline.fit(trainUpdated)
The errors are below:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.0.234.22, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=10}
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
When I removed the UDT column (the output of the FeatureHasher), training worked well; however, those columns are essential for me.
Let me know if you need more details.