XGBoost4J (with Spark) 0.9 unable to train with udt column?


#1

Hi,

I am training XGBoost4j with a few categorical columns after hashing those, using Spark.
The output of this FeatureHasher (https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/ml/feature/FeatureHasher.html) is a udt column that contains those few relevant columns.

When I was using an older version (0.82), I was able to train it without any issues, but now I can’t. The training fails with a long stacktrace but without any concrete values.
When I am removing this column, I can do it, however this column is critical to me.

Can someone assist / did someone ever tackle this?

Thanks,
Daniel


#2

Can you post any script that will reproduce the issue?


#3

Hi Philip, sure!. Please see below.
First, few notes:
I am using Databricks (which contains XGBoost4J 0.9, Spark 2.4.3, Scala 2.11).

In order to reproduce, one can use something as below ->
A dataframe with some (let’s say 1) categorical / string column -

val hasher = new FeatureHasher()
  .setInputCols("string_col1", "string_col2")
  .setCategoricalCols(Array("string_col1", "string_col2"))
  .setNumFeatures(100)
  .setOutputCol("hashed_features")

val hashedDf = hasher.transform(myDataFrame)

Once you have a Dataframe that includes the “label” column and it is ready for train, one can use a vector assembler, such as below:

// Vector representation of relevant columns. Taking all of the relevant columns (removing irrelevant) and calling them all "features". This is a need for the XGBoost model.
val assembler = new VectorAssembler()
  .setInputCols(relevantModelCols)
  .setOutputCol("features")
  .setHandleInvalid("keep")

And then, an XGBoost model -

// Third pipeline phase
val xgboostRegressor = new XGBoostRegressor(Map[String, Any](
  "num_round" -> 100,
  "num_workers" -> 10,  // num of instances * num of cores is the max.
  "objective" -> "reg:squarederror",
  "eta" -> 0.1,
  "missing" -> -99.0, // missing - represents the value for missing values (NULL in my case)
  "gamma" -> 0.5,
  "max_depth" -> 6, 
  "early_stopping_rounds" -> 9,
  "seed" -> 1234,
  "lambda" -> 0.4,
  "alpha" -> 0.3,
  "colsample_bytree" -> 0.6,
  "subsample" -> 0.2
  ))

finally, define a pipeline:

val pipeline = new Pipeline()
      .setStages(Array(assembler, 
                       xgboostRegressor))

And then, train it:

// val trainedModel = pipeline.fit(train_to_test) - succeed
val trainedModel = pipeline.fit(trainUpdated)

Errors are below:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.0.234.22, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=10}
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)

When I removed the udt column (the output from the hashed), it worked well, however those columns are essential for me.

Let me know if you need more details.


#4

I’m not familiar with SparkML. What is UDT? Is it User Defined Type?


#5

Yes Philip, and thank you for replying and being proactive here. This is the output of the FeatureHasher column, which can contain one or more columns that were string (categorical) before, and now are hashed.


#6

Are you applying one-hot encoding here? Isn’t a hash considered categorical (not binary, not ordinal)?


#7

Hi Philip, no, not applying one hot encoding, as those features have high cardinality. Therefore using hashing. Do you have other tips for categorical features with high (1000+) cardinal values?


#8

Or, if not, can you reproduce the issue on your end? If you can not and it works fine for you, I will go and talk to Databricks as well about this, but if you can, then I think it’s a bug on your end. Wdyt?
And thank you very much, once again, for helping here.


#9

@Daniel8hen one thing that might be worthwhile to check is if adding this column results in the VectorAssembler changing the output “features” column from a DenseVector to a SparseVector, it decides which to output depending on how many zeros are in the input features so if you are adding something with high cardinality it is likely to hit this threshold. There was a change in the 0.90 Spark XGBoost version (https://github.com/dmlc/xgboost/pull/4349) to raise an exception if input features were a sparse vector and a non-vero missing value was specified (I see you specified -99). So it might that you are now hitting this exception.


#10

Interesting, I didn’t know you could use hashing to handle categorical data. Is the hash value an integer? And would XGBoost treat the hashed variable as ordinal?

I’m afraid I won’t be of much help here. However, @cpfarrell’s suggestion seems reasonable.


#11

Thank you Philip. Also, you can find info about hashing here - https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/ml/feature/FeatureHasher.html

@cpfarrell, first, thank you very much for your reply and the interesting insight here. If I got it right, this phase should be part of the VectorAssembler? Do you know how to exactly do it while this is the phase? or should it be part of the pipeline (changing from a DenseVector to a SparseVector)? This is indeed interesting.

Looking forward to solve this one with your great help! Cheers.


#12

Also, @cpfarrell, I wrote the documentation to handling null values so definitely that might to be the case… Interesting how things get correlated after some time…
I saw that you have edited the missing values section. @cpfarrell can we go on a quick call and sync about this? This would help me to fully understand the situation and make sure I’m totally sure with what I do.
Once I’ll have notes, I’ll write them here as a comment so other people won’t have to tackle this issue. Cheers!


#13

Sure, happy to chat.

But basically whether sparse or dense vector comes out of VectorAssembler is an internal detail of it. For an example

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1, Vectors.dense([0, 0, 0.5]), "a", 1.0)],
    ["id", "hour", "mobile", "userFeatures", "string", "clicked"])

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

assembler.transform(dataset).select("features").take(1)[0]['features']
# DenseVector([18.0, 1.0, 0.0, 0.0, 0.5])

Adding in a Hashed feature

from pyspark.ml.feature import FeatureHasher

hasher = FeatureHasher(
    inputCols=["string"],
    outputCol="hashed_feature"
)

hashed_assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures", "hashed_feature"],
    outputCol="features")

hashed_df = hasher.transform(dataset)

hashed_assembler.transform(hashed_df).select("features").take(1)[0]['features']
# SparseVector(262149, {0: 18.0, 1: 1.0, 4: 0.5, 24260: 1.0})

Converting your vector to a DenseVector would likely cause you memory issues as the feature hasher creates a very large number of missing features so densifying it would be very expensive


#14

Thanks @cpfarrell again! Where can we talk? Slack maybe? Email with hangouts?


#15

@cpfarrell, I know that it is based on the num of zeros (the decision to “proceed” with a Dense / Sparse vector by VectorAssembler mechanism).
When can we talk to settle this one? Thank you in advance!!


#16

Slack or hangouts would work. Let me know how can contact you on own of them and we can figure out something


#17

Please send me an email to daniel.hen@fyber.com and let’s take it from there. Thanks!


#18

So, after a discussion with Chris, below you can find a short summary about how one can handle this issue ->

This issue basically relies on if a dataset contains a lot of zeros (more zeros than non-zeros), then Spark’s VectorAssembler will return it as a Sparse Vector. This is most likely to happen in the case of usage of OneHotEncoder and FeatureHasher.

The new and current version (XGBoost4J 0.9) has a new assertion which raises an exception if input features were a sparse vector and a non-zero missing value was specified.

Therefore, in a case where there is a missing value flag in the xgboost model, which is not 0, one can handle it with few ways:

  1. Force the Vector

In general cases, I would recommend option 1 from the docs (https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values) first, but if using FeatureHasher then the memory issues will make it not possible. If not possible, one would need to find another way of forcing the VectorAssembler to keep zeros in the final vector.

Other ways can be found below:
Modify Dataset -> Modify XGBoost’s missing flag -> Train model

In case of zero being a meaningful value in a dataset, one should change it to be a different value (e.g. 0.00001)

Set XGBoost missing flag to be 0

Train model

What happens behind the scenes, is that anything which is not presented in the Sparse Vector is considered missing, plus the missing parameter.

Note: A sparse vector is a key-value representation of index-value within the vector, i.e. for [1, [2,5], [3.5, 4.5]] -> 1 is a sparse representation, 2 and 5 are the indices of the 3.5, 4.5 accordingly. The rest of the values are 0 therefore considered missing.

Mean Encoding to the Categorical Features

In a dataset, there might be categorical features (e.g. ID of some entity).

One can handle them by summing per each category its label column, and then divide by the number of times the category appears in the dataset. Example below for a data frame with 4 categorical columns and 1 numeric (label) column.