Why can we only specify missing value as 0.0 in Spark?


#1

I wanted to understand whether the XGBoost `missing` parameter can take any value other than 0.0. In the online documentation example, I could see that missing was specified as -999.
E.g.:

val booster = new XGBoostClassifier(
  Map(
    "missing" -> -999.0,
    "objective" -> "binary:logistic",
    "eta" -> 0.2,
    "max_depth" -> 4,
    "num_round" -> 200
  )
)

But when I try to specify the same in my model, it throws an error:

java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format.


#2

Can you check your XGBoost version? The example is only applicable to the 0.90 release.


#3

Hi! I am using the 0.90 release.
I am building my project with Gradle on IntelliJ…


#4

Make sure you are setting the right value for setHandleInvalid, as explained in https://xgboost.readthedocs.io/en/release_0.90/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values
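The relevant knob from that tutorial is `setHandleInvalid` on the assembler that builds the feature vector. A minimal sketch, assuming Spark ML is on the classpath; the column names "f1", "f2", and "features" are hypothetical:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// "keep" retains rows containing NaN instead of throwing, so the NaNs
// reach XGBoost4J-Spark and are treated as missing (NaN is the default
// value of the `missing` parameter).
// Column names here are placeholders, not from the original post.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
```

With "error" (the default) the assembler fails on invalid data, and with "skip" it silently drops those rows, so "keep" is the mode that lets XGBoost see the missing values.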


#5

I am using FeatureHasher since StringIndexer and VectorAssembler are time-consuming. Is there a way to use setHandleInvalid with the FeatureHasher function?


#6

It looks like the missing value example is not applicable in your use case, since FeatureHasher produces SparseVector output. This answers your question "Why can we only specify missing value as 0.0 in Spark?"
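To see why sparse input pins the missing value to 0.0, consider what a sparse vector actually stores. A minimal sketch in plain Scala (the `SimpleSparseVector` class below is hypothetical, not Spark's real `SparseVector`, but the storage idea is the same):

```scala
// Hypothetical minimal sparse vector: stores only the non-zero slots,
// like the SparseVector output that FeatureHasher produces.
case class SimpleSparseVector(size: Int, indices: Array[Int], values: Array[Double]) {
  // Any slot that is not explicitly stored reads back as 0.0.
  def apply(i: Int): Double = {
    val pos = indices.indexOf(i)
    if (pos >= 0) values(pos) else 0.0
  }
}

val v = SimpleSparseVector(size = 5, indices = Array(0, 3), values = Array(1.5, -2.0))
println(v(3)) // prints -2.0: explicitly stored
println(v(1)) // prints 0.0: never stored, indistinguishable from a true zero
```

Because an unstored slot and a literal 0.0 look identical in this format, XGBoost4J-Spark cannot honor a sentinel like -999.0 for sparse input; only 0.0 can safely mean "missing", which is exactly what the RuntimeException enforces.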


#7

@chenqin @CodingCat Have you used feature hasher with XGBoost4J-Spark?