[jvm-packages] v0.90 sparse vector prediction issue on missing values

Hi,

I use sparse vectors as the feature format for XGBoost4J-Spark. fit() threw “java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format”. After adding ‘missing’ = 0.0 to the parameters, fit() is now happy.
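
For anyone wondering why the check insists on 0.0: as far as I understand it, a sparse vector does not store its zero entries, so a zero and an absent value look the same once the data is sparse. Below is a toy sketch of that idea in plain Python. It is not the XGBoost4J-Spark code, and `to_sparse`, `is_missing`, `present_dense` and `present_sparse` are names made up for this illustration.

```python
import math

def to_sparse(dense):
    """Drop zero entries, as a sparse vector does: returns (size, {index: value})."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0.0}

def is_missing(value, missing):
    """True if `value` equals the configured missing marker (NaN-aware)."""
    return math.isnan(value) if math.isnan(missing) else value == missing

def present_dense(dense, missing):
    """Indices treated as present when reading a dense row."""
    return [i for i, v in enumerate(dense) if not is_missing(v, missing)]

def present_sparse(size, entries, missing):
    """Indices treated as present when reading a sparse row.

    Entries that were never stored cannot be reported as present.
    """
    return [i for i in sorted(entries) if not is_missing(entries[i], missing)]

row = [1.5, 0.0, 2.0]            # explicit zero at index 1
size, entries = to_sparse(row)   # the zero is no longer stored

nan = float("nan")
dense_with_nan = present_dense(row, nan)               # zero counts as a real value
sparse_with_nan = present_sparse(size, entries, nan)   # zero has vanished
dense_with_zero = present_dense(row, 0.0)              # zero counts as missing
sparse_with_zero = present_sparse(size, entries, 0.0)
```

In this sketch the dense and sparse views only agree when the missing marker is 0.0, which is my reading of what the error message is protecting against.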

transform() still threw “ERROR ml.dmlc.xgboost4j.java.DataBatch - java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format”. model.getMissing() returns NaN. Looking at the code, both XGBoostClassifier and XGBoostRegressor have setMissing(). However, XGBoostRegressionModel and XGBoostClassificationModel do not.

Did I miss anything? Is there a way to set missing in model to make transform() happy? Thank you!

Take a look at the tutorial on handling missing values: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values

@hcho3 Thanks! The tutorial helped solve the problem on the model training part by adding “missing 0.0” to parameters. I am still seeing the problem on the prediction part.

That’s weird, can you explicitly call setMissing() on XGBoostRegressor before calling fit()?

I am using XGBoostClassifier. xgboost_classifier.setMissing(0.0) helped fit(). However, model.transform() still failed with “java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format”. It looks like missing is set correctly in XGBoostClassifier, but it is not passed on to XGBoostClassificationModel.

This looks like a bug. I filed a report at https://github.com/dmlc/xgboost/issues/4530

you can use “xgbClassificationModel.set(xgbClassificationModel.missing, 0)”


Cool, we should add it to the tutorial

I actually should expose these methods to both classifier/regressor and models…

Thanks, that would be great!

I also encountered a similar problem. I already have a model file, and the model has no default missing value specified. When I load the model with Spark 2.4.7 in PySpark, I can't set missing successfully. This is my code:

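from pyspark.ml.wrapper import JavaWrapper
from pyspark.sql import DataFrame

# Load the saved booster through the JVM gateway
model_path = '***.model'
scala_xgb = spark.sparkContext._jvm.ml.dmlc.xgboost4j.scala.XGBoost
jbooster = scala_xgb.loadModel(model_path)

# Wrap the booster in a JVM-side XGBoostClassificationModel
N_CLASS = 2
xgb_cls_model = JavaWrapper._new_java_obj(
        "ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel",
        "xgbc", N_CLASS, jbooster)

# transform() runs on the Java DataFrame; wrap the result back into a PySpark DataFrame
jpred = xgb_cls_model.transform(test._jdf)

pred = DataFrame(jpred, spark)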

I tried xgb_cls_model.setMissing(np.nan) and it did not work. Do you have any suggestions, please?