Handling missing values in Spark


#1

Hi,

I would appreciate some clarification on the handling of missing values in Spark.

  1. There are two pages with contradictory guidelines:
    https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values
    https://xgboost.readthedocs.io/en/release_0.90/jvm/xgboost4j_spark_tutorial.html
    Is the second link still valid?
    Is the first link applicable to 0.90, or is it valid only for 1.0?
  2. In Option 2 of the first link:
    a) Has anyone been able to set the missing value successfully, in a way that does not affect model accuracy?
    b) Is this sentence correct: "an irregular value that is not 0, NaN, or Null and set the “missing” parameter to 0"? Or should it read "an irregular value that is not 0, NaN, or Null and set the “missing” parameter to the irregular value"?
  3. If I have zero values in my dataset and no NaN or Null values, what is the best approach? I tried replacing the zeros with a very small number, such as 1E-15, while the minimum value in my dataset is 1E-3. However, this still affects the accuracy.
  4. Can anyone provide sample code for Option 1 showing how to convert to a dense vector?
  5. Why is version 1.0 not on Maven yet?
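
To make question 3 concrete, the zero replacement I describe amounts to something like the minimal sketch below. It is plain Python and not the actual Spark job; the function name `replace_zeros`, the constant `SENTINEL`, and the sample rows are made up for illustration only.

```python
from typing import List

# Hypothetical placeholder value, far below the smallest real value (1E-3).
SENTINEL = 1e-15


def replace_zeros(row: List[float], sentinel: float = SENTINEL) -> List[float]:
    """Return a copy of `row` with exact zeros replaced by `sentinel`."""
    return [sentinel if x == 0.0 else x for x in row]


# Made-up sample rows; the real dataset is not shown here.
rows = [[0.0, 0.5, 1e-3], [2.0, 0.0, 0.0]]
cleaned = [replace_zeros(r) for r in rows]
```

After this step no feature value is exactly zero, so the question is whether such a placeholder is a reasonable way to keep real zeros from being treated as missing.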