Load a model saved in XGBoost4J from Python and predict on new data

I have the following use case:

  1. Train in XGBoost4J (on Spark) to take advantage of parallel training.
  2. Save the model.
  3. Load it in Python.
  4. Predict on new data in Python.

For training, I use a Spark DataFrame.
For prediction in Python, what input format should I use? I assumed a numpy array would be the right choice, but I get errors such as:
'numpy.ndarray' object has no attribute 'feature_names'

Even for a single record, what is the best practice for preparing the data to make a prediction in Python?

I am asking because I think this is a use case that would help others, and if needed I can expand the documentation here: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#interact-with-other-bindings-of-xgboost

@hcho3, or anyone else, could you kindly assist?


@hcho3 here is what I did:

  1. Once training was done, I saved the model.
  2. I pulled the data from the Spark DataFrame into pandas and then converted it to a DMatrix (I don't think there's a way to do this directly).
  3. I predicted using the DMatrix and got a prediction.
  4. When serving through another service (e.g. SageMaker), I see that an ndarray does work.
    Anyway, is there a best practice for this?


Yes, that’s probably what you should do.