[Solved] How to include XGBoost-Spark in a dockerized Spark environment

My goal is to include xgboost4j-spark in a dockerized Spark/PySpark environment, so that I can develop an integration between PySpark and the XGBoost-Spark version (I know there is a prototype online that integrates PySpark with XGBoost-Spark, but I need to develop some additional functions).

The dockerized Spark/PySpark environment I use is from the link below (though I changed it a bit in order to include the XGBoost .jar files in the Spark config, see below):

The advantage of this dockerized environment is that it lets me use Scala in a Jupyter notebook, instead of Scala in a shell.
Some key env parameters:

What I did was download the compiled XGBoost fat jars, version 1.0.0 for Scala 2.12, from [maven central], and copy the .jar files (i.e. xgboost4j_2.12-1.0.0.jar and xgboost4j-spark_2.12-1.0.0.jar) into the Spark config:

# Spark and Mesos config
ENV SPARK_HOME=/usr/local/spark
COPY ./src/xgboost4j_2.12-1.0.0.jar $SPARK_HOME/xgboost4j_2.12-1.0.0.jar
COPY ./src/xgboost4j-spark_2.12-1.0.0.jar $SPARK_HOME/xgboost4j-spark_2.12-1.0.0.jar
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip \
    MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so \
    SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --jars=$SPARK_HOME/xgboost4j_2.12-1.0.0.jar,$SPARK_HOME/xgboost4j-spark_2.12-1.0.0.jar --driver-class-path=$SPARK_HOME/*.jar --conf spark.executor.extraClassPath=$SPARK_HOME/*.jar"
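
Before calling anything from XGBoost, it is worth confirming that the jars actually ended up on the driver classpath. Here is a minimal check that can be run from the Scala notebook kernel (the helper name is mine, and the class name assumes xgboost4j-spark 1.0.0's package layout):

```scala
// classOnClasspath: returns true if the named class can be loaded by the
// current classloader -- a quick way to confirm the COPY'd jars were picked up.
def classOnClasspath(name: String): Boolean =
  try { Class.forName(name); true }
  catch { case _: ClassNotFoundException => false }

// Class name as published in xgboost4j-spark 1.0.0.
println(classOnClasspath("ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier"))
```

If this prints `false`, the `--jars` / `extraClassPath` settings above were not applied to the kernel's JVM.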

I can open a Jupyter notebook with Scala running in it. I can also run Spark from this Jupyter notebook, and I can import xgboost without any error being reported.

Call for help: the problem is that when I actually call XGBoostClassifier, I get a java.lang.NoSuchMethodError:

(P.S. Since I am new to the community, I can only attach 2 links in one post, so more detailed info is attached in my reply.)

Thank you so much for your help! Any advice is appreciated!

Since I am new to the XGBoost community, I am only allowed to put 2 links in one post.
Here are additional links for reference:
The XGBoost-Spark jar file link from Maven Central :slight_smile:
compiled XGBoost fat jar, version 1.0.0 for Scala 2.12, from Maven Central

I uploaded my Jupyter notebook here to show the error more clearly.

Hi All,
I solved this problem. The key is to ensure all versions are aligned (ref: https://github.com/dmlc/xgboost/issues/4399#issuecomment-492887898).
My Docker image uses Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252) and SPARK_VERSION -> 2.4.5, so the fat jar files should be for Scala 2.11, that is, xgboost4j-spark_2.11. All other Spark-related jars should also be for Scala 2.11.
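
One way to avoid this mismatch in the future is to derive the required artifact suffix from the Scala version the notebook kernel actually runs, instead of hard-coding it. A small sketch (the helper name is my own):

```scala
// scalaBinarySuffix: derive the "_2.xx" artifact suffix from a full Scala
// version string (e.g. "2.11.12" -> "_2.11"). This is the suffix the
// xgboost4j / xgboost4j-spark jars must carry to match the Spark build.
def scalaBinarySuffix(fullVersion: String): String =
  "_" + fullVersion.split("\\.").take(2).mkString(".")

// In the notebook this reports the suffix for the running kernel; for the
// Docker image above (Scala 2.11.12) it yields "_2.11".
println(scalaBinarySuffix(scala.util.Properties.versionNumberString))
```

Picking the jars whose names end in this suffix (and checking it against `spark.version`'s documented Scala build) avoids the NoSuchMethodError entirely.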