Running XGBoost4j-Spark on Cloudera Data Science Workbench

I’m trying to run XGBoost4j-Spark on Cloudera Data Science Workbench (CDSW).

When I run the tutorial code for XGBoost4j-Spark in CDSW, the Spark session crashes. It looks like the XGBoost tracker isn’t reachable from the executors on the data nodes: the tracker’s IP isn’t resolvable because the CDSW Docker container sits on a separate network, specific to CDSW.

Error:
```
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=100.66.0.29, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=2}
java.net.ConnectException: Connection refused (Connection refused)
       at java.net.PlainSocketImpl.socketConnect(Native Method)
       at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
       at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
       at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
       at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
```

Based on the CDSW docs, it seems like this should work if I can get the tracker to run on a specific IP and port, but I can’t for the life of me figure out how to set DMLC_TRACKER_PORT or the other parameters I’d need to make the tracker visible to the executor nodes.
Has anyone here been able to get this working on CDSW or CDH?

AFAIK the tracker gets set here, and it’s possible that value gets pulled from an environment variable when the tracker is launched.
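For reference, XGBoost4j-Spark training options ultimately flow through a plain Scala params map on the driver side, and some versions expose tracker settings there via a `tracker_conf` key (`ml.dmlc.xgboost4j.scala.spark.TrackerConf`). A sketch of what that map might look like — the `tracker_conf` key and its constructor arguments are version-dependent assumptions, so check the API of the XGBoost4j-Spark release you’re on:

```scala
// Sketch only: XGBoost4j-Spark takes its training options as a Scala Map.
// The tracker-related key shown in the comment is a version-dependent
// assumption, not a guaranteed API.
val xgbParams: Map[String, Any] = Map(
  "eta"         -> 0.1,
  "max_depth"   -> 6,
  "objective"   -> "binary:logistic",
  "num_round"   -> 100,
  "num_workers" -> 2
  // Some versions accept tracker settings here, e.g.:
  // "tracker_conf" -> TrackerConf(60 * 1000L, "scala")
)

println(xgbParams("num_workers"))
```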

To set an environment variable when you launch a cluster, you can use something like:

```
deps/tracker/dmlc-submit --cluster yarn --num-workers 4 --env S3_REGION=us-west-2
```

Here I’m setting the S3_REGION environment variable, but you should be able to do the same with DMLC_TRACKER_PORT.
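If the tracker does read these variables at launch, the analogous invocation would be something like the following. This is untested: whether the tracker honors `DMLC_TRACKER_URI`/`DMLC_TRACKER_PORT` passed this way is an assumption, and the hostname is a placeholder for whatever address the YARN executors can actually resolve:

```
deps/tracker/dmlc-submit --cluster yarn --num-workers 2 \
  --env DMLC_TRACKER_PORT=9091 \
  --env DMLC_TRACKER_URI=<host reachable from the YARN nodes>
```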