XGBoost-Spark - train multiple models in the same cluster in parallel

ran.haim · July 25, 2019, 12:59pm

I am trying to train multiple models on the same cluster, using a thread pool on the driver side.
This creates multiple Spark jobs for each model, as expected.
The application often fails or hangs, and the cause seems to be related to how the XGBoost library uses Rabit.

Am I doing something wrong?
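
For readers following along, the driver-side pattern described above can be sketched roughly as follows. This is only an illustration, not the code from the failing application: `train_model`, `train_all`, the model names, and the parameter values are all hypothetical, and `train_model` is a stand-in for wherever the real XGBoost-on-Spark training call would go.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def train_model(name, params):
    """Hypothetical stand-in for one XGBoost-on-Spark training call.

    In a real application this is where the estimator's fit() would run
    and launch its own Spark job(s); here it only echoes its inputs.
    """
    return {"model": name, "params": params}


def train_all(configs, max_threads=4):
    """Submit one training task per configuration from driver-side threads."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Map each future back to the model name it belongs to.
        futures = {pool.submit(train_model, n, p): n for n, p in configs.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                # Record the failure per model instead of aborting the batch.
                failures[name] = exc
    return results, failures


# Hypothetical configurations, one per model.
configs = {
    "model_a": {"max_depth": 4},
    "model_b": {"max_depth": 6},
}
results, failures = train_all(configs)
```

Collecting exceptions per model, rather than letting the first one propagate, makes it easier to tell which of the parallel runs failed.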

trams · July 31, 2019, 6:15am

Did training multiple jobs ever succeed?

Could you provide some logs from the driver and an example of the job code that is giving you trouble?

As far as I can tell from reading the code, what you want should be feasible.

ran.haim · August 12, 2019, 1:47pm

Some succeeded, some failed, and some hung forever.
I was able to work around this by not setting numWorkers higher than 1; a higher value seems to be the source of the problem.

Is there something specific you would like to see from the driver log?
I can reproduce it quite easily and give you the logs.