Xgboost-spark - train multiple models in the same cluster in parallel


#1

I am trying to train multiple models on the same cluster, using a thread pool on the driver side.
This creates multiple Spark jobs for each model, as expected.
The application often fails or hangs, and it seems to have something to do with Rabit usage in the XGBoost library.
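
For readers unfamiliar with the pattern, here is a minimal sketch of a driver-side thread pool that submits one training call per model. It is only an illustration, not the code from this application: `train_model`, `train_all`, the model names, and the `num_workers` key are all hypothetical, and `train_model` is a stand-in for whatever blocking XGBoost-on-Spark training call is actually used.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def train_model(name, params):
    """Hypothetical stand-in for a blocking training call.

    In a real application this would launch the Spark job(s) for one model
    and return only when training finishes or fails.
    """
    return {"name": name, "num_workers": params["num_workers"]}


def train_all(configs, max_threads=4):
    """Submit one training task per model and collect results and failures."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Map each future back to its model name so errors can be attributed.
        futures = {
            pool.submit(train_model, name, params): name
            for name, params in configs.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the other models.
                failures[name] = exc
    return results, failures


# Hypothetical per-model settings; "num_workers" is an illustrative key.
configs = {f"model_{i}": {"num_workers": 1} for i in range(3)}
results, failures = train_all(configs)
```

Collecting failures per model makes it easier to see which of the parallel jobs failed. A job that hangs will still block its own thread, so this sketch does not by itself address the hanging behaviour.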

Am I doing something wrong?


#2

Did training multiple jobs ever succeed?

Could you provide some logs from the driver and an example of the job code that gave you trouble?

As far as I can tell from reading the code, what you want should be feasible.


#3

Some succeeded, some failed, some were stuck forever.
I was able to work around this by not setting numWorkers higher than 1, which seems to be the source of the problem.

Is there something specific you would like to see from the driver log?
I can reproduce it quite easily and give you the logs.