XGBoost on YARN foreachPartition error

Hi,

I’m trying to run XGBoost on a fairly large dataset (~500 GB) on YARN. Stage 0 (repartition) completes successfully, but I keep getting the error below when stage 1 (foreachPartition) runs, and the job then keeps retrying stage 0.

Stage Screenshot:

I’m wondering whether this is because the program doesn’t get enough resources from the cluster?

The program ran fine on a sample set. Can someone take a look at it?
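For context, the original post doesn’t include the training code, but a distributed XGBoost4J-Spark job of this shape typically looks roughly like the sketch below (the paths, column names, and parameter values here are illustrative assumptions, not taken from the failing job). The num_workers parameter drives the stage 0 repartition, and training then runs inside the per-partition stage (the foreachPartition stage in the screenshot) that keeps failing:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xgboost-on-yarn").getOrCreate()

// Hypothetical input: a DataFrame with a vector "features" column and a "label" column.
val train = spark.read.parquet("hdfs:///path/to/training/data")

val classifier = new XGBoostClassifier(Map(
  "objective"   -> "binary:logistic",
  "eta"         -> 0.1,
  "max_depth"   -> 6,
  "num_round"   -> 100,
  "num_workers" -> 50   // data is repartitioned to this many partitions (stage 0)
)).setFeaturesCol("features").setLabelCol("label")

// Each of the num_workers partitions then trains on its share of the data
// (the per-partition training stage).
val model = classifier.fit(train)
model.write.overwrite().save("hdfs:///path/to/model")
```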

Thanks!!

I had the same error, but have no clue yet. See [java-scala] Is XGBoost-Spark training thread-safe?
I’d really appreciate it if somebody could provide some info!

Hi jacob,

I solved my problem by giving each task more memory. You can try either increasing the memory per executor and/or the executor memory overhead, or decreasing the number of tasks that run in parallel.
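As a minimal sketch of what that could look like (the values are only illustrative assumptions; tune them to your cluster and data size):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only -- adjust to your cluster and data size.
val spark = SparkSession.builder()
  .appName("xgboost-on-yarn")
  // More JVM heap per executor, so each training task has more headroom.
  .config("spark.executor.memory", "16g")
  // Extra off-heap/native memory per executor; XGBoost's native code uses
  // memory outside the JVM heap, and YARN kills containers that exceed
  // heap + overhead.
  .config("spark.executor.memoryOverhead", "6g")
  // Fewer cores per executor means fewer tasks running in parallel on it,
  // so each concurrent task gets a larger share of the executor's memory.
  .config("spark.executor.cores", "2")
  .getOrCreate()
```

The same settings can also be passed with --conf on spark-submit. Note that on Spark versions before 2.3 the overhead setting is spark.yarn.executor.memoryOverhead rather than spark.executor.memoryOverhead.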

Jackie.