I use Scala in Azure Databricks with the following setup:
- 5x worker nodes (28.0 GB memory, 8 cores, 1.5 DBU each)
- 1x driver (14.0 GB memory, 4 cores, 0.75 DBU)
- Databricks Runtime 5.0 ML Beta (includes Apache Spark 2.4.0, Scala 2.11)
I have a Spark DataFrame with 760k rows and two columns:
- label
- features
I want to use XGBoost on my DataFrame to train a regression model:
```scala
val params = Map(
  "objective"   -> "reg:linear",
  "max_depth"   -> 6,
  "eval_metric" -> "rmse"
)

val model = new XGBoostRegressor(params)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setTreeMethod("approx")
  .setNumRound(20)
  .setNumEarlyStoppingRounds(3)
  .setUseExternalMemory(true)
  .setMaxDepth(6)
  .setNumWorkers(10)

val trainedModel = model.fit(trainSample)
```
After launching the training, I get the following error:

```
SIGSEGV (0xb) at pc=0x00007f62a9d33e0e, pid=3954,
```
What I’ve tried so far:
When I set `numWorkers` to 1, the training starts, but it obviously runs really slowly, which I believe is not the way it should be used.
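For scale, here is a back-of-the-envelope check of how many rows each XGBoost worker would hold at different `numWorkers` settings (plain Scala; the even-split across workers is my assumption, not something the docs guarantee):

```scala
// Rough estimate of rows per XGBoost worker for this 760k-row dataset,
// assuming Spark spreads rows evenly across workers (my assumption).
val totalRows = 760000

def rowsPerWorker(numWorkers: Int): Int =
  (totalRows + numWorkers - 1) / numWorkers // ceiling division

println(rowsPerWorker(10)) // 76000 rows per worker with numWorkers = 10
println(rowsPerWorker(1))  // 760000 rows all on the single worker
```

So at `numWorkers = 10` each worker only needs to hold roughly 76k rows, which makes the segfault at that setting all the more surprising to me.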
Neither the external-memory tutorial (https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html) nor the Databricks XGBoost page (https://docs.databricks.com/spark/latest/mllib/third-party-libraries.html#xgboost) helps with my case.
My questions are:
- Is it possible to run XGBoost on a Dataset that is bigger than the memory of each individual worker? (I assume the answer is yes, but correct me if I'm wrong.)
- How do I use external memory properly, so that XGBoost can still train when I take an even bigger dataset?
- Does the partitioning of the input DataFrame impact the training somehow?
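For the last question, the experiment I have in mind is a sketch based on my (unconfirmed) assumption that XGBoost4J-Spark works best with one input partition per worker, so I repartition explicitly before calling `fit`. It reuses the same `trainSample` DataFrame and parameters as above:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

// Sketch: explicitly align DataFrame partitions with XGBoost workers.
// Assumes `trainSample` is the 760k-row DataFrame from above.
val numWorkers = 10

// One partition per worker, so that (under my assumption about how Spark
// tasks map to XGBoost workers) no worker is idle or doubly loaded.
val repartitioned = trainSample.repartition(numWorkers)

val model = new XGBoostRegressor(Map(
  "objective"   -> "reg:linear",
  "max_depth"   -> 6,
  "eval_metric" -> "rmse"
))
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setTreeMethod("approx")
  .setNumRound(20)
  .setNumWorkers(numWorkers)

val trainedModel = model.fit(repartitioned)
```

I don't know whether this actually changes how XGBoost distributes the training, which is exactly what my third question is about.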