On our Spark (2.2) cluster we are working with XGBoost 0.8. We use two XGBoost APIs: the Scala/Spark one (trainWithDataFrame) and the plain Scala one (fit on a DMatrix). Our main performance metrics are precision and recall. During hyperparameter tuning I compared model performance while switching the tree_method parameter between exact and approx, and found that the Spark and Scala models behave differently:
- The Spark models trained with approx and exact are identical.
- The performance of the XGBoost/Scala model differs drastically between exact and approx, with exact performing worse on the validation set. For example, at the same recall, precision drops from 30% to 20%, which is huge in our context.
- The Spark and Scala models perform very similarly with tree_method=approx.
- The Spark and Scala models perform differently with tree_method=exact.
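To make the comparison concrete, this is a sketch of the parameter maps used for the runs above. The hyperparameter values (eta, max_depth, objective) are placeholders for illustration, not our actual tuned values; tree_method is the only key varied between runs, and the same maps are passed to both APIs.

```scala
// Hedged sketch: shared base parameters, with only tree_method varied.
// Values are illustrative placeholders, not our production settings.
val baseParams: Map[String, Any] = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic"
)

val exactParams  = baseParams + ("tree_method" -> "exact")
val approxParams = baseParams + ("tree_method" -> "approx")

// Each map is then handed unchanged to both APIs, e.g.
//   ml.dmlc.xgboost4j.scala.spark.XGBoost.trainWithDataFrame(df, exactParams, ...)
//   ml.dmlc.xgboost4j.scala.XGBoost.train(trainDMatrix, exactParams, ...)
// (exact signatures depend on the XGBoost version in use)
```

Since everything except tree_method is held fixed, any performance gap between runs should be attributable to that single parameter.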
XGBoost/Spark appears not to support the exact method: even when I set tree_method to exact, it seems to use approx without telling me it is doing so. Is that right?
Why is the Scala model with tree_method=exact worse than with approx? (NB: there is not much difference in computation time between the two methods.)
Is it possible that the model overfits on the exact split points, so that using approx actually helps prevent overfitting?
Is it true that the Spark model uses ONLY approx?
This is a follow-up to my earlier question, which was more specific and remained unanswered. I kept digging and finally found that the difference comes from the tree_method parameter: the Spark model always uses approx.