On our Spark (2.2) cluster we are working with XGBoost 0.8. We use two XGBoost APIs: the Scala/Spark one (trainWithDataFrame) and the plain Scala one (fit on a DMatrix). Our main performance metrics are precision and recall. During hyperparameter tuning I compared model performance while switching the tree_method parameter between exact and approx, and found that the Spark and Scala models behave differently:
- The Spark models trained with approx and exact are identical.
- The performance of the XGBoost/Scala model differs drastically between exact and approx, with exact performing worse on the validation set. For example, at the same recall, precision drops from 30% to 20%, which is huge in our context.
- The Spark and Scala models perform very similarly with tree_method=approx.
- The Spark and Scala models perform differently with tree_method=exact.
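To make the comparison concrete, this is a sketch of the parameter maps used for the runs above. The hyperparameter values (eta, max_depth, objective) are placeholders for illustration, not our actual tuned values; tree_method is the only key varied between runs, and the same maps are passed to both APIs.

```scala
// Hedged sketch: shared base parameters, with only tree_method varied.
// Values are illustrative placeholders, not our production settings.
val baseParams: Map[String, Any] = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic"
)

val exactParams  = baseParams + ("tree_method" -> "exact")
val approxParams = baseParams + ("tree_method" -> "approx")

// Each map is then handed unchanged to both APIs, e.g.
//   ml.dmlc.xgboost4j.scala.spark.XGBoost.trainWithDataFrame(df, exactParams, ...)
//   ml.dmlc.xgboost4j.scala.XGBoost.train(trainDMatrix, exactParams, ...)
// (exact signatures depend on the XGBoost version in use)
```

Since everything except tree_method is held fixed, any performance gap between runs should be attributable to that single parameter.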
XGBoost/Spark appears not to support the exact method: even when I set tree_method to exact, it seems to use approx without telling me it is doing so. Is that right?
Why is the Scala model with tree_method=exact worse than with approx? (NB: there is not much difference in computation time between the two methods.)
Is it possible that the model overfits on the exact split points, so that using approx actually helps prevent overfitting?
Is it true that the Spark model uses ONLY approx?
This is a follow-up to my earlier question, which was more specific and remained unanswered. I kept digging and finally found that the difference comes from the tree_method parameter: the Spark model always uses approx.