[jvm-packages][spark] Not filtering columns before fitting is causing failure

Hello,

This is anecdotal at the moment and I'll try to put together a reproducible example, but I have run into model training failures in XGBoost4J-Spark more than once when I call fit without first filtering the DataFrame down to only the relevant columns, "features" and "label".

E.g.,

xgbRegressor.fit(train)                             // fails
xgbRegressor.fit(train.select("features", "label")) // works

This happens especially when working with huge datasets that have a significant number of additional columns besides "features" and "label". Just wanted to share and see if anyone else has experienced this.
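
For reference, here is a minimal sketch of the pattern I am describing, assuming a DataFrame `train` that already contains a vector column "features" and a numeric column "label" alongside many unrelated columns; the booster parameters below are just placeholders and not specific to the failure:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

// Placeholder booster parameters; the specific values don't seem to matter for this issue.
val xgbRegressor = new XGBoostRegressor(Map(
    "objective"   -> "reg:squarederror",
    "num_round"   -> 100,
    "num_workers" -> 2
  ))
  .setFeaturesCol("features")
  .setLabelCol("label")

// Fitting on the full DataFrame (with many unrelated columns) is what fails for me:
// val model = xgbRegressor.fit(train)

// Projecting down to only the columns the regressor actually uses works reliably:
val model = xgbRegressor.fit(
  train.select(xgbRegressor.getFeaturesCol, xgbRegressor.getLabelCol)
)
```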