Since the 1.7.0 release, SparkXGBRanker requires the data to be sorted by qid. This works fine with a single worker and a sorted DataFrame. With multiple workers, however, the data reaches them unordered and raises an exception:
org.apache.spark.api.python.PythonException: 'xgboost.core.XGBoostError: [17:46:40] …/src/data/data.cc:486: Check failed: non_dec: qid must be sorted in non-decreasing order along with data.
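For reference, the condition in the message appears to mean that, within the rows a worker receives, a qid value must never be smaller than the one before it. A minimal plain-Python sketch of that condition follows; the helper name is_non_decreasing is made up for illustration and is not part of XGBoost:

```python
def is_non_decreasing(qids):
    """Return True if no qid is smaller than the one before it."""
    return all(a <= b for a, b in zip(qids, qids[1:]))


# Rows grouped by query pass the check.
grouped_ok = is_non_decreasing([0, 0, 0, 1, 1, 1])
# The same queries in interleaved order fail it.
interleaved_ok = is_non_decreasing([0, 1, 0, 1, 0, 1])
print(grouped_ok, interleaved_ok)
```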
Code to reproduce:
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBRanker

# Reuse the active session (e.g. in a notebook) or create one.
spark = SparkSession.builder.getOrCreate()

df_train = spark.createDataFrame(
    [
        (Vectors.dense(1.0, 2.0, 3.0), 0, 0),
        (Vectors.dense(4.0, 5.0, 6.0), 1, 0),
        (Vectors.dense(9.0, 4.0, 8.0), 2, 0),
        (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 0, 1),
        (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, 1),
        (Vectors.sparse(3, {1: 8.0, 2: 9.5}), 2, 1),
    ],
    ["features", "label", "qid"],
)
# Global sort by qid; sufficient with one worker, but not with several.
df_train = df_train.sort(df_train.qid.asc())

df_test = spark.createDataFrame(
    [
        (Vectors.dense(1.5, 2.0, 3.0), 0),
        (Vectors.dense(4.5, 5.0, 6.0), 0),
        (Vectors.dense(9.0, 4.5, 8.0), 0),
        (Vectors.sparse(3, {1: 1.0, 2: 6.0}), 1),
        (Vectors.sparse(3, {1: 6.0, 2: 7.0}), 1),
        (Vectors.sparse(3, {1: 8.0, 2: 10.5}), 1),
    ],
    ["features", "qid"],
)

# Two workers: this is the setup that triggers the error above.
ranker = SparkXGBRanker(qid_col="qid", num_workers=2)
model = ranker.fit(df_train)
model.transform(df_test).show()
Is there a way to prepare the DataFrame so that each worker receives its data in sorted order? Or should the sorting be done on each worker?
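To make the second option concrete, here is a rough sketch of what sorting on each worker could look like, written in plain Python so that it runs without a cluster. It assumes each worker sees its partition as a list of (features, label, qid) rows. The helper sort_partition_by_qid is hypothetical and is not an XGBoost or Spark API; whether the library exposes a hook for this is not something this sketch answers.

```python
from operator import itemgetter


def sort_partition_by_qid(rows, qid_index=2):
    """Return the partition's rows ordered by qid.

    sorted() is stable, so rows sharing a qid keep their relative order.
    """
    return sorted(rows, key=itemgetter(qid_index))


# Hypothetical partition as a worker might receive it after a shuffle:
# each row is (features, label, qid).
partition = [
    ([9.0, 4.0, 8.0], 2, 0),
    ([0.0, 1.0, 5.5], 0, 1),
    ([1.0, 2.0, 3.0], 0, 0),
    ([0.0, 6.0, 7.5], 1, 1),
]

ordered = sort_partition_by_qid(partition)
print([row[2] for row in ordered])  # qid column after sorting: [0, 0, 1, 1]
```

If the partition arrives as a pandas DataFrame instead, the equivalent call would be sort_values("qid", kind="stable").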