SparkXGBRanker does not work on parallel workers

p-p-m · November 3, 2022, 5:56pm

SparkXGBRanker in the 1.7.0 release requires the data to be sorted by qid. It works fine with one worker and a sorted DataFrame. With multiple workers, however, the data reaches them unordered and an exception is raised:

org.apache.spark.api.python.PythonException: 'xgboost.core.XGBoostError: [17:46:40] …/src/data/data.cc:486: Check failed: non_dec: qid must be sorted in non-decreasing order along with data.

Code to reproduce:

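from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBRanker

# Reuse the active session (e.g. in a notebook) or create one
spark = SparkSession.builder.getOrCreate()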
df_train = spark.createDataFrame(
    [
        (Vectors.dense(1.0, 2.0, 3.0), 0, 0),
        (Vectors.dense(4.0, 5.0, 6.0), 1, 0),
        (Vectors.dense(9.0, 4.0, 8.0), 2, 0),
        (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 0, 1),
        (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, 1),
        (Vectors.sparse(3, {1: 8.0, 2: 9.5}), 2, 1),
    ],
    ["features", "label", "qid"],
)
df_train = df_train.sort(df_train.qid.asc())
df_test = spark.createDataFrame(
    [
        (Vectors.dense(1.5, 2.0, 3.0), 0),
        (Vectors.dense(4.5, 5.0, 6.0), 0),
        (Vectors.dense(9.0, 4.5, 8.0), 0),
        (Vectors.sparse(3, {1: 1.0, 2: 6.0}), 1),
        (Vectors.sparse(3, {1: 6.0, 2: 7.0}), 1),
        (Vectors.sparse(3, {1: 8.0, 2: 10.5}), 1),
    ],
    ["features", "qid"],
)
ranker = SparkXGBRanker(qid_col="qid", num_workers=2)
model = ranker.fit(df_train)
model.transform(df_test).show()

Is there any way to prepare the DataFrame so that each worker receives it in sorted order? Or should the sorting be done on each worker?

p-p-m · November 3, 2022, 7:17pm

This happens because of the repartition here: https://github.com/dmlc/xgboost/blob/4bc59ef7c33061d17820137253d617b051a72d65/python-package/xgboost/spark/core.py#L729. Row order is not preserved after repartition.
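
To make the ordering problem concrete, here is a toy model in plain Python. It is not Spark's actual shuffle implementation; it only assumes that each input partition deals its rows round-robin into output blocks and that an output partition may receive those blocks in any order. The function names and the arrival_order parameter are made up for illustration.

```python
def is_non_decreasing(qids):
    """True if every qid is >= the one before it (the check XGBoost enforces)."""
    return all(a <= b for a, b in zip(qids, qids[1:]))


def toy_repartition(input_partitions, num_out, arrival_order):
    """Toy shuffle: round-robin rows into blocks, then concatenate the blocks
    for each output partition in the given (hypothetical) arrival order."""
    blocks = []
    for part in input_partitions:
        out = [[] for _ in range(num_out)]
        for i, row in enumerate(part):
            out[i % num_out].append(row)
        blocks.append(out)
    return [
        [row for src in arrival_order for row in blocks[src][dst]]
        for dst in range(num_out)
    ]


# qid values of a globally sorted dataset split over two input partitions
sorted_input = [[0, 0, 0], [1, 1, 1]]

# Blocks arrive in the original order: every output partition stays sorted
in_order = toy_repartition(sorted_input, 2, arrival_order=[0, 1])

# Blocks arrive in the opposite order: qid 1 rows land before qid 0 rows
reordered = toy_repartition(sorted_input, 2, arrival_order=[1, 0])

print([is_non_decreasing(p) for p in in_order])
print([is_non_decreasing(p) for p in reordered])
```

In this model, sorting the input beforehand does not help, because the order inside each output partition depends on how the blocks are combined after the shuffle.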

Funnily enough, even if the input is already sorted and repartitioned, self._repartition_needed(dataset) returns true, because it expects the first word of the query plan to be Repartition.

As a hotfix, it is possible to monkey-patch the _repartition_needed method to always return false and make sure that the input df is partitioned in advance.
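
A minimal sketch of that hotfix, under some assumptions: the estimator calls self._repartition_needed(dataset) as described above, the method is private and may change between releases, and the two helper names below are hypothetical. The DataFrame calls used are the standard PySpark repartition and sortWithinPartitions.

```python
def disable_forced_repartition(estimator_cls):
    """Patch estimator_cls so that _repartition_needed always returns False.

    Assumption: the estimator looks the method up as
    self._repartition_needed(dataset). It is a private method, so this
    patch may stop working in other xgboost versions.
    """
    def _never_repartition(self, dataset):
        return False

    estimator_cls._repartition_needed = _never_repartition
    return estimator_cls


def partition_and_sort_by_qid(df, num_workers, qid_col="qid"):
    """Hash-partition df by qid so each query group lands in one partition,
    then sort the rows inside every partition by qid."""
    return df.repartition(num_workers, qid_col).sortWithinPartitions(qid_col)
```

With the reproduction code from the first post, this would be applied as disable_forced_repartition(SparkXGBRanker), followed by ranker.fit(partition_and_sort_by_qid(df_train, 2)). The partition count should match num_workers. Hash partitioning can leave partitions unbalanced or even empty when there are only a few distinct qid values, so it is worth checking the partition sizes before training.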

hcho3 · November 3, 2022, 8:46pm

Can you file a GitHub issue about _repartition_needed?