What is the mathematical definition of the quantile transformation in xgboost.QuantileDMatrix?

phydev · October 4, 2023, 7:41am

The XGBoost package provides the function xgboost.QuantileDMatrix which takes a numpy.ndarray or pandas.DataFrame as input, applies quantile transformation and stores the data in a sparse representation to improve performance. To the best of my knowledge, if the parameter max_bin is set to be equal or larger to the number of samples in the input data (max_bin>=number_of_samples) then the quantile transformation has no effect since each data point is represented by the median of itself. However, if you do that and inspect the data afterwards with QuantileDMatrix.get_data().data you will find that the lowest value in the data is always replaced by a completely different value. If you have p features, then it will replace one value for each feature.

So how QuantileDMatrix really works? How this quantisation is defined mathematically?

It seems like the lowest value in each column is replaced using the expression min(2x, 0)-1.e-05, but I cannot find any mathematical reasoning behind it.

How to reproduce:

import xgboost as xgb
import pandas as pd
import numpy as np

# define data with numpy
feature1 = np.array([1,2,3,4])

# put it into pandas
a = pd.DataFrame({'feature1': feature1})

quantized_a = xgb.QuantileDMatrix(a, max_bin = 4)

# to show that the behaviour is consistent both with pandas and numpy
quantized_feature1 = xgb.QuantileDMatrix(feature1.reshape(-1, 1), max_bin = 4)

print(quantized_a.get_data().data)
print(quantized_feature1.get_data().data)
# output: [-1.e-05, 2.e+00, 3.e+00, 4.e+00 ]

# different data yields similar problem
feature2 = np.array([10399., 34552., -48585., 70.])
quantized_feature2 = xgb.QuantileDMatrix(feature2.reshape(-1, 1), max_bin = 4)
print(quantized_feature2.get_data().data)

np.testing.assert_almost_equal(feature2, quantized_feature2.get_data().data)
# Arrays are not almost equal to 7 decimals

# Mismatched elements: 1 / 4 (25%)
# Max absolute difference: 48585.
# Max relative difference: 0.5
# x: array([ 10399.,  34552., -48585.,     70.])
# y: array([ 1.0399e+04,  3.4552e+04, -9.7170e+04,  7.0000e+01], dtype=float32)
# in this case -48686 is the value affected, the lowest. 
# If you make it positive, then the value affected 
# is 70 which becomes the lowest one

Here are the requirements:

xgboost>=1.7.6
numpy>=1.23.5
pandas>=1.5.7