How to use SparseMatrix?

Hi,
I am working with XGBoost through the Python scikit-learn API, using a sparse matrix (I am doing NLP with TF-IDF features).

I have around 200K rows and 50K columns, but my matrix is very sparse: fewer than 50 non-null entries per sample (less than 0.1% non-zero elements).

The Python scikit-learn API handles sparse matrices, but the documentation says little about how. I have read in places that the unstored values of a sparse matrix are treated as missing values rather than as zeros, but only on forums, never in the docs, and usually poorly explained, so I cannot be sure of what is actually happening.

How can I be sure that my zeros are treated as zeros (I don't have any missing values)?

Working with a dense matrix is unfeasible for both training and inference, due to the size of the dataset.
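For context, a rough back-of-the-envelope calculation (assuming float32 values and CSR storage, which are assumptions on my side, not stated above) shows why dense is out of the question:

```python
# Rough memory estimate (assumptions: float32 values, CSR storage,
# int32 column indices, int64 row pointers).
rows, cols = 200_000, 50_000
nnz_per_row = 50

dense_bytes = rows * cols * 4          # every cell stored
# CSR stores one value + one column index per non-zero,
# plus one row pointer per row (and one extra).
sparse_bytes = rows * nnz_per_row * (4 + 4) + (rows + 1) * 8

print(f"dense:  {dense_bytes / 1e9:.1f} GB")   # ~40 GB
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")  # ~80 MB
```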

Is XGBoost optimized for sparse matrices, or does it create a dense matrix internally (cancelling the benefit of passing a sparse input in the first place)?

If your matrix is a SciPy sparse matrix (for example scipy.sparse.csr_matrix), then all zeros are treated as missing values. Zeros (which are thus also missing values) are not stored in the SciPy sparse matrix, leading to large memory savings.
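You can see this directly in SciPy: when a matrix is converted to CSR, the zeros are simply never stored, so there is nothing for XGBoost to read there. A tiny illustration:

```python
import numpy as np
from scipy import sparse

dense = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, 2.0]])
csr = sparse.csr_matrix(dense)

# Only the non-zero entries are stored; the zeros do not exist in the
# underlying data array, which is why they look like "missing" to XGBoost.
print(csr.nnz)    # 2 stored values out of 6 cells
print(csr.data)   # [1.5 2.]
```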

I don’t have any missing value

You actually do, because you mentioned that each sample has fewer than 50 non-null entries; all the unstored entries are treated as missing.

Is XGboost optimized for sparse matrices

XGBoost uses a sparse representation internally and does not create a dense matrix, so passing a sparse input does keep its benefit.

Thanks for the answer.

What can I do to ensure they are treated as zeros?

In a SciPy sparse matrix, zero is considered equivalent to the missing value, so once the zeros are unstored there is no way to tell the two apart. In practice this is usually fine: XGBoost's tree learner assigns missing values a learned default direction at each split, which plays the role your zeros would.