Hi,
I am working with XGBoost through the Python sklearn API, with sparse matrix input (I am doing NLP with TF-IDF features).
I have around 200K rows and 50K columns, but my matrix is very sparse: fewer than 50 non-null entries per sample (less than 0.1% non-zero elements).
The Python sklearn API accepts sparse matrices, but the documentation does not explain how they are handled. I have read in a few places that the implicit values of a sparse matrix are treated as missing values rather than as zeros, but this only appears on forums, never in the docs, and it is always poorly explained, so I cannot be sure what is actually happening.
How can I be sure that my zeros are treated as zeros (I don't have any missing values)?
Working with a dense matrix is unfeasible for both training and inference, due to the size of the dataset.
Is XGBoost optimized for sparse matrices, or does it build a dense matrix internally (cancelling the benefit of passing a sparse input in the first place)?