Train using sparse data with GPU


#1

I want to use xgb.train() to train my XGBoost model. My DMatrix is created from scipy.sparse data, like the following:
xgb.DMatrix(scipy.sparse.vstack(csr_array), label=label_array)
# csr_array is a list of scipy.sparse.csr.csr_matrix objects.

XGBoost works well on CPU (“tree_method” is “exact”), but when I use the GPU (“tree_method” is “gpu_exact”) XGBoost does not return a correct model (the training AUC is wrong).

I would appreciate it if anyone could advise how to make GPU work well.

PS: if I use dense data instead, with the DMatrix created from a dense array,
xgb.DMatrix(numpy.vstack(ar_array), label=label_array)
then both CPU and GPU work well. Do I need to do anything else to train on GPU with sparse data?
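For reference, a minimal sketch (assuming scipy and numpy; the xgb.DMatrix calls are shown only as comments, since running them needs an XGBoost install and labels) of the two input paths described above:

```python
import numpy as np
import scipy.sparse as sp

# Build a list of per-chunk CSR matrices, playing the role of csr_array.
rng = np.random.default_rng(0)
chunks = [sp.csr_matrix(rng.random((3, 5)) * (rng.random((3, 5)) > 0.7))
          for _ in range(4)]

# Sparse path: stack the CSR chunks into one matrix.
X_sparse = sp.vstack(chunks, format="csr")

# Dense path: the equivalent dense array.
X_dense = X_sparse.toarray()

# The two inputs hold the same values; only the representation differs.
print(X_sparse.shape, X_dense.shape)   # (12, 5) (12, 5)
print(np.allclose(X_sparse.toarray(), X_dense))  # True

# The DMatrix would then be built from either representation, e.g.:
#   dtrain = xgb.DMatrix(X_sparse, label=label_array)
#   dtrain = xgb.DMatrix(X_dense,  label=label_array)
```

The key difference between the two runs is only how the zero entries are represented, which is exactly where a sparse-specific issue would show up.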


#2

Today I tested the sample again. For the sparse sample data, as I mentioned in my last post, I set up the DMatrix from scipy.sparse.csr.csr_matrix. If I use “hist” to train the model, XGBoost works well. However, if I use “gpu_hist”, the resulting model is different from the one obtained with “hist”. Nothing else differs between the two runs except “hist” vs. “gpu_hist”.

Is there some limitation in “gpu_hist” training? Thanks in advance.


#3

Please set the missing=0 parameter when creating the DMatrix.
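A sketch of why this can matter (assuming scipy and numpy; the DMatrix call is left as a comment since it only passes the flag through): a CSR matrix does not physically store zero entries, so the library cannot tell “this feature is 0” apart from “this feature is absent” unless you say which one the gaps mean.

```python
import numpy as np
import scipy.sparse as sp

# A dense row containing genuine zeros.
dense = np.array([[0.0, 1.5, 0.0, 2.0]])

# CSR drops the zeros: only two of the four values are stored.
sparse = sp.csr_matrix(dense)
print(sparse.nnz)  # 2 stored entries for 4 columns

# The unstored entries are ambiguous (zero vs. missing), so the
# missing= parameter resolves it when the DMatrix is created, e.g.:
#   dtrain = xgb.DMatrix(sparse, label=label_array, missing=0)
```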


#4

Thank you for your reply. In my sparse sample data, a key that does not occur is meant to be a missing value. I will re-test after reducing the number of features in my model.