Alternatives for Matrix::sparseMatrix

I am working with a dataset that is ~9.5 million rows by 1700 columns, with around 3.3 billion non-zero entries (so approximately 20% dense). I generally use the R Matrix package and the sparseMatrix function to convert my training data to a dgCMatrix, and then use xgb.DMatrix as XGBoost input. However, the sparseMatrix function cannot handle long vectors (it is limited to 32-bit indexing), which prevents me from using a sparse matrix, and training on a dense matrix is very slow. What are my other options to increase the speed and memory efficiency of model training? I have 400GB RAM available.

Do you mean 64-bit floating point (double) or 64-bit integer (int64_t)?

I’m not sure exactly. The exact error is described here on stackoverflow. Matrix::sparseMatrix returns an error when trying to create a sparse matrix with more than 2^31 elements.

Error in validityMethod(as(object, superClass)) : long vectors not supported yet: …/…/src/include/Rinlinedfuns.h:137

The Matrix package in R uses 32-bit integers for its index vectors, meaning that you cannot have more than 2^31 - 1 non-zero elements in a sparse matrix. Using the dense matrix type is not an option, since it would greatly increase memory consumption and training time.
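To make the limit concrete, here is a quick back-of-the-envelope check using the figures from the question (a sketch, not part of the original discussion):

```python
# R's 32-bit integer indexing caps a single atomic vector at 2^31 - 1
# elements, so a dgCMatrix cannot hold more non-zeros than that.
INT32_MAX = 2**31 - 1          # 2147483647

nnz = 3_300_000_000            # ~3.3 billion non-zero entries (from above)
rows, cols = 9_500_000, 1700

print(INT32_MAX)               # 2147483647
print(nnz > INT32_MAX)         # True: the @i and @x slots would overflow
print(nnz / (rows * cols))     # ~0.204, i.e. roughly 20% dense
```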

You should consider using Python, where SciPy sparse matrices handle large matrices well (the index arrays are upcast to 64-bit integers when needed). Make sure to use an up-to-date SciPy release.
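A minimal sketch of the SciPy route, with toy triplet data standing in for the real matrix (the shapes and values here are illustrative only). SciPy upcasts its index arrays to int64 once the number of non-zeros exceeds the int32 range, and XGBoost's Python API accepts SciPy sparse input directly:

```python
import numpy as np
from scipy import sparse

# Illustrative (row, col, value) triplets; in practice these would come
# from the real 9.5M x 1700 dataset.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 0, 2])
vals = np.array([10.0, 17.0, 44.0, 88.0, 99.0])

# Build in COO form, then convert to CSR for training.
X = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()
print(X.nnz)        # 5
print(X.toarray())

# XGBoost accepts SciPy sparse input directly, e.g.:
# import xgboost as xgb
# dtrain = xgb.DMatrix(X, label=y)
```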

Thanks for the response. Is Python the only option? Are there any alternatives in R?

You can first save the dgCMatrix in svmlight (LIBSVM) format. Use https://github.com/Laurae2/sparsity for fast, efficient writing of svmlight files.

I can’t create the dgCMatrix in the first place, though, due to the long-vector issue. Is it possible to create two dgCMatrix objects, convert them separately to LIBSVM, and then combine those LIBSVM files?

That will work as well.
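For reference, svmlight files can be combined by plain concatenation, but each chunk must end with a newline, otherwise the last row of one file and the first row of the next merge into a single malformed line. A minimal Python sketch (the file names are illustrative):

```python
# Concatenate LIBSVM (svmlight) chunks, making sure each chunk ends with
# a newline so rows from adjacent files never merge into one line.
def combine_svmlight(parts, out_path):
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                data = f.read()
            if data and not data.endswith(b"\n"):
                data += b"\n"
            out.write(data)

# combine_svmlight(["part1.libsvm", "part2.libsvm"], "combined_data.libsvm")
```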

After doing the above, I am getting an error:

xgb.DMatrix(data          = "data.libsvm",
            weight        = exposure,
            feature_names = features)

Error:

[14:44:50] amalgamation/../dmlc-core/src/io/input_split_base.cc:195: curr=24731273879,begin=0,end=18446744072671021719,fileptr=0,fileoffset=18446744072671021719
Error in xgb.DMatrix(data = "data.libsvm",  : 
[14:44:50] amalgamation/../dmlc-core/src/io/input_split_base.cc:203: file offset not calculated correctly

How did you split the matrix? You should split it so that each data row keeps all of its feature values (a row partition).
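In SciPy terms, a row partition is just row slicing, which keeps every feature column in both halves. A hedged sketch with a small random matrix standing in for the real data:

```python
from scipy import sparse

# Toy CSR matrix standing in for the full dataset; a row partition keeps
# every column (feature) and splits only the rows.
X = sparse.random(10, 5, density=0.4, format="csr", random_state=0)

half = X.shape[0] // 2
X_top, X_bottom = X[:half], X[half:]      # 5 x 5 each, same 5 features

print(X_top.shape, X_bottom.shape)
print(X_top.nnz + X_bottom.nnz == X.nnz)  # True: no entries lost
```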

I split it in half, into two dgCMatrix with equal amounts of rows. Both halves have the same features.

sparsity::write.svmlight(file = file_name,
                         sparseMatrix = matrix_sparse_1,
                         labelVector = labels)

I then combined the LIBSVM files into one file through Windows command prompt.

copy *.libsvm combined_data.libsvm

The xgb.DMatrix error also occurs if I use either half of the sparse matrix as the input.

I have found a possible issue. I loaded one of the LIBSVM files back in, and the resulting sparse matrix object differs from the one I passed to the write function.

sparsity::write.svmlight(file = file_name,
                         sparseMatrix = matrix_sparse_1,
                         labelVector = labels)
str(matrix_sparse_1)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:1611810977] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ p       : int [1:1715] 0 4783557 7040883 11823377 16606934 21231518 26015074 30798631 35579313 40362870 ...
  ..@ Dim     : int [1:2] 4783557 1714
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:4783557] "00000000001" "00000000002" "00000000003" "00000000004" ...
  .. ..$ : chr [1:1714] "x1" "x2" "x3" "x4" ...
  ..@ x       : num [1:1611810977] 10 17 44 88 99 24 48 38 61 27 ...
  ..@ factors : list()

Then, if we load the saved LIBSVM file:

loaded_matrix <- sparsity::read.svmlight(file = file_name)
str(loaded_matrix$matrix)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:1611810977] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ p       : int [1:1715] 0 4783557 7040883 11823377 16606934 21231518 26015074 30798631 35579313 40362870 ...
  ..@ Dim     : int [1:2] 4783558 1714
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:1611810977] 10 17 44 88 99 24 48 38 61 27 ...
  ..@ factors : list()

Everything is the same except Dim[1], which has changed from 4783557 to 4783558. Could this be the issue?

I’m also noticing that the write.svmlight function is rounding numbers via scientific notation. For example, 9,999,999 becomes 1e+007 in the libsvm file, and is therefore read back as 10,000,000 when that LIBSVM is loaded into R.
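For what it's worth, that rounding pattern matches C's %g format, which keeps only six significant digits by default; whether sparsity uses it internally is an assumption. A quick Python demonstration (some Windows C runtimes print the exponent as e+007 rather than e+07, which would match the file contents above):

```python
# %g keeps only six significant digits by default, so seven-digit values
# switch to scientific notation and lose precision on the way out.
print("%g" % 9999999)      # '1e+07'
print(float("1e+07"))      # 10000000.0 -- not 9999999
# Writing with more significant digits round-trips exactly:
print("%.17g" % 9999999)   # '9999999'
```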