What is the most efficient and scalable way to convert a Spark DataFrame (Scala) to a sparse DMatrix?
Say I have a DataFrame train with columns row_index, column_index, and value. The conversion would be something like:
new DMatrix(train.select("row_index"), train.select("column_index"), train.select("value"), DMatrix.SparseType.CSR, n_col)
However, the above code results in a type mismatch, because DMatrix expects an Array[Long] while select returns a DataFrame.
Something like train.select(F.collect_list("row_index")).first().getList[Long](0) seems like a possible option, but it pulls the entire column onto the driver, so it doesn't seem memory friendly or scalable.
I am doing this on Databricks, so solutions in any of the supported languages (Python, SQL, Scala) are welcome.
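To make the target layout concrete, here is a minimal sketch of turning (row_index, column_index, value) triples into CSR arrays. It assumes the CSR constructor wants row pointers (one offset per row plus a final total) rather than one row index per entry, which is my reading of the DMatrix signature and may be wrong. The function name coo_to_csr and the sample data are made up for illustration, and it works on plain in-memory lists, so it shows only the shape of the result, not a scalable approach.

```python
from typing import List, Sequence, Tuple


def coo_to_csr(
    rows: Sequence[int],
    cols: Sequence[int],
    vals: Sequence[float],
    n_rows: int,
) -> Tuple[List[int], List[int], List[float]]:
    """Convert COO triples into CSR arrays (row pointers, column indices, data).

    Hypothetical helper for illustration only; everything is held in memory.
    """
    # Count how many stored entries fall in each row.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1

    # Prefix-sum the counts: indptr[i] is where row i starts in indices/data,
    # and indptr[n_rows] is the total number of stored entries.
    indptr = [0] * (n_rows + 1)
    for i in range(n_rows):
        indptr[i + 1] = indptr[i] + counts[i]

    # Place each entry into the next free slot of its row.
    next_slot = indptr[:-1]  # slicing copies, so indptr is not modified
    indices = [0] * len(vals)
    data = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        pos = next_slot[r]
        indices[pos] = c
        data[pos] = float(v)
        next_slot[r] += 1

    return indptr, indices, data


# Three stored entries in a 3-row matrix; row 1 is empty.
indptr, indices, data = coo_to_csr(
    rows=[0, 0, 2], cols=[1, 3, 2], vals=[1.0, 2.0, 3.0], n_rows=3
)
```

Under that assumption, the first argument has n_rows + 1 entries, while the other two have one entry per stored value, so the three DataFrame columns cannot be passed through one-to-one even after they are converted to arrays.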