XGBoost4J - Scala DataFrame to sparse DMatrix

What is the most efficient and scalable way to convert a Spark DataFrame (in Scala) to a sparse DMatrix?

Say I have a DataFrame train in long format, with columns row_index, column_index, and value. What I have in mind is something like

new DMatrix(train.select("row_index"), train.select("column_index"), train.select("value"), DMatrix.SparseType.CSR, n_col)

However, the above code results in a type mismatch, because the DMatrix constructor expects local arrays (for example Array[Long] for the first argument), not DataFrames.
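For context on what the CSR constructor wants: as far as I understand the XGBoost4J documentation, the first array holds row offsets (one entry per row plus one), not one row index per value, so even collected columns would not fit directly. A minimal plain-Scala sketch of that layout follows, assuming the triplets fit in local memory; Csr and toCsr are hypothetical names used for illustration, not part of XGBoost4J.

```scala
// Hypothetical container for the three CSR arrays (not an XGBoost4J type).
final case class Csr(rowHeaders: Array[Long], colIndex: Array[Int], data: Array[Float])

object CsrSketch {
  // Builds CSR arrays from (row, column, value) triplets held in local memory.
  // Assumes 0-based row indices in the range [0, nRow).
  def toCsr(triplets: Seq[(Int, Int, Float)], nRow: Int): Csr = {
    // CSR stores values row by row, so order by row first, then by column.
    val sorted = triplets.sortBy { case (r, c, _) => (r, c) }

    // counts(r + 1) holds the number of stored values in row r.
    val counts = new Array[Long](nRow + 1)
    sorted.foreach { case (r, _, _) => counts(r + 1) += 1 }

    // Running sum gives the offset at which each row starts; length is nRow + 1.
    val rowHeaders = counts.scanLeft(0L)(_ + _).tail

    Csr(rowHeaders, sorted.map(_._2).toArray, sorted.map(_._3).toArray)
  }
}
```

The three arrays would then be passed as the first three arguments of the constructor shown above. This only illustrates the layout; it still requires all the data on a single machine, which is the scalability problem described next.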

train.select(F.collect_list("row_index")).first().getList[Long](0) seems like a possible option, but it pulls the entire column onto the driver, so it is neither memory-friendly nor scalable.

I am doing this on Databricks, so solutions in any of the supported languages (Python, SQL, Scala) are welcome.

To state my goal more simply: I am trying to figure out how to use XGBoost4J on Databricks without pivoting my long-format DataFrame to wide format, so that I can avoid using memory unnecessarily.

The answer was to build one sparse vector per row rather than trying to create a sparse matrix or a DMatrix.

train.rdd.map(r => (r.getInt(0), (r.getInt(1), r.getInt(2).toDouble))).groupByKey().map(r => (r._1, Vectors.sparse(n_col, r._2.toSeq))).toDF
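Spelled out with imports and named columns, the one-liner above might look like the sketch below. It assumes the Vectors in question is org.apache.spark.ml.linalg.Vectors, and that the three columns can be cast to int, int, and double. The names LongToSparseRows and toSparseRows and the output column name features are hypothetical, chosen for illustration only.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object LongToSparseRows {
  // Groups long-format (row_index, column_index, value) records into one
  // sparse vector per row_index. nCol is the total number of feature columns.
  def toSparseRows(spark: SparkSession, train: DataFrame, nCol: Int): DataFrame = {
    import spark.implicits._
    train
      .select($"row_index".cast("int"), $"column_index".cast("int"), $"value".cast("double"))
      .as[(Int, Int, Double)] // tuple columns are mapped by position
      .rdd
      .map { case (row, col, value) => (row, (col, value)) }
      .groupByKey()
      // Vectors.sparse accepts unordered (index, value) pairs.
      .map { case (row, entries) => (row, Vectors.sparse(nCol, entries.toSeq)) }
      .toDF("row_index", "features")
  }

  def main(args: Array[String]): Unit = {
    // Local session for a quick check; on Databricks the provided `spark` would be used.
    val spark = SparkSession.builder().master("local[*]").appName("sparse-rows").getOrCreate()
    import spark.implicits._
    val train = Seq((0, 0, 1.0), (0, 3, 2.0), (1, 2, 5.0))
      .toDF("row_index", "column_index", "value")
    toSparseRows(spark, train, 4).show(false)
    spark.stop()
  }
}
```

Only the values of a single row are gathered together at any point, so nothing is collected to the driver. One caveat: each (row_index, column_index) pair is assumed to appear at most once, since a sparse vector cannot hold duplicate indices.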

I tested this by scoring a sample of the data in R using Matrix::sparseMatrix and xgboost::xgb.DMatrix, and the results matched.