When you google “Apache Spark”, most tutorials and code examples use Scala.
Yes, you are right. And Spark indeed only supports Scala 2.11. They are in the process of supporting 2.12, with a lot of the work already finished.
I was not talking only about Spark, though. At our company we have several XGBoost recommenders. They use a Hadoop Oozie workflow to create a large feature-vector table in Hive. In the end, training runs on a single node with the default XGBoost library (without Spark).
Our REST services and our whole data-science ecosystem use Scala 2.12. Thus we compile the training code against 2.11 and cross-compile some shared one-hot-encoding code for both 2.11 and 2.12. The model trees are exported as plain text.
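For illustration, a cross-build like the one described could be wired up in sbt roughly as follows. This is a sketch only; the module names and version numbers are made up for the example, not our actual build.

```scala
// build.sbt -- illustrative sketch of the cross-build described above.

// Shared one-hot-encoding code, published for both Scala versions
// so that the 2.12 REST services and the 2.11 training job can use it.
lazy val sharedEncoding = (project in file("shared-encoding"))
  .settings(
    scalaVersion := "2.11.12",
    crossScalaVersions := Seq("2.11.12", "2.12.8")
  )

// Training code stays on 2.11 because xgboost4j is only published for 2.11.
lazy val training = (project in file("training"))
  .dependsOn(sharedEncoding)
  .settings(
    scalaVersion := "2.11.12"
  )
```

With this layout, `sbt "+sharedEncoding/publishLocal"` builds and publishes the shared module once per Scala version in `crossScalaVersions`.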
For the actual recommendations, people have written a custom model-tree parser, so we are not using the XGBoost predict method at all.
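To make the custom-parser idea concrete, here is a minimal sketch of parsing XGBoost's plain-text tree dump and walking a single tree for one feature vector. The class and helper names are invented for this example (this is not our actual parser), and it only handles the basic `N:[fK<T] yes=…,no=…,missing=…` and `N:leaf=…` node lines; `booster[i]:` headers and anything else are skipped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Minimal, illustrative parser for XGBoost's plain-text tree dump.
 * Split lines look like:  0:[f2<2.5] yes=1,no=2,missing=1
 * Leaf lines look like:   1:leaf=0.5
 * Lines that match neither pattern (e.g. "booster[0]:") are ignored.
 */
public class TreeDumpParser {
    static final Pattern SPLIT = Pattern.compile(
        "(\\d+):\\[f(\\d+)<([-0-9.eE+]+)\\] yes=(\\d+),no=(\\d+),missing=(\\d+)");
    static final Pattern LEAF = Pattern.compile("(\\d+):leaf=([-0-9.eE+]+)");

    // leaf == null means this is an internal split node
    record Node(int feature, double threshold, int yes, int no, int missing, Double leaf) {}

    /** Parse one tree's dump into a node-id -> node map. */
    static Map<Integer, Node> parseTree(String dump) {
        Map<Integer, Node> nodes = new HashMap<>();
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            Matcher m = SPLIT.matcher(line);
            if (m.matches()) {
                nodes.put(Integer.parseInt(m.group(1)),
                    new Node(Integer.parseInt(m.group(2)), Double.parseDouble(m.group(3)),
                             Integer.parseInt(m.group(4)), Integer.parseInt(m.group(5)),
                             Integer.parseInt(m.group(6)), null));
                continue;
            }
            m = LEAF.matcher(line);
            if (m.matches()) {
                nodes.put(Integer.parseInt(m.group(1)),
                    new Node(-1, 0.0, -1, -1, -1, Double.parseDouble(m.group(2))));
            }
        }
        return nodes;
    }

    /** Walk the tree from node 0 for one dense feature vector; NaN means missing. */
    static double score(Map<Integer, Node> tree, double[] features) {
        int id = 0;
        while (true) {
            Node n = tree.get(id);
            if (n.leaf() != null) return n.leaf();
            double v = features[n.feature()];
            if (Double.isNaN(v)) id = n.missing();
            else id = (v < n.threshold()) ? n.yes() : n.no();
        }
    }

    public static void main(String[] args) {
        String dump = "0:[f2<2.5] yes=1,no=2,missing=1\n"
                    + "1:leaf=0.5\n"
                    + "2:leaf=-0.5\n";
        Map<Integer, Node> tree = parseTree(dump);
        System.out.println(score(tree, new double[]{0, 0, 1.0}));  // takes the "yes" branch
        System.out.println(score(tree, new double[]{0, 0, 3.0}));  // takes the "no" branch
    }
}
```

A full scorer would sum the leaf values over all boosters and, for binary:logistic models, apply a sigmoid at the end; that part is omitted here.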
I have seen some projects in the wild where people have compiled XGBoost against Scala 2.11 and 2.12 and published those artifacts on Maven repositories (e.g. findify.io). It would not be ideal to rely on such third-party binaries, though.
Regarding your impression of other JVM languages: as stated, I totally agree regarding Spark. I think there are a lot of cases where you do not need to spin up a Spark cluster, as memory is crazy cheap nowadays. We operate a local business-professionals network on the German market with 20 million user records. For training we try to use only a recent subset of the data, so it was only around 1 million records, which fit perfectly into memory. I wouldn't even use Oozie and Hadoop for this, but those workflows were the default.
Regarding the JVM languages, I only know of this survey. IMHO it would be awesome to have a pure Java jar without Scala code; in the end, that is the default implementation. This is how the Eclipse Vert.x project is organized, for example.