[jvm-packages] Compile scala 2.12 jar

Hello,

Currently there is no way to use the compiled xgboost4j jar from Maven with Scala 2.12. Scala 2.11 and 2.12 are not binary compatible, so linking against the Maven jar does not work.
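For context, Scala libraries are conventionally published with the Scala binary version as a suffix on the artifact name, which is why a jar compiled against 2.11 cannot simply be linked from a 2.12 project. In sbt this convention is expressed with `%%` (the coordinates below are illustrative, not the actual published ones):

```scala
// build.sbt (sketch): "%%" appends the Scala binary version to the artifact name,
// so sbt would resolve xgboost4j_2.12 when scalaVersion is 2.12.x.
// If only a 2.11-compiled jar exists, a 2.12 project cannot link against it.
scalaVersion := "2.12.8"

libraryDependencies += "ml.dmlc" %% "xgboost4j" % "0.90" // hypothetical coordinates
```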

The Scala layer looks really slim; I couldn't find any Scala-specific types there.
I wanted to use the JVM classes directly, but that is not possible either.

I would suggest building a separate jar for the Java code and one each for the Scala 2.11 and 2.12 code (if possible).
There are other JVM languages out there; Kotlin and Groovy have much better compatibility with Java and could just use the plain Java jar.
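A build setup along those lines could look like the following in sbt. This is just a sketch of the idea (the actual jvm-packages build uses Maven, and the project names here are made up):

```scala
// build.sbt (sketch): one Scala-free Java core plus a cross-built Scala wrapper.
lazy val xgboost4jJava = (project in file("xgboost4j-java"))
  .settings(
    crossPaths := false,       // no _2.x suffix: a plain Java artifact
    autoScalaLibrary := false  // do not depend on scala-library at all
  )

lazy val xgboost4jScala = (project in file("xgboost4j-scala"))
  .dependsOn(xgboost4jJava)
  .settings(
    // `sbt +publish` would then publish xgboost4j-scala_2.11 and _2.12
    crossScalaVersions := Seq("2.11.12", "2.12.8")
  )
```

With this layout, Kotlin or Groovy users depend only on the Java core, while Scala users pick the wrapper matching their binary version.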

@CodingCat What’s our plan regarding Scala 2.12 support?

@dre-hh When you google “Apache Spark”, most tutorials and code examples use Scala. My impression is that Scala is the de facto standard for programming with Spark. (Keep in mind that Spark integration is probably the killer application for many XGBoost4J users.) So the decision to move to other JVM languages should not be made lightly.

That said, we should definitely look at ways to make XGBoost4J JAR (without Spark) more broadly compatible.

Hello @hcho3

> When you google “Apache Spark”, most tutorials and code examples use Scala.

Yes, you are right. And Spark indeed supports only Scala 2.11; they are in the process of supporting 2.12, with a lot of the work already finished.

I was not talking about Spark only, though. At our company we have several XGBoost recommenders. They use a Hadoop Oozie workflow to create a large feature-vector table in Hive. In the end, training runs on a single node with the default xgboost lib (without Spark).

Our REST services and our whole data science ecosystem use Scala 2.12. Thus we compile the training code with 2.11 and cross-compile some shared one-hot-encoding code for 2.11 and 2.12. The model trees are exported as plain text.

For the actual recommendations, people have written a custom model tree parser, so we are not using the XGBoost predict method at all.

I have seen some projects in the wild where people have compiled xgboost with Scala 2.11 and 2.12 and published those on Maven repos (e.g. findify.io). It would not be ideal to rely on third-party binaries, though.

Regarding your impression about other JVM languages: as stated, I totally agree regarding Spark. But I think there are a lot of cases where you do not need to spin up a Spark cluster, as memory is crazy cheap nowadays. We operate a local business professionals network on the German market with 20 million user records. For training we try to use only a recent subset of the data, so it was only around 1 million records, which fit perfectly into memory. I wouldn't even use Oozie and Hadoop for this, but those workflows were the default.

Regarding the JVM languages, I only know of this survey. IMHO it would be awesome to have a pure Java jar without Scala code; in the end, this is the default implementation. This is how the Eclipse Vert.x project is organized, for example.

The XGBoost4J package currently contains Scala code because it is used by XGBoost4J-Spark, so removing the Scala code from XGBoost4J will require a nontrivial amount of effort.

For now, let us look into building JARs for Scala 2.12.

I am glad to hear that you are finding XGBoost useful. Thanks for explaining your use case.
