XGBoost on spark on Windows 10: 'JavaPackage' object is not callable


#1

Hi,

I am able to run XGBoost on Spark on CentOS once I have built the Java packages and added the .jar files to this environment variable:

    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars <my_path>\xgboost-jars\xgboost4j-0.90-spark.jar,<my_path>\xgboost-jars\xgboost4j-0.90.jar pyspark-shell'
    from sparkxgb import XGBoostClassifier
    xgboost = XGBoostClassifier(
     featuresCol=…
    …
    nthread=…
    )

However, I get this error:

TypeError: 'JavaPackage' object is not callable

when I copy these jars to my Windows 10 Enterprise machine and set:

%env SPARK_JARS='<my_path>\xgboost-jars\xgboost4j-spark-0.90.jar;<my_path>\xgboost-jars\xgboost4j-0.90.jar'
or
%env PYSPARK_SUBMIT_ARGS='<my_path>\xgboost-jars\xgboost4j-spark-0.90.jar;<my_path>\xgboost-jars\xgboost4j-0.90.jar'
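
For comparison, here is a minimal sketch of how a jar list is usually handed to PySpark through `PYSPARK_SUBMIT_ARGS`. It relies on two things about spark-submit: `--jars` takes a comma-separated list, and the value must end with `pyspark-shell`. The helper name `build_submit_args` and the `C:\jars\...` paths are hypothetical placeholders, not taken from this thread.

```python
import os


def build_submit_args(jar_paths):
    """Return a PYSPARK_SUBMIT_ARGS value for the given jar paths.

    Hypothetical helper: --jars expects a comma-separated list, and the
    value has to end with 'pyspark-shell' when PySpark launches the JVM.
    """
    if not jar_paths:
        raise ValueError("at least one jar path is required")
    return "--jars {} pyspark-shell".format(",".join(jar_paths))


# Placeholder paths; raw strings keep the Windows backslashes intact.
jars = [
    r"C:\jars\xgboost4j-spark-0.90.jar",
    r"C:\jars\xgboost4j-0.90.jar",
]
submit_args = build_submit_args(jars)

# Must be set before the SparkContext (and its JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

Note that the separator between jars here is a comma, not the semicolon used for Windows classpaths.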

I know my Spark installation is working because I was able to run this code on Windows.

I found a similar issue here, but that discussion is closed and its solution didn't help me.

Java version “1.8.0_162”
Python 3.7.3
Scala code runner version 2.13.0

Any idea?
Thanks a lot!


#2

Did you copy the JAR file generated on CentOS over to Windows? If so, the JAR file won't contain the correct binary to run on Windows. Note: the XGBoost JAR contains native code and thus needs to be compiled separately for each OS platform.
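
One way to check this, sketched below, is to list the native libraries packed inside a JAR, since a JAR is an ordinary zip archive. The function name `native_libraries_in_jar` is hypothetical, and the sketch makes no assumption about where the XGBoost build places its library inside the archive; it only filters by file extension. A JAR built on Linux would be expected to show a `.so` entry and no `.dll`.

```python
import zipfile

# File extensions commonly used for native libraries on Linux, Windows, and macOS.
NATIVE_SUFFIXES = (".so", ".dll", ".dylib", ".jnilib")


def native_libraries_in_jar(jar_path):
    """List entries in a JAR (a zip archive) that look like native libraries."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.lower().endswith(NATIVE_SUFFIXES)]


# Example call (placeholder path, adjust to your own jar):
# print(native_libraries_in_jar(r"<my_path>\xgboost-jars\xgboost4j-0.90.jar"))
```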


#3

Thank you for replying to me!
Yes, I generated the JARs on CentOS and copied them over to Windows. According to your answer, I need to generate them on Windows. To do that, I am following these steps from here:
mvn -DskipTests=true package
mvn -DskipTests install

I downloaded apache-maven-3.6.1-bin.tar.gz (instructions), uncompressed it, and set my Windows environment variables as follows:
created M3_HOME=C:\Users\<user>\Documents\apache-maven-3.6.1
added %M3_HOME%\bin to PATH

However, when I run 'mvn test' in the folder '/xgboost/jvm-packages', I get this error:


#4

Need to break into two posts because of the limit of links.
This is the error:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for XGBoost JVM Package 0.90:
[INFO]
[INFO] XGBoost JVM Package … FAILURE [ 21.736 s]
[INFO] xgboost4j … SKIPPED
[INFO] xgboost4j-spark … SKIPPED
[INFO] xgboost4j-flink … SKIPPED
[INFO] xgboost4j-example … SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21.980 s
[INFO] Finished at: 2019-07-10T11:36:01-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Plugin org.scalatest:scalatest-maven-plugin:1.0 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.scalatest:scalatest-maven-plugin:jar:1.0: Could not transfer artifact org.scalatest:scalatest-maven-plugin:pom:1.0 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/151.101.40.215] failed: Connection timed out: connect -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException

Any idea?
Thanks!


#5

For some reason, you appear to be having trouble connecting to Maven Central. I'm afraid there's not much we can do to help here.


#6

You are right. I solved the connection problem by adding the file ~/.m2/settings.xml. For anyone who has the same problem, the content of the file is here.

Now I am building the python packages. Will update once it is done.

Thanks, @hcho3!


#7

Some progress but not there yet…

On /xgboost/jvm-packages (release_0.90), when I run ‘mvn clean package’, I get these errors:

...
[ERROR] error: java.lang.NoClassDefFoundError: javax/tools/ToolProvider
...
[INFO] Reactor Summary for XGBoost JVM Package 0.90:
[INFO]
[INFO] XGBoost JVM Package ................................ SUCCESS [  9.093 s]
[INFO] xgboost4j .......................................... FAILURE [ 16.524 s]

...
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on project xgboost4j: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: -10000 (Exit value: -10000) -> [Help 1]

What I am using:
$ java -version
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

I have also tried with this Java version:
java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true

$ scala -version
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
Scala code runner version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.

$ mvn -version
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T12:00:29-07:00)
Maven home: <path>\apache-maven-3.6.1
Java version: 12.0.1, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk-12.0.1
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"

$ cmake -version
cmake version 3.14.4

Any suggestions, please?
Thanks!


#8

Just a quick update:
It probably won't work with JDK 12, as my test showed, but I also found that I was mistakenly pointing to the JRE instead of JDK 1.8. I am testing again now and making some progress. I will update if everything goes well.
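
A quick way to catch the JRE-versus-JDK mix-up is sketched below. It assumes the build picks up Java through the `JAVA_HOME` environment variable and uses the presence of the `javac` compiler as a rough heuristic for a JDK; the function name `looks_like_jdk` is hypothetical.

```python
import os


def looks_like_jdk(java_home):
    """Return True if java_home contains a Java compiler (javac).

    A JRE ships bin/java but not bin/javac, so this is a rough heuristic
    for telling the two apart.
    """
    bin_dir = os.path.join(java_home, "bin")
    return any(os.path.isfile(os.path.join(bin_dir, exe))
               for exe in ("javac", "javac.exe"))


java_home = os.environ.get("JAVA_HOME")
if java_home is None:
    print("JAVA_HOME is not set")
elif looks_like_jdk(java_home):
    print("JAVA_HOME looks like a JDK:", java_home)
else:
    print("JAVA_HOME has no javac; it may point to a JRE:", java_home)
```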


#9

The JAR files have now been created! But here is the current error, which I have not solved yet:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "<path>\spark-2.4.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<path>\spark-2.4.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "<path>\spark-2.4.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving


---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-8-2f8cc998c6cc> in <module>
     16     growPolicy='lossguide',
     17     numWorkers=executors_per_node*nodes,
---> 18     nthread=cores_per_executor
     19 )

~\Documents\spark-2.4.3-bin-hadoop2.7\python\pyspark\__init__.py in wrapper(self, *args, **kwargs)
    108             raise TypeError("Method %s forces keyword arguments." % func.__name__)
    109         self._input_kwargs = kwargs
--> 110         return func(self, **kwargs)
    111     return wrapper
    112 

~\AppData\Local\Temp\spark-605454e4-6f56-4a13-978a-7d3e1bb4a678\userFiles-a31377bb-1156-4ab0-97b7-4eacff20e871\sparkxgb_0.83.zip\sparkxgb\xgboost.py in __init__(self, alpha, baseMarginCol, baseScore, cacheTrainingSet, checkpointInterval, checkpointPath, colsampleBylevel, colsampleBytree, contribPredictionCol, customEval, customObj, eta, evalMetric, featuresCol, gamma, growPolicy, interactionConstraints, labelCol, reg_lambda, lambdaBias, leafPredictionCol, maxBins, maxDeltaStep, maxDepth, maxLeaves, maximizeEvaluationMetrics, minChildWeight, missing, monotoneConstraints, normalizeType, nthread, numClass, numEarlyStoppingRounds, numRound, numWorkers, objective, objectiveType, predictionCol, probabilityCol, rateDrop, rawPredictionCol, sampleType, scalePosWeight, seed, sketchEps, skipDrop, subsample, threshold, timeoutRequestWorkers, trackerConf, trainTestRatio, treeLimit, treeMethod, useExternalMemory, verbosity, weightCol)
    112         super(XGBoostClassifier, self).__init__()
    113         self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier", self.uid)
--> 114         self._create_params_from_java()
    115         self._setDefault()  # We get our defaults from the embedded Scala object, so no need to specify them here.
    116         kwargs = self._input_kwargs

~\Documents\spark-2.4.3-bin-hadoop2.7\python\pyspark\ml\wrapper.py in _create_params_from_java(self)
    147         SPARK-10931: Temporary fix to create params that are defined in the Java obj but not here
    148         """
--> 149         java_params = list(self._java_obj.params())
    150         from pyspark.ml.param import Param
    151         for java_param in java_params:

~\Documents\spark-2.4.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~\Documents\spark-2.4.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~\Documents\spark-2.4.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    334             raise Py4JError(
    335                 "An error occurred while calling {0}{1}{2}".
--> 336                 format(target_id, ".", name))
    337     else:
    338         type = answer[1]

Py4JError: An error occurred while calling o61.params

My guess is that I am not importing XGBoostClassifier properly (from sparkxgb import XGBoostClassifier).

The code I am using is:

import findspark
findspark.init()

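import math, time  # used below for the memory settings and time.sleep
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
import os, sys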
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.7-src.zip'))
%env SPARK_JARS='<path>\.m2\repository\ml\dmlc\xgboost4j\0.90\xgboost4j-0.90.jar;<path>\.m2\repository\ml\dmlc\xgboost4j-spark\0.90\xgboost4j-0.90.jar pyspark-shell'
import pyspark 
...

executors_per_node = 7
nodes=1
cores_per_executor=8
task_per_core=1

cache_size=50
total_size=340000

conf = SparkConf() \
    .set('spark.default.parallelism', f'{nodes*executors_per_node*cores_per_executor*task_per_core}') \
    .set('spark.executor.instances', '{:d}'.format(executors_per_node*nodes)) \
    .set('spark.files.maxPartitionBytes', '256m') \
    .set('spark.app.name', 'pyspark_final-xgboost-0.90-Egor') \
    .set('spark.rdd.compress', 'False') \
    .set('spark.serializer','org.apache.spark.serializer.KryoSerializer') \
    .set('spark.executor.cores','{:d}'.format(cores_per_executor)) \
    .set('spark.executor.memory', '{:d}m'.format(int(math.floor(nodes*total_size/(nodes*executors_per_node)))-1024-int(math.floor(cache_size*1024/(nodes*executors_per_node))))) \
    .set('spark.task.cpus',f'{cores_per_executor}') \
    .set('spark.driver.memory','24g') \
    .set('spark.memory.offHeap.enabled','True') \
    .set('spark.memory.offHeap.size','{:d}m'.format(int(math.floor(cache_size*1024/(nodes*executors_per_node))))) \
    .set('spark.executor.memoryOverhead','{:d}m'.format(int(math.floor(cache_size*1024/(nodes*executors_per_node)))+3000)) \
    .set('spark.sql.join.preferSortMergeJoin','False') \
    .set('spark.memory.storageFraction','0.5') \
    .set('spark.executor.extraJavaOptions', \
         '-XX:+UseParallelGC -XX:+UseParallelOldGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps') \
    .set("spark.jars", "<path>\\\\Documents\\IntelFederal\\spark\\xgboost-jars\\xgboost4j-spark-0.90-Egor.jar")\
    .set("spark.driver.extraClassPath", "<path>\\\\Documents\\IntelFederal\\spark\\xgboost-jars\\xgboost4j-spark-0.90-Egor.jar")\
    .set("spark.executor.extraClassPath", "<path>\\\\Documents\\IntelFederal\\spark\\xgboost-jars\\xgboost4j-spark-0.90-Egor.jar")


sc = SparkContext(conf=conf,master='local[*]') # to run on single node
sc.setLogLevel('ERROR')
spark = SQLContext(sc)

spark_home = os.environ.get('SPARK_HOME', None)
sc.addPyFile(os.path.join(spark_home, 'python/lib/sparkxgb_0.83.zip'))
time.sleep(10)
from sparkxgb import XGBoostClassifier # needs sparkxgb_0.83.zip

import os, sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.7-src.zip'))

xgboost = XGBoostClassifier(
    featuresCol='features',
    labelCol='delinquency_12',
    numRound=100,
    maxDepth=8,
    maxLeaves=256,
    alpha=0.9,
    eta=0.1,
    gamma=0.1,
    subsample=1.0,
    reg_lambda=1.0,
    scalePosWeight=2.0,
    minChildWeight=30.0,
    treeMethod='hist',
    objective='reg:squarederror',
    growPolicy='lossguide',
    numWorkers=executors_per_node*nodes,
    nthread=cores_per_executor
)
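
To make the memory settings in the configuration above easier to read, here is a sketch that evaluates the same expressions with integer division. The variable values are copied from the code; the unit labels in the comments (GB for `cache_size`, MB for `total_size`) are my reading of the `* 1024` factor and the `m` suffixes, not something stated in the post.

```python
# Values copied from the configuration above.
nodes = 1
executors_per_node = 7
cores_per_executor = 8
task_per_core = 1
cache_size = 50        # assumed GB, converted to MB via * 1024
total_size = 340000    # assumed MB

executors = nodes * executors_per_node

# spark.memory.offHeap.size: cache share per executor.
off_heap_mb = cache_size * 1024 // executors
# spark.executor.memory: share of total_size, minus 1024, minus the off-heap share.
executor_memory_mb = nodes * total_size // executors - 1024 - off_heap_mb
# spark.executor.memoryOverhead: off-heap share plus a fixed 3000.
memory_overhead_mb = off_heap_mb + 3000
# spark.default.parallelism
default_parallelism = executors * cores_per_executor * task_per_core

print(off_heap_mb, executor_memory_mb, memory_overhead_mb, default_parallelism)
# prints: 7314 40233 10314 56
```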

Thanks a lot for any help!