Inconsistency in scores of Prediction using R and Java

venky177 · January 31, 2019, 10:22pm

I train the model in R, store it in a binary format, and then load it in Java.
I was originally training the model in R 3.2.2 with Xgboost 0.4-2 and predicting with Xgboost JVM 0.72.
With that setup, the predict functions in R and Java gave me the same results for my input data.
Baseline
8.847367984882893E-6
0.09427586244760379
4.262344683277366E-4
4.970554458048691E-4
0.004658706118213614

I updated R to 3.5.1 and Xgboost to 0.71.1 and 0.71.2 (tried both), and predicted using Xgboost JVM 0.72 and 0.81 (tried both).
The R predict function gave this:
1.37787336598784e-05
9.53892665644611e-02
2.01653583650557e-04
3.29090567502518e-04
6.10666029073281e-03

The Java Predictions gave this
0.29515123439824215
0.9995756389884775
0.9507827625419463
0.9846390802292272
0.9981499148687549

Is there some inconsistency in the R and Java prediction functions or am I missing something?

Training Environment
readr_1.1.0.tar.gz &&
jsonlite_1.4.tar.gz &&
xgboost_0.71.2.tar.gz &&
chron_2.3-47.tar.gz &&
data.table_1.10.4-3.tar.gz &&
magrittr_1.5.tar.gz &&
stringr_1.0.0.tar.gz &&
stringi_0.5-5.tar.gz &&
bindrcpp_0.2.tar.gz &&
tibble_1.3.1.tar.gz &&
BH_1.62.0-1.tar.gz &&
R6_2.2.0.tar.gz &&
hms_0.3.tar.gz &&
assertthat_0.2.0.tar.gz &&
rlang_0.1.4.tar.gz &&
Rcpp_0.12.17.tar.gz && \

R Predict
Sigmoid <- function(x) {
return (exp(x) / (exp(x) + 1))
}
raw.score = predict(model,as.matrix(x))
raw.score = Sigmoid(raw.score)

Java Predict
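// Logistic function applied to the score returned by the booster.
private double Sigmoid(double score)
{
    return Math.exp(score) / (Math.exp(score) + 1);
}

// Load the binary model saved from R and score a single row (fvec is a float[]).
Booster booster = XGBoost.loadModel("model.bin");
DMatrix dMatrix = new DMatrix(fvec, 1, fvec.length);
float[][] prediction = booster.predict(dMatrix);
return Sigmoid(prediction[0][0]);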
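
A side note on the Sigmoid helper: in double arithmetic, exp(x) / (exp(x) + 1) evaluates to NaN once exp(x) overflows to infinity (x above roughly 709). That is unlikely to explain the mismatch in this thread, but a numerically stable form is a cheap safeguard. The sketch below is illustrative only; the class name StableSigmoid is hypothetical.

```java
final class StableSigmoid {
    // The form used in the post: overflows to Infinity / Infinity = NaN for large x.
    static double naive(double x) {
        return Math.exp(x) / (Math.exp(x) + 1);
    }

    // Stable form: only ever exponentiates a non-positive number.
    static double sigmoid(double x) {
        if (x >= 0) {
            return 1.0 / (1.0 + Math.exp(-x));
        }
        double e = Math.exp(x);
        return e / (1.0 + e);
    }

    public static void main(String[] args) {
        System.out.println(naive(800.0));    // NaN
        System.out.println(sigmoid(800.0));  // 1.0
        System.out.println(sigmoid(-800.0)); // 0.0
    }
}
```

For scores of moderate size the two forms agree up to floating-point rounding, so swapping one for the other should not change results like those reported here.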

There is no code change in either training or prediction. Is there something I am missing?

I also tried predicting with models trained on R 3.2.2 Xgb 0.4-2 using R 3.5.1 Xgb 0.71.2 and got the following results:
0.282819604863266
0.997937636304315
0.970238370838886
0.979681025161069
0.997780969263874

which is similar to the predictions from models trained on R 3.5.1 Xgb 0.71.2 and scored with XGB JVM 0.72.

hcho3 · February 1, 2019, 12:39am

How are you loading your test data in Java?

venky177 · February 1, 2019, 8:57am

I read the CSV and then convert double to float before passing it to the DMatrix:
public List<double []> getDataFromCSV(String path){

    List<double[]> data = new ArrayList<>();
    BufferedReader br = null;
    String line = "";
    String cvsSplitBy = ",";

    try {

        br = new BufferedReader(new FileReader(path));
        br.readLine();
        while ((line = br.readLine()) != null) {

            // use comma as separator
            String[] columns = line.split(cvsSplitBy);
            double[] valsOfCols = Arrays.stream(columns)
                .mapToDouble(Double::parseDouble)
                .toArray();
            data.add(valsOfCols);
        }

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (br != null) {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    return data;
}

private float[] toFloat(double [] doubleArray)
{
    float[] floatArray = new float[doubleArray.length];
    for (int i = 0 ; i < doubleArray.length; i++)
    {
        floatArray[i] = (float) doubleArray[i];
    }
    return floatArray;
}

venky177 · February 1, 2019, 8:54pm

I tried training with R 3.5.1 Xgb 0.8.0.1 and predicting with JVM 0.81, and I still get the same results.

hcho3 · February 2, 2019, 12:12am

Does your data have missing values? See https://github.com/dmlc/xgboost/issues/3634.

venky177 · February 2, 2019, 12:20am

Hey, we don’t have any missing values. I did try setting it to 0 while reading to Dmatrix, just now. It did not help. It still produces the same score.

hcho3 · February 2, 2019, 2:16am

Can you post your model here?

venky177 · February 4, 2019, 7:32pm

Hey, we cannot share our model, but we are coming up with a way to work around that. I tried running some toy models just to see if they work in both R and Java.
It looks like they do give the same results. Our data has some features that may be hitting an edge case which causes the models to produce different results.
Does this sound like anything you have encountered before?

hcho3 · February 4, 2019, 8:02pm

@venky177 This may be relevant: https://github.com/dmlc/xgboost/issues/3960#issuecomment-447234404. Make sure that you are converting your input data into 32-bit floating-point type.
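
To illustrate why the 32-bit conversion matters: assuming split thresholds are stored as 32-bit floats (the premise of the advice above), comparing an un-rounded double feature against a float threshold can take a different branch than comparing the float-rounded feature. The class and method names below are hypothetical, and this is a sketch of the comparison only, not XGBoost code.

```java
final class FloatThresholdDemo {
    // Comparison done in double precision: the float threshold is widened to double.
    static boolean goesLeftAsDouble(double feature, float threshold) {
        return feature < threshold;
    }

    // Comparison done after rounding the feature to 32 bits first.
    static boolean goesLeftAsFloat(double feature, float threshold) {
        return (float) feature < threshold;
    }

    public static void main(String[] args) {
        double feature = 0.1;    // nearest double to 0.1
        float threshold = 0.1f;  // nearest float to 0.1, slightly larger than the double

        System.out.println(goesLeftAsDouble(feature, threshold)); // true
        System.out.println(goesLeftAsFloat(feature, threshold));  // false
    }
}
```

A feature value that sits exactly on a threshold after rounding can therefore land on different sides of a split depending on where the rounding happens, which is why both runtimes should see the same float values.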

venky177 · February 5, 2019, 12:47am

Hello, so I saved the data as a DMatrix buffer file in R and loaded that file in Java, and it gave me the correct result.
So the difference is only in the way we read the data into a DMatrix in Java. Can you spot the bug in the Java code?
How do we convert a float array to a DMatrix correctly?

hcho3 · February 5, 2019, 12:46am

Interesting. Can you post a snippet of your data? Is it in CSV or LIBSVM?

venky177 · February 5, 2019, 6:56am

Is it possible that, because R trained with indices starting from 1 and Java predicts with indices starting from 0, the predictions are mismatched?

“Spark assumes that the dataset is using 1-based indexing (feature indices starting with 1). However, when you do
prediction with other bindings of XGBoost (e.g. Python API of XGBoost), XGBoost assumes that the dataset is using
0-based indexing (feature indices starting with 0) by default. It creates a pitfall for the users who train model with
Spark but predict with the dataset in the same format in other bindings of XGBoost. The solution is to transform the
dataset to 0-based indexing before you predict with, for example, Python API, or you append ?indexing_mode=1
to your file path when loading with DMatrix. For example in Python:
xgb.DMatrix('test.libsvm?indexing_mode=1')”
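
A minimal sketch of the index shift the quoted passage describes, converting one LIBSVM line from 1-based to 0-based feature indices. The helper name toZeroBased is hypothetical, and the sketch assumes well-formed index:value tokens after the label.

```java
final class LibsvmIndexShift {
    // Rewrites "label i:v j:w ..." so that every feature index is reduced by one.
    static String toZeroBased(String line) {
        String[] tokens = line.trim().split("\\s+");
        StringBuilder out = new StringBuilder(tokens[0]); // the label is copied unchanged
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            int index = Integer.parseInt(tokens[i].substring(0, colon));
            if (index < 1) {
                throw new IllegalArgumentException("not a 1-based index: " + tokens[i]);
            }
            // Keep the ":value" part exactly as written to avoid re-formatting the number.
            out.append(' ').append(index - 1).append(tokens[i].substring(colon));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toZeroBased("1 1:0.5 3:2")); // prints "1 0:0.5 2:2"
    }
}
```

This only applies to data in LIBSVM form; a dense float array passed row by row has no explicit indices to shift.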

My data is in CSV form, with high-precision floating-point values (0.xxx with 18 decimal places). Most of the columns are like that; some contain 0s and 1s, and some contain larger doubles such as 843.12.

The values are read as doubles, and the toFloat function then converts them to float. This float vector is fed to the DMatrix constructor one row at a time, and predictions are made.

venky177 · February 5, 2019, 7:52am

int[] colIndex = new int[fvec.length];
for (int i = 0; i < colIndex.length; i++) {
    colIndex[i] = i;
}
DMatrix dMatrix = new DMatrix(new long[]{0, fvec.length}, colIndex, fvec, DMatrix.SparseType.CSR, fvec.length);
float[][] prediction = booster.predict(dMatrix);

This works with Models Trained on R 3.5.1 Xgb 0.71.1 but not with R 3.2.2 Xgb 0.4-2

        DMatrix dMatrix = new DMatrix(fvec, 1, fvec.length);

This does not work with Models Trained on R 3.5.1 Xgb 0.71.1 but works with R 3.2.2 Xgb 0.4-2
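
One possible reading of these two observations, consistent with the missing-value issue linked earlier in the thread but not confirmed here, is that the two constructors differ in whether zero entries reach the booster as explicit values or as missing ones. The sketch below makes that choice explicit when building the CSR arrays for a single row. CsrRow and fromDense are hypothetical helpers, not part of XGBoost4J; the resulting arrays are meant for the same five-argument DMatrix constructor used above.

```java
import java.util.Arrays;

final class CsrRow {
    final long[] rowHeaders; // {0, number of stored entries}
    final int[] colIndex;    // column of each stored entry
    final float[] data;      // value of each stored entry

    CsrRow(long[] rowHeaders, int[] colIndex, float[] data) {
        this.rowHeaders = rowHeaders;
        this.colIndex = colIndex;
        this.data = data;
    }

    // Entries equal to `missing` are left out; pass Float.NaN to keep explicit zeros.
    static CsrRow fromDense(float[] row, float missing) {
        int[] idx = new int[row.length];
        float[] vals = new float[row.length];
        int n = 0;
        for (int i = 0; i < row.length; i++) {
            boolean isMissing = Float.isNaN(missing) ? Float.isNaN(row[i]) : row[i] == missing;
            if (!isMissing) {
                idx[n] = i;
                vals[n] = row[i];
                n++;
            }
        }
        return new CsrRow(new long[]{0, n}, Arrays.copyOf(idx, n), Arrays.copyOf(vals, n));
    }

    public static void main(String[] args) {
        float[] fvec = {1.5f, 0f, 3f};
        System.out.println(Arrays.toString(fromDense(fvec, Float.NaN).colIndex)); // [0, 1, 2]
        System.out.println(Arrays.toString(fromDense(fvec, 0f).colIndex));        // [0, 2]
    }
}
```

Scoring the same row both ways against each model would show whether the treatment of zeros is what separates the old and new models.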

petestorey26 · January 2, 2020, 11:05am

I managed to solve this with a painful bit of a hack: creating a libsvm file from a Map of <Integer, Integer> (the columns and their values) and then loading that file:

        Random r = new Random(); // Use a random to hopefully prevent a name clash (already unlikely)
        Path path = Files.createTempFile("mlinput-" + System.currentTimeMillis() + "-" + r.nextInt(100), ".libsvm");
        File inputFile = path.toFile();

        // So, create the libsvm format string, which is say `0 0:1 1:0 2:1 3:0 4:1999 5:0` etc
        StringBuffer libsvmText = new StringBuffer();
        libsvmText.append("0 "); // Need a "label" as the first thing in the libsvm format - it doesn't do anything though as far as I can see
        map.entrySet().stream()
                .filter(e -> e.getKey() != null && e.getValue() != null) // Filter any null values that have crept in there
                .forEach(e -> libsvmText.append(e.getKey() + ":" + e.getValue() + " "));

        Files.write(path, libsvmText.toString().getBytes(StandardCharsets.UTF_8));

        DMatrix input = new DMatrix(inputFile.getAbsolutePath());

        inputFile.delete();

        float[][] predicts = model.predict(input);

The model was just loaded directly and it now gives the same outputs as R does (and for that matter Python with some similar monkeying about to load an R model).
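
The string-building step of that workaround can be isolated into a small helper. This is a sketch under a few assumptions: the map keys are already the feature indices the model expects, the label is a dummy 0, and entries are written in ascending index order purely for determinism. LibsvmLine and format are hypothetical names. Note also that Files.createTempFile already generates a unique file name, so the extra random suffix should not be needed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class LibsvmLine {
    // Builds "0 idx:val idx:val ..." from a column -> value map, skipping null keys and values.
    static String format(Map<Integer, Integer> features) {
        TreeMap<Integer, Integer> sorted = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : features.entrySet()) {
            if (e.getKey() != null && e.getValue() != null) {
                sorted.put(e.getKey(), e.getValue());
            }
        }
        StringBuilder sb = new StringBuilder("0"); // dummy label
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> features = new HashMap<>();
        features.put(4, 1999);
        features.put(0, 1);
        features.put(2, null); // dropped
        System.out.println(format(features)); // prints "0 0:1 4:1999"
    }
}
```

The returned line can be written to the temporary file and loaded through the path-based DMatrix constructor exactly as in the snippet above.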