Inconsistency in scores of Prediction using R and Java


#1

I train the model in R, store it in a binary format, and then load it in Java.
I was previously training the model in R 3.2.2 with XGBoost 0.4-2 and predicting with XGBoost JVM 0.72.
In that setup, the predict functions in R and Java gave me the same results for my input data.
Baseline:
8.847367984882893E-6
0.09427586244760379
4.262344683277366E-4
4.970554458048691E-4
0.004658706118213614

I then updated R to 3.5.1 and XGBoost to 0.71.1 and 0.71.2 (tried both), and predicted using XGBoost JVM 0.72 and 0.81 (tried both).
The R predict function gave this:
1.37787336598784e-05
9.53892665644611e-02
2.01653583650557e-04
3.29090567502518e-04
6.10666029073281e-03

The Java predictions gave this:
0.29515123439824215
0.9995756389884775
0.9507827625419463
0.9846390802292272
0.9981499148687549

Is there some inconsistency in the R and Java prediction functions or am I missing something?

Training Environment
readr_1.1.0.tar.gz
jsonlite_1.4.tar.gz
xgboost_0.71.2.tar.gz
chron_2.3-47.tar.gz
data.table_1.10.4-3.tar.gz
magrittr_1.5.tar.gz
stringr_1.0.0.tar.gz
stringi_0.5-5.tar.gz
bindrcpp_0.2.tar.gz
tibble_1.3.1.tar.gz
BH_1.62.0-1.tar.gz
R6_2.2.0.tar.gz
hms_0.3.tar.gz
assertthat_0.2.0.tar.gz
rlang_0.1.4.tar.gz
Rcpp_0.12.17.tar.gz

R Predict
Sigmoid <- function(x) {
  return(exp(x) / (exp(x) + 1))
}
raw.score <- predict(model, as.matrix(x))
raw.score <- Sigmoid(raw.score)

Java Predict
private double Sigmoid(double score)
{
    return Math.exp(score) / (Math.exp(score) + 1);
}
Booster booster = XGBoost.loadModel("model.bin");
DMatrix dMatrix = new DMatrix(fvec, 1, fvec.length);
float[][] prediction = booster.predict(dMatrix);
return Sigmoid(prediction[0][0]);
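
As a side note, the form exp(x) / (exp(x) + 1) used in both snippets turns into NaN in double arithmetic once the score exceeds roughly 709, because exp(x) overflows to Infinity. This does not explain the mismatch above, since both sides apply the same formula, but a minimal sketch of an equivalent form that avoids the overflow could look like this. The class name `StableSigmoid` is only for illustration and is not part of the original code.

```java
final class StableSigmoid {
    // Logistic function that never calls exp() on a large positive argument.
    static double sigmoid(double score) {
        if (score >= 0) {
            // exp(-score) is in (0, 1], so no overflow is possible here.
            return 1.0 / (1.0 + Math.exp(-score));
        }
        // For negative scores exp(score) is in (0, 1), again safe.
        double e = Math.exp(score);
        return e / (1.0 + e);
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0.0));
        System.out.println(sigmoid(1000.0));
        // The original form, for comparison: Infinity / Infinity.
        System.out.println(Math.exp(1000.0) / (Math.exp(1000.0) + 1));
    }
}
```

For scores in the usual range the two forms should agree up to rounding.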

There was no code change in either training or prediction. Is there something I am missing?

I also tried predicting with models trained on R 3.2.2 / XGBoost 0.4-2 using R 3.5.1 / XGBoost 0.71.2 and got the following results:
0.282819604863266
0.997937636304315
0.970238370838886
0.979681025161069
0.997780969263874

which is similar to the predictions from models trained on R 3.5.1 / XGBoost 0.71.2 and scored with XGBoost JVM 0.72.


#2

How are you loading your test data in Java?


#3

I read the CSV and then convert double to float before passing it to the DMatrix:
public List<double []> getDataFromCSV(String path){

    List<double[]> data = new ArrayList<>();
    BufferedReader br = null;
    String line = "";
    String cvsSplitBy = ",";

    try {

        br = new BufferedReader(new FileReader(path));
        br.readLine();
        while ((line = br.readLine()) != null) {

            // use comma as separator
            String[] columns = line.split(cvsSplitBy);
            double [] valsOfCols = Arrays.stream(columns)
                .mapToDouble(Double::parseDouble)
                .toArray();
            data.add(valsOfCols);
        }

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (br != null) {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    return data;
}

private float[] toFloat(double [] doubleArray)
{
    float[] floatArray = new float[doubleArray.length];
    for (int i = 0 ; i < doubleArray.length; i++)
    {
        floatArray[i] = (float) doubleArray[i];
    }
    return floatArray;
}
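
A more compact sketch of the same loading step, using try-with-resources and parsing straight to float so that no intermediate double[] is needed. The class name `CsvFloatRows` and the `readRows(Reader)` signature are hypothetical. Like the code above, it assumes a header line and purely numeric, comma-separated columns.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class CsvFloatRows {
    // Reads numeric CSV rows into float arrays, skipping the first (header) line.
    static List<float[]> readRows(Reader source) {
        List<float[]> rows = new ArrayList<>();
        // try-with-resources closes the reader even if parsing fails.
        try (BufferedReader br = new BufferedReader(source)) {
            br.readLine(); // skip header
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // unlike the original, blank lines are ignored
                }
                String[] columns = line.split(",");
                float[] row = new float[columns.length];
                for (int i = 0; i < columns.length; i++) {
                    row[i] = Float.parseFloat(columns[i].trim());
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<float[]> rows = readRows(new StringReader("a,b\n1.5,2\n0.25,843.12\n"));
        System.out.println(rows.size() + " rows, first value " + rows.get(0)[0]);
    }
}
```

For a file on disk, the same method could be called with `new FileReader(path)`.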

#4

I tried training with R 3.5.1 and XGBoost 0.8.0.1 and predicting with JVM 0.81; I still get the same results.


#5

Does your data have missing values? See https://github.com/dmlc/xgboost/issues/3634.


#6

Hey, we don’t have any missing values. I just tried setting it to 0 while reading into the DMatrix, and it did not help; it still produces the same scores.


#7

Can you post your model here?


#8

Hey, we cannot share our model, but we are working on a way around that. I tried running some toy models just to see if they work in both R and Java.
It looks like they do give the same results. Our data has some features that may be hitting an edge case which causes the models to produce different results.
Does this sound like anything you have encountered before?


#9

@venky177 This may be relevant: https://github.com/dmlc/xgboost/issues/3960#issuecomment-447234404. Make sure that you are converting your input data into 32-bit floating-point type.
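
The reasoning behind this advice, as far as the linked issue goes, is that XGBoost keeps feature values and split thresholds as 32-bit floats, so a value compared at double precision can fall on the other side of a split than the same value after narrowing to float. A minimal sketch of that effect in plain Java follows. The class `FloatNarrowingDemo` and the threshold 0.1f are made up for illustration and do not come from any real model.

```java
final class FloatNarrowingDemo {
    // Split test with the feature left at double precision.
    // The float threshold is widened to double for the comparison.
    static boolean goesLeftAsDouble(double feature, float threshold) {
        return feature < threshold;
    }

    // Split test with the feature narrowed to float first.
    static boolean goesLeftAsFloat(double feature, float threshold) {
        return (float) feature < threshold;
    }

    public static void main(String[] args) {
        double feature = 0.1;    // value as parsed from a CSV into a double
        float threshold = 0.1f;  // hypothetical split threshold stored as float

        // 0.1f is slightly larger than the double 0.1, so the two tests disagree.
        System.out.println(goesLeftAsDouble(feature, threshold));
        System.out.println(goesLeftAsFloat(feature, threshold));
    }
}
```

Whether this is what happens in the case above depends on how each binding converts the input, which the thread does not establish.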


#10

Hello, I saved the data to a DMatrix.buffer file in R, loaded that file in Java, and it gave me the correct result.
So the difference lies only in the way we are reading data into a DMatrix in Java. Can you spot the bug in the Java code?
How do we convert a float array to a DMatrix correctly?


#11

Interesting. Can you post a snippet of your data? Is it in CSV or LIBSVM?


#12

Is it possible that, because R trained with indices starting from 1 and Java predicts with indices starting from 0, the predictions are mismatched?

“Spark assumes that the dataset is using 1-based indexing (feature indices starting with 1). However, when you do
prediction with other bindings of XGBoost (e.g. Python API of XGBoost), XGBoost assumes that the dataset is using
0-based indexing (feature indices starting with 0) by default. It creates a pitfall for the users who train model with
Spark but predict with the dataset in the same format in other bindings of XGBoost. The solution is to transform the
dataset to 0-based indexing before you predict with, for example, Python API, or you append ?indexing_mode=1
to your file path when loading with DMatrix. For example in Python:
xgb.DMatrix('test.libsvm?indexing_mode=1')”
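
For data that is in LIBSVM format, the reindexing described in the quote could be sketched as below. The class `LibsvmReindex` and its `toZeroBased` helper are hypothetical and handle only plain `label index:value ...` lines (no comments or `qid:` fields). This does not apply to CSV input, where columns are positional.

```java
final class LibsvmReindex {
    // Rewrites one LIBSVM line from 1-based to 0-based feature indices.
    static String toZeroBased(String line) {
        String[] tokens = line.trim().split("\\s+");
        StringBuilder out = new StringBuilder(tokens[0]); // the label is kept as-is
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("expected index:value, got " + tokens[i]);
            }
            int index = Integer.parseInt(tokens[i].substring(0, colon));
            if (index < 1) {
                throw new IllegalArgumentException("input is not 1-based: " + tokens[i]);
            }
            // substring(colon) keeps the ":value" part untouched.
            out.append(' ').append(index - 1).append(tokens[i].substring(colon));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toZeroBased("1 1:0.5 3:2.0"));
    }
}
```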

My data is in CSV form, with high-precision floating-point values (0.xxx with 18 decimal places). Most of the columns are like that; some have 0s and 1s, and some have doubles such as 843.12.

The data is in double format, and the toFloat function then converts it to float. This float vector is fed to the DMatrix constructor one row at a time, and predictions are made.


#13

int[] colIndex = new int[fvec.length];
for (int i = 0; i < colIndex.length; i++) colIndex[i] = i;
DMatrix dMatrix = new DMatrix(new long[]{0, fvec.length}, colIndex, fvec, DMatrix.SparseType.CSR, fvec.length);
float[][] prediction = booster.predict(dMatrix);

This works with models trained on R 3.5.1 / XGBoost 0.71.1 but not with R 3.2.2 / XGBoost 0.4-2.

DMatrix dMatrix = new DMatrix(fvec, 1, fvec.length);

This does not work with models trained on R 3.5.1 / XGBoost 0.71.1 but works with R 3.2.2 / XGBoost 0.4-2.
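
One possible explanation, not confirmed in this thread, is that the dense constructor treats some particular value (such as 0) as missing, whereas explicit CSR entries keep every value, zeros included. The CSR call above encodes a single row; a sketch that builds the same three arrays for several rows is shown below. The class `DenseToCsr` is hypothetical, assumes all rows have the same length, and deliberately stores every cell.

```java
import java.util.Arrays;

final class DenseToCsr {
    final long[] rowHeaders; // start offset of each row, plus one final end offset
    final int[] colIndex;    // column index of each stored value
    final float[] data;      // stored values, row by row
    final int numCols;

    private DenseToCsr(long[] rowHeaders, int[] colIndex, float[] data, int numCols) {
        this.rowHeaders = rowHeaders;
        this.colIndex = colIndex;
        this.data = data;
        this.numCols = numCols;
    }

    // Stores every cell explicitly, so a 0.0f stays a value rather than a gap.
    static DenseToCsr fromRows(float[][] rows) {
        if (rows.length == 0) {
            throw new IllegalArgumentException("no rows");
        }
        int numCols = rows[0].length;
        long[] rowHeaders = new long[rows.length + 1];
        int[] colIndex = new int[rows.length * numCols];
        float[] data = new float[rows.length * numCols];
        int k = 0;
        for (int r = 0; r < rows.length; r++) {
            if (rows[r].length != numCols) {
                throw new IllegalArgumentException("ragged row " + r);
            }
            rowHeaders[r] = k;
            for (int c = 0; c < numCols; c++) {
                colIndex[k] = c;
                data[k] = rows[r][c];
                k++;
            }
        }
        rowHeaders[rows.length] = k;
        return new DenseToCsr(rowHeaders, colIndex, data, numCols);
    }

    public static void main(String[] args) {
        DenseToCsr csr = fromRows(new float[][] {{1f, 0f}, {0f, 2f}});
        System.out.println(Arrays.toString(csr.rowHeaders));
        System.out.println(Arrays.toString(csr.colIndex));
        System.out.println(Arrays.toString(csr.data));
        // The arrays would then go to the constructor used in the post above:
        // new DMatrix(csr.rowHeaders, csr.colIndex, csr.data, DMatrix.SparseType.CSR, csr.numCols)
    }
}
```

Batching rows this way also avoids creating one DMatrix per row, though it does not by itself explain why the older models behave differently.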