XGBoost4J multiclass classification in Java

I am using XGBoost4J for a multiclass classification problem with 4 attributes currently (this will increase soon). Since the help available for implementing XGBoost from Java seems very limited, I worked out how to get it running on my own. I have a few questions I am not sure about, so I thought I would ask here. Currently this is my implementation for training,

Map<String, DMatrix> watches = new HashMap<String, DMatrix>() {
    {
        put("train", trainingDM);
        put("test", validationDM);
    }
};

Booster booster = XGBoost.train(trainingDM, params, 100, watches, null, null);

And this is my configuration,

int numRounds = 100;
Map<String, Object> params = new HashMap<>();
params.put("objective", "multi:softmax");
params.put("verbosity", 1);
params.put("eta", 0.3);
params.put("alpha", 2);
params.put("lambda", 3);
params.put("gamma", 0);
params.put("num_class", 4);

I also test it to get the accuracy using,

float[][] pred = booster.predict(testDM);

The validation set is 20% of the input, the test set is 10%, and the training set is the remaining 70%. Of course, the data is shuffled and there is no ordering pattern in the input.

My questions are,
I use this constructor of DMatrix to create it, since I receive the input as a REST call,

DMatrix(float[] data, int num_rows, int num_cols);
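
For reference, this is roughly how I build it, assuming XGBoost4J's DMatrix and setLabel APIs (flat, numRows, numCols and labels are placeholders for the data parsed from the REST payload):

import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoostError;

class DenseMatrixExample {
    // `flat` is the row-major flattened feature matrix from the REST payload,
    // `labels` holds one class index per row; both names are placeholders.
    static DMatrix buildDense(float[] flat, int numRows, int numCols,
                              float[] labels) throws XGBoostError {
        DMatrix dm = new DMatrix(flat, numRows, numCols);
        dm.setLabel(labels); // one float per row, e.g. 0f..3f for num_class=4
        return dm;
    }
}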

But since there are a lot of categorical and string features in my data, my feature set becomes huge once I one-hot encode the string and categorical data (my own implementation), and the application crashes because it runs out of memory. How can I work around this? Is there a converter to libsvm format so that I can use that instead? What's a good solution to this?

Why does predict return a float[][] instead of a float[]? Is it so that a vector result (e.g. one value per class) can be returned? My class labels are currently label encoded. Is that wrong?
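
For context, this is how I currently read the predictions; my understanding (please correct me if wrong) is that with multi:softmax each inner array has length 1 and holds the class index directly, while multi:softprob would instead return num_class probabilities per row that need an argmax:

// multi:softmax: pred[i] has length 1 and holds the predicted class index.
// multi:softprob would instead give pred[i] of length num_class.
float[][] pred = booster.predict(testDM);
int[] classes = new int[pred.length];
for (int i = 0; i < pred.length; i++) {
    classes[i] = (int) pred[i][0];
}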

Is there a way I can draw curves to evaluate whether I am overfitting or underfitting?
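
One approach I am considering, assuming my XGBoost4J version has the XGBoost.train overload that fills a per-round metrics array (I believe recent versions do, but please correct me), is to record train and test merror for every round and plot the two curves:

// Use a LinkedHashMap so the watch order is deterministic
// (index 0 = train, index 1 = test).
Map<String, DMatrix> watches = new LinkedHashMap<>();
watches.put("train", trainingDM);
watches.put("test", validationDM);

// metrics[w][r] is filled with the eval metric for watch w at round r.
float[][] metrics = new float[watches.size()][numRounds];
Booster booster = XGBoost.train(trainingDM, params, numRounds, watches,
                                metrics, null, null);

// Dump as CSV and plot: a train curve that keeps falling while the test
// curve flattens or rises suggests overfitting.
for (int r = 0; r < numRounds; r++) {
    System.out.println(r + "," + metrics[0][r] + "," + metrics[1][r]);
}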

Also, very importantly, I see that my errors at the end of training finish at,

[84]	test-merror:0.636145	train-merror:0.490371

What is considered a good error? Is something like test-merror:0.111 and train-merror:0.111 a good value to aim for? I am asking because I have not been able to figure out (or find online) a good benchmark for these numbers, to judge whether they are good or bad.

@Zhang-Liao @CodingCat @hcho3 Can you please help me with this? The main concerns I have are,

  • My matrix is becoming huge (and sparse) because most of my attributes are either String or Category attributes, and the application crashes because it exceeds memory. If I convert the data to libsvm format (using my own implementation), how should I pass the test matrix, and what would the class values for it be?

  • Please verify whether the steps I am following to create the model and get predictions are right. I am asking because I pieced my code together with help from various places, and just want to make sure I am doing the right thing.

Thanks,
Ram

I can’t help you with the Java code, since I’m not familiar with it. As for the first question, I’m afraid that using the libsvm format wouldn’t actually help. The issue is really the high cardinality of the categorical features. You have the memory problem because 1) XGBoost requires one-hot encoding of categorical features, and 2) the memory footprint of XGBoost is directly proportional to the number of features. So with lots of dummy binary features, XGBoost takes up lots of memory.

You have a few options:

  1. Get a machine with more memory.
  2. Consolidate categories so that you have fewer distinct categories.
  3. Use another package that handles categorical features without the need for one-hot encoding, such as LightGBM. Your options may be limited here, since you are using Java, not Python.

Note on 3) We are hoping to implement something similar, but we are not there yet. Subscribe to https://github.com/dmlc/xgboost/issues/6503 to check the progress.
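
For option 2, a rough sketch like the following (the helper and the __other__ bucket name are just illustrative) keeps only the k most frequent values of a column and folds the rest into one shared bucket, which caps the number of one-hot columns at k + 1:

import java.util.*;
import java.util.stream.Collectors;

class CategoryConsolidator {
    // Keep only the k most frequent values of a categorical column and map
    // everything else to a shared "__other__" bucket.
    static Map<String, Integer> buildIndex(List<String> column, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) counts.merge(v, 1, Integer::sum);
        List<String> top = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        Map<String, Integer> index = new HashMap<>();
        for (String v : top) index.put(v, index.size());
        index.put("__other__", index.size());
        return index;
    }

    static int encode(Map<String, Integer> index, String value) {
        return index.getOrDefault(value, index.get("__other__"));
    }
}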

Thanks @hcho3! I can try LightGBM, but would that be as effective as XGBoost?

Would that be as effective as XGBoost?

It depends on your particular application. Also, you should check if LightGBM offers a convenient Java interface.

You can create the DMatrix in compressed sparse row format (i.e. the format libsvm stores things in). You can see an example of this in Tribuo’s XGBoost interface - https://github.com/oracle/tribuo/blob/main/Common/XGBoost/src/main/java/org/tribuo/common/xgboost/XGBoostTrainer.java#L277 where it converts from Tribuo’s input format (which is compressed sparse row) into a CSR DMatrix. We also have loaders for libsvm format data directly if that’s useful to you. Tribuo also has a full set of classification evaluation metrics so you can see if the loss lines up with what you want.
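
As a rough sketch (the exact constructor signature varies across XGBoost4J versions; recent ones take the number of columns as an extra shape parameter), building a CSR DMatrix looks something like this:

import java.util.List;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoostError;

class CsrExample {
    // rowIndices.get(i) holds the non-zero column indices of row i (e.g. one
    // index per one-hot-encoded categorical), rowValues.get(i) the matching values.
    static DMatrix buildCsr(List<int[]> rowIndices, List<float[]> rowValues,
                            int numCols) throws XGBoostError {
        long nnz = 0;
        for (int[] idx : rowIndices) nnz += idx.length;
        long[] headers = new long[rowIndices.size() + 1];
        int[] colIndex = new int[(int) nnz];
        float[] data = new float[(int) nnz];
        int pos = 0;
        for (int i = 0; i < rowIndices.size(); i++) {
            headers[i] = pos;
            int[] idx = rowIndices.get(i);
            float[] vals = rowValues.get(i);
            for (int j = 0; j < idx.length; j++) {
                colIndex[pos] = idx[j];
                data[pos] = vals[j];
                pos++;
            }
        }
        headers[rowIndices.size()] = pos;
        return new DMatrix(headers, colIndex, data, DMatrix.SparseType.CSR, numCols);
    }
}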

Tribuo doesn’t currently support the watches or early stopping functionality in XGBoost, but I’m in the middle of upgrading it to XGBoost 1.3.2 from 1.0.0 so it will gain some additional functionality as we build out the interface.

@Craigacp Sure, thanks! That would be very useful for me. Since I receive the input (feature values: Strings, Categories, Numbers, Booleans) as a List<List> which is then converted to create the DMatrix, I think I would have to write my own converter to compressed sparse row format. Is there already a converter I can use? If not, I can write my own.
Thanks!

I don’t think there is one in the XGBoost4j library. We wrote our own for Tribuo, and as you can see it’s fairly straightforward. It’s basically just managing the feature domain and associated indices. This can be a little tricky to make work consistently at test time, but it’s not too hard.
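
The core of it is a map from feature name to column index that you build during training and then freeze, so that at test time unseen features are dropped rather than assigned new indices. A minimal sketch, with hypothetical names:

import java.util.HashMap;
import java.util.Map;

class FeatureDomain {
    private final Map<String, Integer> indices = new HashMap<>();
    private boolean frozen = false;

    // At training time: assign a new column index to each unseen feature name.
    // At test time (after freeze()): unseen features get -1, so the column
    // space stays identical to the one the model was trained on.
    public int indexOf(String featureName) {
        Integer idx = indices.get(featureName);
        if (idx != null) return idx;
        if (frozen) return -1; // unknown at test time: caller should drop it
        int next = indices.size();
        indices.put(featureName, next);
        return next;
    }

    public void freeze() { frozen = true; }

    public int numColumns() { return indices.size(); }
}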

Sure, thanks @Craigacp, I will write my own. I was just checking whether one was already available so I would not have to reinvent the wheel, and could use something already proven to work consistently. But that should be fine; I will write my own implementation. Thanks a lot @Craigacp!