Jupyter, features and growing trees

(I’d very much appreciate feedback or just plain thoughts on my weird case; thanks in advance!)

I wonder if I’m doing something wrong, not code-wise but let’s say machine-wise…

I am running an XGBClassifier on various combinations of a couple dozen different features, and I get reasonably good classification outcomes in each case.
Then, as a sanity check, I run it on one single feature, which moreover is not expected to be crucially relevant to the task. Amazingly, I still get classifier output that looks meaningful.

Is there any chance that the forest picks up something from previous runs?
I run in Jupyter and restart the kernel between runs. I’d like to hear from someone more knowledgeable whether this is sufficient.

Is there any chance that the forest behaves erratically when it is given only one feature?

Other suggestions?

The ensemble does not inherit anything between runs.
The only way for that to happen is if you explicitly pass the previous result to the new training so it starts from there.
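
For instance, with the scikit-learn wrapper, continuing from a previous result requires handing over the old booster explicitly via the `xgb_model` argument of `fit()`; here is a minimal sketch with toy data (not your setup):

```python
import numpy as np
from xgboost import XGBClassifier

# toy data just to make the example self-contained
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# a fresh model: nothing carries over from earlier runs or other notebook cells
clf = XGBClassifier(n_estimators=100)
clf.fit(X, y)

# continued training only happens if you explicitly pass the old booster,
# e.g. via the xgb_model argument of fit()
clf_continued = XGBClassifier(n_estimators=50)
clf_continued.fit(X, y, xgb_model=clf.get_booster())
```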
If you are not continuing from a previous result and you still get a good outcome, that can mean several things:

- Are you validating your classifier against data that was unseen during training? If your feature is continuous and you are checking performance in-sample, the accuracy will look good.
- It could simply mean that your feature holds predictive power, but since you say it is not related to the target … that is unlikely.

If you can post a code snippet, perhaps there is more insight there.


My apologies for the late reply… (and thanks for your answer!)

Are you validating your classifier against data that was unseen during training?

Yes, the validation is done on data from the end of the dataset.

If your feature is continuous and you are checking performance in-sample, the accuracy will look good.

Do you mean the case where the validation data are chosen randomly from within the body of the training sample? (If so, I think this is avoided by the validation data being at the end of the dataset.)
Or do you mean something else?
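
For reference, my split is roughly like this (a simplified sketch with placeholder data, not my actual code):

```python
import numpy as np

# placeholder stand-ins for my (time-ordered) feature matrix and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = rng.integers(0, 2, size=500)

# the last 20% of rows is held out for validation, the rest is used for training
split = int(len(X) * 0.8)
X_train, y_train = X[:split], y[:split]
X_valid, y_valid = X[split:], y[split:]
```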

If you can post a code snippet, perhaps there is more insight there.

What kind of snippet could I include in this case? I can’t imagine anything short of the whole code, and I’m not sure that would be useful.

I’m glad that, judging from your answer, there is nothing obviously amiss here. I’ll keep checking.

The problem, it seems to me, is that you are using data from the training set to evaluate your classifier’s performance. The model has seen that data during training, and apparently it has enough variance to overfit the training data. That is why you are seeing such good performance on a single feature: you are giving the model so much “time” with the data that it has memorized it.

You have to evaluate the performance of your classifier with data that was not used during training.

The way this is usually done is to randomly set aside a portion of your training set (say 20%) and use it not for training but for evaluation afterwards.
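
A minimal sketch of that, assuming the usual scikit-learn / xgboost setup and toy data in place of yours:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# toy data standing in for your single feature and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# hold out a random 20% that the model never sees during training
# (for time-ordered data you may prefer shuffle=False, i.e. the last 20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# evaluate only on the held-out data
print(accuracy_score(y_test, clf.predict(X_test)))
```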
If you still get unrealistically good performance, then there is probably leakage in your target, meaning the target of sample a also contains information about the target of sample a+1. That is usually the case when your target is something like a weekly return and your samples are daily values. Detecting that is impossible without semantic knowledge of the dataset.
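
A sketch of the kind of overlap I mean, using made-up daily prices and a 5-day forward return as the target:

```python
import numpy as np
import pandas as pd

# made-up daily prices
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(size=30).cumsum())

# target: return over the next 5 days, computed for every daily row
target = prices.shift(-5) / prices - 1

# consecutive rows share 4 of the 5 future days that go into their targets,
# so the target of sample a carries information about the target of sample a+1
print(target.head(10))
```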

The code I was referring to is the preparation of the training set, the training itself, and the prediction used for performance evaluation.