How to prescribe a limit to response value according to its physical meaning?

melonki · October 2, 2019, 1:15pm

Dear Contributors and Members,
I am trying to apply XGBoost to the assessment of environmental risk, where each variable should stay within a physically reasonable range. For example, the concentration of a certain pollutant should not be less than 0. However, my model produced predictions with a few negative values. Is there any way to prescribe a limit on the response value when training a model? Or is the only option to change these negative values to zero?

I would greatly appreciate any explanation or suggestions in this matter.

PS: I’m using xgboost 1.0.0.1 in R 3.6.1.

With kind regards,
Melonki

hcho3 · October 3, 2019, 3:34am

I don’t think this is currently possible. We’d need to add new loss (objective) functions that are somehow truncated, such as one based on a truncated normal distribution.

jrinne · October 3, 2019, 1:48pm

I thought trees did not extrapolate.

Wouldn’t this suggest that there must be some negative values for the pollutant (using your example) in the training labels?

Considering the if-else logic of a tree, how does a split arrive at a mean (RMSE metric) or median (MAE metric) value of less than 0 for a leaf if every single label in the training data is greater than zero?

Maybe I am missing something, but if I am not, it seems the answer might be to remove the negative labels from the training data. They would be legitimate outliers to remove because they are erroneous: as you say, the pollutant cannot be less than zero.

thvasilo · October 4, 2019, 1:14pm

@jrinne In boosting, predicted values can fall outside the range of the observed labels; see this SO question.
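
To make the point concrete, here is a minimal sketch of squared-error boosting with depth-1 stumps in plain Python. It is not XGBoost's actual algorithm (no regularization, a learning rate of 1, and the label mean as the starting score are all simplifying assumptions), and the names `fit_stump` and `boost` are made up for illustration. Every label is non-negative, yet one prediction ends up below zero, because each round fits the residuals of the previous rounds rather than the labels themselves.

```python
def fit_stump(X, resid):
    """Find the single split that minimises squared error on the residuals.

    Returns (feature index, threshold, left leaf value, right leaf value).
    """
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every distinct value except the largest.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, resid) if row[j] <= t]
            right = [r for row, r in zip(X, resid) if row[j] > t]
            lv = sum(left) / len(left)      # leaf value = mean residual
            rv = sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]


def boost(X, y, rounds, lr=1.0):
    """Additive boosting: start from the label mean, then add one stump per round."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lv, rv = fit_stump(X, resid)
        pred = [p + lr * (lv if row[j] <= t else rv) for row, p in zip(X, pred)]
    return pred


# Toy data: two binary features, all labels >= 0.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [1.0, 0.0, 0.0, 0.0]

preds = boost(X, y, rounds=2)
print(min(y), preds)  # the last prediction is negative although min(y) is 0
```

The first stump splits on one feature and the second on the other, so the point (1, 1) receives a downward correction from both and overshoots past zero. No single tree extrapolates here; the sum of trees does.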

@melonki You can try to enforce non-negative values, for example by using a Gamma objective if it fits your problem (for example, a long-tailed dependent variable), or by log-transforming the dependent variable before training.
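
As a rough sketch of the two simplest workarounds (clamping after prediction, as the original question suggests, and a log transform around training), the plain-Python helpers below show why each keeps the output non-negative. The function names and the numbers are hypothetical, no model is trained here, and the log route assumes strictly positive labels; zeros would need an offset or a different approach.

```python
import math


def clamp_nonnegative(preds):
    """Post-hoc fix: replace any negative prediction with zero."""
    return [max(p, 0.0) for p in preds]


def to_log_scale(labels):
    """Transform strictly positive labels before training (log of 0 is undefined)."""
    return [math.log(v) for v in labels]


def from_log_scale(log_preds):
    """Back-transform predictions; exp() never returns a negative number."""
    return [math.exp(z) for z in log_preds]


# Hypothetical raw predictions from a model trained on the original scale.
raw_preds = [0.75, 0.25, 0.25, -0.25]
clamped = clamp_nonnegative(raw_preds)

# Hypothetical workflow on the log scale: train on log_labels instead of labels,
# then map whatever the model predicts back to the original scale.
labels = [0.5, 2.0, 8.0]
log_labels = to_log_scale(labels)
hypothetical_log_preds = [-3.0, 0.0, 2.5]  # may be any real number
back_transformed = from_log_scale(hypothetical_log_preds)

print(clamped)
print(back_transformed)
```

One caveat with the log route: a model trained with squared error on the log scale targets the mean of log(y), so the back-transformed value is closer to a geometric mean than to the arithmetic mean of y.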

See the Objective part of the learning task parameters.