I am using an XGBoost classifier on my data for binary classification, so I am using the "binary:logistic" objective, whose loss is the negative log-likelihood of the Bernoulli distribution. Since boosting builds each tree on the previous trees' residuals/errors, bigger outliers will lead to bigger residuals.
However, most of my data is positively skewed, so I am unsure how to approach removing outliers. When I apply a transformation such as Yeo-Johnson (because my data also has negative values), it reduces the skew dramatically: some columns go from a skew of 400 down to -36, for example.
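Roughly what I'm doing for the transformation (a sketch using scikit-learn's `PowerTransformer` on toy data; my real columns are different):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# toy stand-in: positively skewed values plus some negatives
x = np.concatenate([rng.lognormal(0, 2, 1000), -rng.random(50)]).reshape(-1, 1)

# Yeo-Johnson handles negative values (unlike Box-Cox)
pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x)

print(skew(x.ravel()), skew(x_t.ravel()))  # skew shrinks sharply after the transform
```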
Then, when I use a quantile method to remove outliers, e.g. dropping rows above the 99th percentile or below the 1st percentile, I actually lose 50% of my training data, which I don't want. So the above doesn't seem applicable to my data (and this was based on just one column).
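For reference, the quantile filter I tried looks roughly like this (a sketch; `df` and the column name are placeholders for my actual data):

```python
import numpy as np
import pandas as pd

# toy stand-in for one skewed column of the training data
rng = np.random.default_rng(1)
df = pd.DataFrame({"feature": rng.lognormal(0, 2, 1000)})

lo, hi = df["feature"].quantile([0.01, 0.99])
mask = df["feature"].between(lo, hi)
df_filtered = df[mask]

# on a continuous column this keeps ~98% of rows; with heavy ties
# (e.g. many repeated zeros) the quantiles can collapse and drop far more
print(len(df_filtered) / len(df))
```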
To minimize the effect of outliers, I might just winsorize them instead.
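A winsorizing sketch, clipping at the 1st/99th percentiles rather than dropping rows (`scipy.stats.mstats.winsorize` does the same thing; the data here is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(0, 2, 1000)  # toy positively skewed column

lo, hi = np.quantile(x, [0.01, 0.99])
x_wins = np.clip(x, lo, hi)  # extreme values are capped; no rows are lost

print(x.max(), x_wins.max())  # the max is pulled in to the 99th percentile
```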
Does anybody have any thoughts on the above, or ideas on what else I could do? Can anybody direct me to some resources that may help?