How does XGBoost only predict the likelihood of an insurance claim?

gongkuie · April 28, 2020, 9:07am

Hello, I’m reading XGBoost: A Scalable Tree Boosting System, and I’m not clear about the paragraph below:
“The first dataset we use is the Allstate insurance claim dataset. The task is to predict the likelihood and cost of an insurance claim given different risk factors. In the experiment, we simplified the task to only predict the likelihood of an insurance claim. This dataset is used to evaluate the impact of sparsity-aware algorithm in Sec. 3.4. Most of the sparse features in this data come from one-hot encoding. We randomly select 10M instances as training set and use the rest as evaluation set.”
The label in the Allstate insurance claim dataset is the claim amount, so how can I predict the likelihood of a claim?

hcho3 · May 2, 2020, 7:29am

You can treat it as a binary classification problem, where a zero claim amount is coded as label 0 and a non-zero claim amount is coded as label 1.
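
For illustration, the recoding could look roughly like the sketch below. The claim amounts and the helper name `binarize_claims` are made up for this example and are not taken from the dataset or the paper.

```python
# Hypothetical claim amounts; the real values would come from the dataset's
# claim amount column.
claim_amounts = [0.0, 0.0, 153.2, 0.0, 47.5]


def binarize_claims(amounts):
    """Return 1 where the claim amount is non-zero (positive), else 0."""
    return [1 if amount > 0 else 0 for amount in amounts]


labels = binarize_claims(claim_amounts)
print(labels)  # [0, 0, 1, 0, 1]
```

Labels like these can then be trained with a binary objective such as XGBoost's `binary:logistic`, whose predictions are probabilities, i.e. the likelihood of a claim.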

gongkuie · May 4, 2020, 3:09am

I downloaded the dataset from the Kaggle Allstate Claims Severity competition page. In the train.csv file, the smallest value of loss is 20.99. Sorry, I am new to Kaggle. Can you tell me where I can find these zero claims?

hcho3 · May 4, 2020, 3:30am

You are looking at a different dataset. Try https://www.kaggle.com/c/ClaimPredictionChallenge, which is the one linked from the XGBoost paper.