I created an Accelerated Failure Time model with Xgboost, but cannot figure out how to extract the model probabilities and it doesn’t seem to be covered in the documentation. predict_proba does not work.

Any ideas?

Thanks in advance.

I have been struggling with this problem too. Here’s what I’ve tried so far. It works pretty well for aft_loss_distribution = ‘normal’, but not for aft_loss_distribution = ‘extreme’. I’d love it if an expert could chime in and point out what I might be doing wrong in the ‘extreme’ case.

According to the documentation,

log(Y) = T(x) + σ*Z

Here, Y is the time to death, T(x) is the output of the XGBoost model, σ is the scale parameter, and Z is a noise variable drawn from a known distribution. Assume Z follows the standard normal distribution. Then, in order to estimate P(Y > Y_t), i.e. the probability that the user will still be alive after time Y_t, all you have to do is the following:

```
import numpy as np

# Sample log death times: log(Y) = T(x) + σ*Z with Z ~ Normal(0, 1).
# np.log(T(x)) puts the predicted time back on the log scale.
distribution = np.random.normal(0, σ, 10000) + np.log(T(x))
# Fraction of sampled death times that exceed Y_t
p_survival = 1 - np.sum(distribution < np.log(Y_t)) / len(distribution)
```
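For the normal case, the same quantity can also be computed in closed form rather than by sampling (a sketch, not from the thread; `t_x`, `y_t`, and `sigma` stand for the predicted time, the threshold, and σ): since log(Y) is normal with mean log(T(x)) and standard deviation σ, P(Y > Y_t) = 1 − Φ((log Y_t − log T(x)) / σ).

```python
import math

def p_survival_normal(t_x, y_t, sigma):
    """Closed-form P(Y > y_t) under the normal AFT model (sketch).
    Uses Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) for the standard normal CDF."""
    z = (math.log(y_t) - math.log(t_x)) / sigma
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

This should agree with the Monte Carlo estimate up to sampling noise, and avoids drawing 10,000 samples per prediction.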

As I mentioned earlier, I got pretty good results when using this for Z = normal distribution. By good results, I mean that the p_survival values were well-calibrated for various values of Y_t. More specifically, when I bucket the data based on the magnitude of p_survival and compute the predicted probability and the actual probability in each bucket, the predicted and actual values line up pretty well.
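The bucketing check described above can be sketched roughly as follows (`calibration_buckets` is a hypothetical helper, not from the thread; `survived` is assumed to be 1 if the subject was still alive at Y_t):

```python
import numpy as np

def calibration_buckets(p_pred, survived, n_buckets=10):
    """Group predictions by predicted survival probability and compare the
    mean prediction in each bucket to the observed survival rate (sketch)."""
    p_pred = np.asarray(p_pred, dtype=float)
    survived = np.asarray(survived, dtype=float)
    # Bucket index 0..n_buckets-1; p_pred == 1.0 falls into the last bucket
    idx = np.minimum((p_pred * n_buckets).astype(int), n_buckets - 1)
    rows = []
    for b in range(n_buckets):
        mask = idx == b
        if mask.any():
            rows.append((b, p_pred[mask].mean(), survived[mask].mean()))
    return rows
```

For a well-calibrated model, the mean predicted probability and the observed rate should line up in every bucket.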

However, when I tried the following for Z = extreme distribution, the results look pretty poorly calibrated. Any idea what I could be doing wrong?

```
import numpy as np

# Same estimate, but with Gumbel noise for aft_loss_distribution = 'extreme'
distribution = np.random.gumbel(0, σ, 10000) + np.log(T(x))
p_churn = 1 - np.sum(distribution < np.log(Y_t)) / len(distribution)
```

Thanks for reading!

You may want to consult https://towardsdatascience.com/xgbse-improving-xgboost-for-survival-analysis-393d47f1384a which discusses calibration.

Thank you! I actually tried the xgbse package briefly in the past and it was very slow unfortunately :(.

Just want to make sure: it sounds like you don’t see a problem in my code for the extreme distribution? I’m worried that the results don’t look well calibrated because I have a bug in the code.

Are you sure `np.random.gumbel(0, σ, 10000)` is equivalent to `σ*np.random.gumbel(0, 1, 10000)`? Otherwise, I can’t find any fault with your code. I’m not a statistician by training, hence I recommended XGBSE, which seems to have more statistical rigor.

You are absolutely right; I got confused and just stuck σ in as the scale parameter, instead of multiplying it by np.random.gumbel(0, 1, 10000). Unfortunately, after correcting the code the predicted probabilities are still poorly calibrated, so the miscalibration seems to be real.
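One thing worth checking, sketched under the assumption that XGBoost’s ‘extreme’ AFT distribution follows the CDF F(z) = 1 − exp(−exp(z)) given in its AFT documentation: `np.random.gumbel` draws from the Gumbel-max distribution, whose CDF is exp(−exp(−z)), i.e. the mirror image. Under the Gumbel-min convention, the survival probability would be computable in closed form as:

```python
import math

def p_survival_extreme(t_x, y_t, sigma):
    """Closed-form P(Y > y_t), assuming the 'extreme' distribution has
    CDF 1 - exp(-exp(z)) (Gumbel-min), with z = (log y_t - log T(x)) / sigma."""
    z = (math.log(y_t) - math.log(t_x)) / sigma
    return math.exp(-math.exp(z))
```

If this convention holds, adding Gumbel-max samples to log(T(x)) would flip the skew of the noise relative to the training loss, which could explain miscalibration that survives the σ fix; comparing the Monte Carlo estimate against this closed form would be a quick way to tell.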