I created an Accelerated Failure Time model with Xgboost, but cannot figure out how to extract the model probabilities and it doesn’t seem to be covered in the documentation. predict_proba does not work.

Any ideas?

Thanks in advance.

I have been struggling with this problem too. Here’s what I’ve tried so far. It works pretty well for aft_loss_distribution = ‘normal’, but not for aft_loss_distribution = ‘extreme’. I’d love it if an expert could chime in and point out what I might be doing wrong in the ‘extreme’ case.

According to the documentation,

log(Y) = T(x) + σ*Z

Here, Y is the time to death, T(x) is the output of the XGBoost model, σ is the scale parameter, and Z is a noise variable drawn from a known distribution. Assume Z follows the standard normal distribution. Then, in order to estimate P(Y > Y_t), i.e. the probability that the user will still be alive after time Y_t, all you have to do is the following:

```
import numpy as np

# Sample log death times: log(Y) = T(x) + σ*Z with Z ~ Normal(0, 1).
# np.log(T(x)) puts the predicted time back on the log scale.
distribution = np.random.normal(0, σ, 10000) + np.log(T(x))
# Fraction of sampled death times that exceed Y_t
p_survival = 1 - np.sum(distribution < np.log(Y_t)) / len(distribution)
```
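For the normal case, the same quantity can also be computed in closed form rather than by sampling (a sketch, not from the thread; `t_x`, `y_t`, and `sigma` stand for the predicted time, the threshold, and σ): since log(Y) is normal with mean log(T(x)) and standard deviation σ, P(Y > Y_t) = 1 − Φ((log Y_t − log T(x)) / σ).

```python
import math

def p_survival_normal(t_x, y_t, sigma):
    """Closed-form P(Y > y_t) under the normal AFT model (sketch).
    Uses Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) for the standard normal CDF."""
    z = (math.log(y_t) - math.log(t_x)) / sigma
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

This should agree with the Monte Carlo estimate up to sampling noise, and avoids drawing 10,000 samples per prediction.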

As I mentioned earlier, I got pretty good results when using this for Z = normal distribution. By good results, I mean that the p_survival values were well-calibrated for various values of Y_t. More specifically, when I bucket the data based on the magnitude of p_survival and compute the predicted probability and the actual probability in each bucket, the predicted and actual values line up pretty well.
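The bucketing check described above can be sketched roughly as follows (`calibration_buckets` is a hypothetical helper, not from the thread; `survived` is assumed to be 1 if the subject was still alive at Y_t):

```python
import numpy as np

def calibration_buckets(p_pred, survived, n_buckets=10):
    """Group predictions by predicted survival probability and compare the
    mean prediction in each bucket to the observed survival rate (sketch)."""
    p_pred = np.asarray(p_pred, dtype=float)
    survived = np.asarray(survived, dtype=float)
    # Bucket index 0..n_buckets-1; p_pred == 1.0 falls into the last bucket
    idx = np.minimum((p_pred * n_buckets).astype(int), n_buckets - 1)
    rows = []
    for b in range(n_buckets):
        mask = idx == b
        if mask.any():
            rows.append((b, p_pred[mask].mean(), survived[mask].mean()))
    return rows
```

For a well-calibrated model, the mean predicted probability and the observed rate should line up in every bucket.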

However, when I tried the following for Z = extreme distribution, the results look pretty poorly calibrated. Any idea what I could be doing wrong?

```
import numpy as np

# Same estimate, but with Gumbel noise for aft_loss_distribution = 'extreme'
distribution = np.random.gumbel(0, σ, 10000) + np.log(T(x))
p_churn = 1 - np.sum(distribution < np.log(Y_t)) / len(distribution)
```

Thanks for reading!

You may want to consult https://towardsdatascience.com/xgbse-improving-xgboost-for-survival-analysis-393d47f1384a which discusses calibration.

Thank you! I actually tried the xgbse package briefly in the past and it was very slow unfortunately :(.

Just want to make sure: it sounds like you don’t see a problem in my code for the extreme distribution? I’m worried that the results don’t look well calibrated because I have a bug in the code.

Are you sure `np.random.gumbel(0, σ, 10000)` is equivalent to `σ*np.random.gumbel(0, 1, 10000)`? Otherwise, I can’t find any fault with your code. I’m not a statistician by training, hence I recommended XGBSE, which seems to have more statistical rigor.

You are absolutely right; I got confused and just stuck σ in as the scale parameter, instead of multiplying it by np.random.gumbel(0, 1, 10000). Unfortunately, after correcting the code the predicted probabilities are still poorly calibrated, so the miscalibration seems to be real.
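One thing worth checking, sketched under the assumption that XGBoost’s ‘extreme’ AFT distribution follows the CDF F(z) = 1 − exp(−exp(z)) given in its AFT documentation: `np.random.gumbel` draws from the Gumbel-max distribution, whose CDF is exp(−exp(−z)), i.e. the mirror image. Under the Gumbel-min convention, the survival probability would be computable in closed form as:

```python
import math

def p_survival_extreme(t_x, y_t, sigma):
    """Closed-form P(Y > y_t), assuming the 'extreme' distribution has
    CDF 1 - exp(-exp(z)) (Gumbel-min), with z = (log y_t - log T(x)) / sigma."""
    z = (math.log(y_t) - math.log(t_x)) / sigma
    return math.exp(-math.exp(z))
```

If this convention holds, adding Gumbel-max samples to log(T(x)) would flip the skew of the noise relative to the training loss, which could explain miscalibration that survives the σ fix; comparing the Monte Carlo estimate against this closed form would be a quick way to tell.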