I’m trying to model the survival time using the Cox proportional hazard model, i would like to use xgboost to do so. I know xgboost has a coxph loss implementation, however, my loss function is a bit modified, the survival times are grouped in different experimental groups, and im actually interested in the ranking among the groups and the permutation probabilities. Each of the groups can have different sizes and there may be right censored data.
For example, suppose we have the variables, time=[10,3,1,6,4,1,7,30,21,15,25,24], groups=[0,0,0,1,1,1,1,1,2,2,2,2], observed=[1,0,1,1,1,0,1,0,0,1,1,1], where time is the survival times, groups is the group id such that the survival times can be grouped according to it, and observed is the censoring status (1 is observed data, 0 denotes censored data).
I have implemented the model with neural network with Tensorflow, but now I would like to try it in xgboost, I’m not sure what the grad and hess as returned by my loss function are in my case.
I have tried the following code, but it doesnt work, the value of loss function isnt changing:
def custom_loss(y_pred, dataset):
groups = dataset.get_group()
label = dataset.get_label()
y_pred = np.split( y_pred, np.cumsum(groups) )[:-1]
label = np.split( label, np.cumsum(groups) )[:-1]
grad = np.array([])
hess = np.array([])
for i in range(len(y_pred)):
rank = np.argsort(label[i])
scores = y_pred[i].reshape(-1,1)
mask = rank.reshape(1,-1)>=rank.reshape(-1,1)
risk_set = np.sum(mask*np.exp(scores), axis=1)
mask = rank.reshape(1,-1)<rank.reshape(-1,1)
p = np.exp(scores)/risk_set.reshape(1,-1)
grad_grouped = 1 - np.sum(mask*p, axis=1)
hess_grouped = np.sum( (-p*(1-p))*mask, axis=1)
grad = np.hstack([grad, grad_grouped])
hess = np.hstack([hess, hess_grouped])
return grad, hess
def custom_eval(y_pred, dataset):
groups = dataset.get_group()
label = dataset.get_label()
y_pred = np.split( y_pred, np.cumsum(groups) )[:-1]
label = np.split( label, np.cumsum(groups) )[:-1]
likelihoods = np.array([])
for i in range(len(y_pred)):
rank = np.argsort(label[i])
scores = y_pred[i].reshape(-1,)
mask = rank.reshape(1,-1)>=rank.reshape(-1,1)
risk_set = np.sum(mask*np.exp(scores), axis=1)
lik = np.log( np.exp(scores)/risk_set )
likelihoods = np.hstack([likelihoods, lik])
loss = -np.sum(likelihoods)/len(y_pred)
return 'cox_loss', loss, False
i have also implemented in Tensorflow for the neural network version:
def compute_cox_loss(time, predict, observed, groups):
'''predict is the score output by the model,
other param are the same as in the problem description
'''
#ensure the data have the correct shape
time = tf.reshape(time, (-1,))
predict = tf.reshape(predict, (-1,))
observed = tf.reshape(observed, (-1,))
#split the data into groups
splitted_time = tf.split(time, groups, axis=0)
splitted_predict = tf.split(predict, groups, axis=0)
splitted_observed = tf.split(observed, groups, axis=0)
#target is the total loss
target = 0
count = 1
batch_size = len(splitted_time)
#for each group, calculate the CoxPH loss and add it into target
for i in range(batch_size):
sorted_time = tf.sort(splitted_time[i], direction='ASCENDING')
sort_arg = tf.argsort(splitted_time[i], direction='ASCENDING')
sorted_predict = tf.exp( tf.gather(splitted_predict[i], sort_arg) )
sorted_observed = tf.gather(splitted_observed[i], sort_arg)
mask = tf.cast( sorted_observed, tf.bool )
L = tf.boolean_mask( tf.log(sorted_predict) - tf.log( tf.cumsum(sorted_predict, reverse=True) ), mask )
loss = -tf.reduce_mean(L)
target += loss
count += 1
return target/count
Also, I cannot find the code implementing the coxph loss in xgboost in the official github. Can anyone point out where the code is so that i can modify it according to my problem?