Survival analysis with multiple observations for each object

vzhilov · October 2, 2022, 4:05am

Unlike the example in the XGBoost documentation, my dataset contains many observations for each patient: whenever a patient's health changes, new measurements are logged.

I'm trying to take advantage of that abundance of data, so I fit the XGBoost model on all rows, without distinguishing patients from each other.

So my dataset looks like this:

Patient	A	B	C	D	E	F	daysOfObserv	daysToEvent
x1	1	3	5	2	8	1	10	364
x1	17	9	2	4	23	1	20	211
x1	8	6	4	6	3	2	56	30
x2	3	5	5	4	13	66	13	121

I drop the Patient column. daysToEvent is used as the training label (as both the lower and the upper bound) and is then also dropped from the features before training.

import xgboost as xgb
import pandas as pd

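df = pd.read_csv('data.csv')

# Every row is treated as uncensored, so both bounds equal the observed time to event.
y_train_lower_bound = df['daysToEvent'].values
y_train_upper_bound = df['daysToEvent'].values

# Drop the patient identifier and the target before building the feature matrix.
df = df.drop(['Patient', 'daysToEvent'], axis=1)
x_train = xgb.DMatrix(df)
x_train.set_float_info('label_lower_bound', y_train_lower_bound)
x_train.set_float_info('label_upper_bound', y_train_upper_bound)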

params = {'objective': 'survival:aft',
          'eval_metric': 'aft-nloglik',
          'aft_loss_distribution': 'normal',
          'aft_loss_distribution_scale': 1.20,
          'tree_method': 'hist', 'learning_rate': 0.05, 'max_depth': 2}
bst = xgb.train(params, x_train, num_boost_round=100,
                evals=[(x_train, 'train')])

Output:
[0]	train-aft-nloglik:15.65987
[1]	train-aft-nloglik:14.39717
[2]	train-aft-nloglik:13.25554
[3]	train-aft-nloglik:12.22325
[4]	train-aft-nloglik:11.28966
[5]	train-aft-nloglik:10.44522
[6]	train-aft-nloglik:9.68131
[7]	train-aft-nloglik:8.99034
[8]	train-aft-nloglik:8.36482
[9]	train-aft-nloglik:7.79868
[10]	train-aft-nloglik:7.28619
[11]	train-aft-nloglik:6.82221
[12]	train-aft-nloglik:6.40206
[13]	train-aft-nloglik:6.02054
[14]	train-aft-nloglik:5.67580
[15]	train-aft-nloglik:5.36259
[16]	train-aft-nloglik:5.07880
[17]	train-aft-nloglik:4.82224
[18]	train-aft-nloglik:4.58903
[19]	train-aft-nloglik:4.37760
[20]	train-aft-nloglik:4.18603
[21]	train-aft-nloglik:4.01213
[22]	train-aft-nloglik:3.85469
[23]	train-aft-nloglik:3.71190
[24]	train-aft-nloglik:3.58228
...
[96]	train-aft-nloglik:2.27551
[97]	train-aft-nloglik:2.27501
[98]	train-aft-nloglik:2.27399
[99]	train-aft-nloglik:2.27356
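
The log above is computed on the training rows only, so it cannot show how the model behaves on patients it has not seen. A minimal sketch of a patient-level hold-out, assuming pandas and the column names from my table; split_by_patient, holdout_fraction and seed are hypothetical names introduced only for this illustration:

```python
import random

import pandas as pd

# Toy rows taken from the table in this post (feature columns A..F omitted).
df = pd.DataFrame({
    "Patient": ["x1", "x1", "x1", "x2"],
    "daysOfObserv": [10, 20, 56, 13],
    "daysToEvent": [364, 211, 30, 121],
})


def split_by_patient(frame, holdout_fraction=0.5, seed=0):
    """Hypothetical helper: hold out whole patients, never individual rows."""
    patients = sorted(set(frame["Patient"]))
    random.Random(seed).shuffle(patients)
    n_holdout = max(1, round(len(patients) * holdout_fraction))
    holdout = set(patients[:n_holdout])
    is_holdout = frame["Patient"].isin(holdout)
    return frame[~is_holdout], frame[is_holdout]


train_df, valid_df = split_by_patient(df)
```

Each part could then be turned into its own DMatrix and the second one passed to evals, so the log reports a loss on unseen patients as well.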

But later, at the prediction stage, I don't get satisfactory results. Is it because of my approach of fitting multiple rows for each patient? What would be the correct approach: only one observation per patient in XGBoost? Or do I need to look for another model that can fit multiple observations for one object?
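
For reference, this is roughly how I would reduce the data to one observation per patient if that turns out to be the way to go. It is only a sketch, assuming pandas and that daysOfObserv grows with each new measurement; keeping the latest row rather than the first is my own assumption, not something the documentation prescribes:

```python
import pandas as pd

# The rows from the table in this post.
df = pd.DataFrame({
    "Patient": ["x1", "x1", "x1", "x2"],
    "A": [1, 17, 8, 3],
    "B": [3, 9, 6, 5],
    "C": [5, 2, 4, 5],
    "D": [2, 4, 6, 4],
    "E": [8, 23, 3, 13],
    "F": [1, 1, 2, 66],
    "daysOfObserv": [10, 20, 56, 13],
    "daysToEvent": [364, 211, 30, 121],
})

# For each patient, find the index of the row with the largest daysOfObserv
# (assumed here to be the most recent measurement) and keep only those rows.
latest_idx = df.groupby("Patient")["daysOfObserv"].idxmax()
one_per_patient = df.loc[latest_idx].reset_index(drop=True)
```

With the table above this keeps the daysOfObserv = 56 row for x1 and the single row for x2.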