Unlike the example in the XGBoost documentation, my dataset contains many observations for each patient: whenever a patient's health changes, new measurements are logged.
I'm trying to benefit from that abundance of data, so I fit an XGBoost model on all of it, without differentiating patients from each other.
So my dataset looks like this:
Patient | A | B | C | D | E | F | daysOfObserv | daysToEvent |
---|---|---|---|---|---|---|---|---|
x1 | 1 | 3 | 5 | 2 | 8 | 1 | 10 | 364 |
x1 | 17 | 9 | 2 | 4 | 23 | 1 | 20 | 211 |
x1 | 8 | 6 | 4 | 6 | 3 | 2 | 56 | 30 |
x2 | 3 | 5 | 5 | 4 | 13 | 66 | 13 | 121 |
I drop the `Patient` column. `daysToEvent` goes to `y_training_value` and is then also dropped before training.

    import xgboost as xgb
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Lower and upper bounds are equal, i.e. every label is treated as uncensored
    y_train_lower_bound = df['daysToEvent'].values
    y_train_upper_bound = df['daysToEvent'].values

    # Drop the patient ID and the target so that only features remain
    df = df.drop(['Patient', 'daysToEvent'], axis=1)

    x_train_xgb = xgb.DMatrix(df)
    x_train_xgb.set_float_info('label_lower_bound', y_train_lower_bound)
    x_train_xgb.set_float_info('label_upper_bound', y_train_upper_bound)

    params = {'objective': 'survival:aft',
              'eval_metric': 'aft-nloglik',
              'aft_loss_distribution': 'normal',
              'aft_loss_distribution_scale': 1.20,
              'tree_method': 'hist', 'learning_rate': 0.05, 'max_depth': 2}

    # The only evaluation set is the training data itself
    bst = xgb.train(params, x_train_xgb, num_boost_round=100,  # feval=rmsle,
                    evals=[(x_train_xgb, 'train')])

Output:
[0] train-aft-nloglik:15.65987
[1] train-aft-nloglik:14.39717
[2] train-aft-nloglik:13.25554
[3] train-aft-nloglik:12.22325
[4] train-aft-nloglik:11.28966
[5] train-aft-nloglik:10.44522
[6] train-aft-nloglik:9.68131
[7] train-aft-nloglik:8.99034
[8] train-aft-nloglik:8.36482
[9] train-aft-nloglik:7.79868
[10] train-aft-nloglik:7.28619
[11] train-aft-nloglik:6.82221
[12] train-aft-nloglik:6.40206
[13] train-aft-nloglik:6.02054
[14] train-aft-nloglik:5.67580
[15] train-aft-nloglik:5.36259
[16] train-aft-nloglik:5.07880
[17] train-aft-nloglik:4.82224
[18] train-aft-nloglik:4.58903
[19] train-aft-nloglik:4.37760
[20] train-aft-nloglik:4.18603
[21] train-aft-nloglik:4.01213
[22] train-aft-nloglik:3.85469
[23] train-aft-nloglik:3.71190
[24] train-aft-nloglik:3.58228
...
[96] train-aft-nloglik:2.27551
[97] train-aft-nloglik:2.27501
[98] train-aft-nloglik:2.27399
[99] train-aft-nloglik:2.27356
But later, at the prediction stage, I don't get satisfactory results. Is it because of my approach of fitting multiple rows for each patient? What would be the correct approach: only one observation per patient in XGBoost? Do I need to look for another model to fit multiple observations for one subject?
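
The log above only reports the loss on the training rows, so it cannot show how the model behaves on patients it has not seen. Because rows from the same patient are correlated, one check that might help is to hold out whole patients rather than individual rows. The following is a minimal sketch of such a split in plain Python; `split_by_patient`, `valid_fraction` and `seed` are hypothetical names introduced here for illustration and are not part of XGBoost or pandas.

```python
import random
from collections import defaultdict


def split_by_patient(patient_ids, valid_fraction=0.25, seed=0):
    """Return (train_idx, valid_idx) row indices with no patient in both sets.

    Hypothetical helper: it only groups row positions by patient ID and
    assigns each whole group to one side of the split.
    """
    rows_by_patient = defaultdict(list)
    for row_idx, pid in enumerate(patient_ids):
        rows_by_patient[pid].append(row_idx)

    patients = sorted(rows_by_patient)
    if len(patients) < 2:
        raise ValueError("need at least two patients to hold one out")

    # Shuffle patients (not rows) with a fixed seed for reproducibility
    random.Random(seed).shuffle(patients)

    # Hold out at least one patient, and keep at least one for training
    n_valid = round(len(patients) * valid_fraction)
    n_valid = min(max(1, n_valid), len(patients) - 1)
    valid_patients = set(patients[:n_valid])

    train_idx, valid_idx = [], []
    for pid, rows in rows_by_patient.items():
        target = valid_idx if pid in valid_patients else train_idx
        target.extend(rows)
    return sorted(train_idx), sorted(valid_idx)


# Patient column from the example table in the question
patient_ids = ["x1", "x1", "x1", "x2"]
train_idx, valid_idx = split_by_patient(patient_ids, valid_fraction=0.5)
```

With the data frame from the question, the two index lists could be passed to `df.iloc[...]` before the `Patient` column is dropped, giving one `DMatrix` for training and one for validation; adding the validation matrix to `evals` would then report `aft-nloglik` on held-out patients next to the training value.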