Survival:cox and panel data setup and xgboost

xgboost_fan · April 26, 2020, 8:03pm

I am looking for some help on how to set up my data to use XGBoost properly with panel data.

I find it difficult to see how to set up my data in a way analogous to https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html or https://github.com/dmlc/xgboost/blob/master/demo/aft_survival/aft_survival_demo.py
as every given example is a cross-sectional data set.

Attached is an example of a panel survival data set that has been set up for traditional Cox proportional hazards models: it contains the event indicator y (demend) and the start and end times that the survival functions expect:

km.curve <- Surv(time = df$start_time, time2 = df$end_time, event = df$demend)

import pandas as pd

# load data
df = pd.read_csv('https://raw.githubusercontent.com/afogarty85/replications/master/Maeda/maeda2.csv')

df[['demend', 'start_time', 'end_time']].head(10)

   demend  start_time  end_time
0       0           0         1
1       0           1         2
2       0           2         3
3       0           3         4
4       0           4         5
5       0           5         6
6       0           6         7
7       0           7         8
8       0           8         9
9       0           9        10

I am wondering how to set up my lower and upper bounds and DMatrix objects. Thanks for your time and consideration!

hcho3 · April 27, 2020, 1:31am

I’m not aware of any use case of XGBoost for time series (panel) data. XGBoost optimizes a likelihood function over the training data, and the training data is assumed to be i.i.d. Hence the tutorial shows a cross-sectional example.

One solution is to use XGBoost to fit the transition function f so that f(x_{i-1}) = x_i, where x_i denotes the values of the covariates and the survival time at step i. (Multi-output regression is not supported, so you’d need to fit multiple models.) Time is to be treated as discrete steps. You will be assuming that the data has somehow reached a steady state.
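As a rough illustration of the data preparation this implies, the sketch below pairs each row with the next period's row within the same unit, so that one model per target column could then be fit on the resulting pairs. The column names (unit, period, x), the toy values, and the helper make_transition_pairs are all made up for this example and do not come from the data set above.

```python
import pandas as pd

# Toy panel: two units observed over consecutive periods (made-up values).
panel = pd.DataFrame({
    "unit": ["a", "a", "a", "b", "b"],
    "period": [0, 1, 2, 0, 1],
    "x": [1.0, 1.5, 2.0, 0.5, 0.7],
})


def make_transition_pairs(df, unit_col, time_col, value_cols):
    """Pair each row (x_{i-1}) with the next period's values (x_i) within a unit.

    Hypothetical helper for illustration; not part of XGBoost or pandas.
    """
    df = df.sort_values([unit_col, time_col])
    # shift(-1) inside each unit pulls the following period's values up one row.
    nxt = df.groupby(unit_col)[value_cols].shift(-1).add_suffix("_next")
    pairs = pd.concat([df, nxt], axis=1)
    # The last period of each unit has no successor, so it is dropped.
    return pairs.dropna(subset=list(nxt.columns)).reset_index(drop=True)


pairs = make_transition_pairs(panel, "unit", "period", ["x"])
print(pairs)
```

Each *_next column would be the target of its own model, since a single model cannot predict several outputs here.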

xgboost_fan · April 27, 2020, 1:36am

Thanks for the information!

avinashbarnwal · April 27, 2020, 1:54am

Generally, panel data in survival modeling is synonymous with survival modeling with time-varying covariates, and time-varying covariates are handled differently. For more information, see https://www.jstatsoft.org/article/view/v061c01/v61c01.pdf. I don’t think we can model time-varying covariates (panel data survival modeling) using the current framework.

xgboost_fan · April 28, 2020, 7:31pm

I am wondering if anyone can provide recommendations on how to generate predictions for survival, like the one here: https://github.com/dmlc/xgboost/blob/master/demo/aft_survival/aft_survival_viz_demo.py (it would be great if this could be incorporated into something more general for use)

I am having trouble editing the code to get it to work for a matrix different from the given size:

X = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))

This line in particular seems to be the right metric to generate, but I keep getting length mismatch errors:

acc = np.sum(np.logical_and(y_pred >= y_lower, y_pred <= y_upper)/len(X) * 100)

hcho3 · April 28, 2020, 8:24pm

@xgboost_fan We have a tutorial available: https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html#how-to-use. The length of y_lower and y_upper needs to match the number of rows in the data matrix X.
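To make the length requirement concrete, here is a minimal NumPy-only sketch. It assumes right-censored data with one row per subject, and follows the tutorial's convention that an observed event has lower bound equal to upper bound, while a censored row has an upper bound of +inf. The helpers make_aft_bounds and interval_accuracy and the toy numbers are made up for this example. In the tutorial the two bound arrays are then attached to the DMatrix with set_float_info('label_lower_bound', ...) and set_float_info('label_upper_bound', ...).

```python
import numpy as np


def make_aft_bounds(time, event):
    """Return (y_lower, y_upper) for right-censored data, one entry per row.

    Hypothetical helper: observed events get lower == upper == time,
    censored rows get upper = +inf.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    y_lower = time.copy()
    y_upper = np.where(event, time, np.inf)
    return y_lower, y_upper


def interval_accuracy(y_pred, y_lower, y_upper):
    """Percentage of predictions that fall inside [y_lower, y_upper]."""
    y_pred, y_lower, y_upper = (np.asarray(a) for a in (y_pred, y_lower, y_upper))
    # All three arrays must have one entry per row of the data matrix.
    if not (y_pred.shape == y_lower.shape == y_upper.shape):
        raise ValueError(
            f"shape mismatch: {y_pred.shape}, {y_lower.shape}, {y_upper.shape}"
        )
    inside = np.logical_and(y_pred >= y_lower, y_pred <= y_upper)
    return inside.mean() * 100


# Toy data: four rows, two observed events and two censored rows.
time = [5.0, 3.0, 8.0, 2.0]
event = [1, 0, 1, 0]
y_lower, y_upper = make_aft_bounds(time, event)

# Stand-in for model predictions, one per row.
y_pred = np.array([5.0, 4.0, 7.0, 1.0])
acc = interval_accuracy(y_pred, y_lower, y_upper)
print(acc)
```

If X gains rows while y_lower and y_upper keep their old length, the shape check fails immediately, which is easier to trace than a mismatch error raised further downstream.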