After reading this tutorial, you will be able to calculate the sGIST for different traits and mGIST for the whole trial.
First, let’s get started with importing libraries:
import os
import pandas as pd
import numpy as np
from sklearn.svm import SVR
import functools
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
%matplotlib inline
pd.set_option('display.max_rows', 500)
Import and explore data
Load data as a dataframe:
# Get the absolute path of the file
file_path = os.path.abspath("CRC_AE_FINAL.xlsx")
# Load data as a dataframe
tp_data = pd.read_excel(file_path)
tp_data.shape
Take a look at the first five rows of the dataframe:
tp_data.head()
Data Preprocessing
Replace all the Null value to Nan:
tp_data = tp_data.replace('NULL', np.nan)
Transfer the categorical data columns to category data type:
for col in [
'coagulation',
'brain_metastases',
'angina',
'MI',
'HF',
'ureteral_stents',
'pregnant',
'Oxaliplatin']:
tp_data[col] = tp_data[col].astype('category')
Exploring missing values and handling missing values
def check_missing(data):
total = data.isnull().sum().sort_values(ascending=False)
percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percentage*100], axis=1, keys=['# of Missing', 'percentage %'])
return missing_data
print(check_missing(tp_data))
Note that we have 2772 observations in total.
We can see from the output that column ‘6301-6’, ‘1920-8′,’SYSTOLIC’ and ‘DIASTOLIC’ are missing most of the values, while other columns have no missing values.
Dealing with missing values
imp = SimpleImputer(missing_values=np.nan, strategy='median')
# Filling these missing values with their median
imp_list = ['6301-6','1920-8','SYSTOLIC','DIASTOLIC']
imp.fit(tp_data[imp_list])
tp_data_imputed = tp_data
tp_data_imputed[imp_list] = imp.transform(tp_data[imp_list])
Ceck the missing values again
print(check_missing(tp_data_imputed))
Check one of the imputed column
tp_data_imputed['1920-8'].head()
Calculation
Define a class that is able to calculate part of GIST
# Eligibility class
class Elig:
def __init__(self,
var_name: str = None, # variable name
is_range: bool = True, # True means continuous variable
low_bound: float = None, # lower bound for eligible creteria
upp_bound: float = None, # upper bound for eligible creteria
equal_to: float = None): # For categorical variables, 0 for presence, 1 for absence
self.var_name = var_name
self.is_range = is_range
self.low_bound = low_bound
self.upp_bound = upp_bound
self.equal_to = equal_to
# Check if a patient meets a certain criterion or not: elig_status return True or False.
def get_elig(self, data):
if self.is_range:
if self.low_bound is None and self.upp_bound is not None:
elig_status = (data[self.var_name] <= self.upp_bound)
if self.low_bound is not None and self.upp_bound is None:
elig_status = (data[self.var_name] >= self.low_bound)
if self.low_bound is not None and self.upp_bound is not None:
elig_status = ((data[self.var_name] >= self.low_bound) & (data[self.var_name] <= self.upp_bound))
else:
elig_status = (data[self.var_name] == self.equal_to)
return elig_status
# Calculate the stringency of a certain criterion
def get_stringency(self, data):
if self.is_range:
elig_n = data.loc[(data[self.var_name] >= self.low_bound) & (data[self.var_name] <= self.upp_bound), self.var_name].shape[0]
else:
elig_n = data.loc[(data[self.var_name] == self.equal_to), self.var_name].shape[0]
stringency = 1 - elig_n / data.shape[0]
return max(stringency, 0.01)
# Calculate sGIST (single trait GIST) for a particular criterion
def get_gist(self, data):
gist = sum(data.loc[self.get_elig(data), 'f_weight']) / sum(data.f_weight)
return gist
Set up the eligibility criteria used for calculating GIST 2.0
Tle_AGE_LAST_PX = Elig(var_name='AGE_LAST_PX',low_bound=18)
T1e_brain_metastases = Elig(var_name='brain_metastases',is_range=False,equal_to=0)
T1e_coagulation = Elig(var_name='coagulation',is_range=False,equal_to=0)
T1e_angina = Elig(var_name='angina',is_range=False,equal_to=0)
T1e_MI = Elig(var_name='MI',is_range=False,equal_to=0)
T1e_HF = Elig(var_name='HF',is_range=False,equal_to=0)
T1e_ureteral_stents = Elig(var_name='ureteral_stents',is_range=False,equal_to=0)
T1e_pregnant = Elig(var_name='pregnant',is_range=False,equal_to=0)
T1e_Oxaliplatin = Elig(var_name='Oxaliplatin',is_range=False,equal_to=1)
T1e_DIASTOLIC = Elig(var_name='DIASTOLIC',upp_bound=100)
T1e_SYSTOLIC = Elig(var_name='SYSTOLIC',upp_bound=150)
T1e_1920_8 = Elig(var_name='1920-8',upp_bound=125)
T1e_6301_6 = Elig(var_name='6301-6',upp_bound=1.5)
Get the list of eligibility criteria
# Categorical eligible traits list
cate_elig_list = [T1e_brain_metastases,T1e_coagulation,T1e_angina,T1e_MI,T1e_HF,T1e_ureteral_stents,T1e_pregnant]
# Continuous eligible traits list
cont_elig_list = [Tle_AGE_LAST_PX]
# Eligible list
elig_list = cate_elig_list + cont_elig_list
Drop the rows with missing values
TP = tp_data_imputed[[
'AGE_LAST_PX',
'brain_metastases',
'coagulation',
'angina',
'MI',
'HF',
'ureteral_stents',
'pregnant']].dropna()
TP.shape
After dropping rows with missing values, we found out that the number of rows remain the same, which means that there were no missing values. Now, let’s take a look at what the dataframe look like.
TP.head()
Now we get data ready and let’s check out how to calculate GIST.
GIST Calculation Algorithm
1. Standardize the summarized continuous traits of every patient using the mean and the standard deviation (SD). For any continuous trait $f_{j}$, let $\mu_{j}$and $\sigma_{j}$ denote its mean and SD over the EP patients. Then the standardized patient is $\tilde{P}_{i}=\left(\tilde{f}_{1}^{i}, \tilde{f}_{2}^{i}, \ldots, \tilde{f}_{n}^{i}\right)$, with $\tilde{f}_{j}^{i}=\frac{f_{j}^{i}-\mu_{j}}{\sigma_{j}}$ , for continuous traits and remaining unchanged $\left(\tilde{f}_{j}^{i}=f_{j}^{i}\right)$.
# normalizing all the continuous features
for col in TP._get_numeric_data().columns:
TP[col+'_z'] = (TP[col] - np.mean(TP[col]))/np.std(TP[col])
2. Use the stringency-based method described above to calculate the significance $s_{j}$, of each trait. For a patient with standardized traits $\hat{P}_{i}$, compute its significance-scaled form by $\hat{P}_{i}=\left(\tilde{f}_{1}^{i}, \ldots, \hat{f}_{n}^{i}\right)=\left(s_{1} \tilde{f}_{1}^{i}, \ldots, s_{n} \tilde{f}_{n}^{i}\right)$ .
# calculate stringency for each eligibility
# get_stringency function was defined in the Elig class above.
for elig in cont_elig_list:
s = elig.get_stringency(TP)
# apply stringency to normalized values
for elig in cont_elig_list:
TP[elig.var_name+'_z_s'] = elig.get_stringency(TP) * TP[elig.var_name+'_z']
3.Use the standardized and significance-scaled patients $\hat{P}_{i}$ to compute a nonlinear regression hyper-surface F with one of the continuous traits (e.g. $\hat{f}_{n}$m) designated as the dependent variable (and all other traits as independent
variables), i.e. $\hat{f}_{n}=F\left(\hat{f}_{1}, \ldots, \hat{f}_{n-1}\right)$
4.For every patient (standardized and significance-scaled) $\hat{P}_{i}$, calculate its residual distance
$d_{i}=/ F\left(\hat{f}_{1}, \ldots, \hat{f}_{n-1}\right)-\hat{f}_{n} /$ from the hyper-surface $F$ . Here we use SVM regression model to achieve step 3 and 4.
# build svm regression model and calculate residual
x = np.repeat(1,TP.shape[0]).reshape((TP.shape[0],1))
y = np.array(TP.AGE_LAST_PX_z_s)
gist_svm = SVR()
gist_svm.fit(x, y)
y_pred = gist_svm.predict(x)
5. Assign a weight $w_{i}$ to every patient $P_{i}$ that is inversely proportional to the patient’s distance di from $F$, i.e. $w_{i}=\frac{1}{1+d_{i}}$ . (Note that $w_{i} = 1$ when $d_{i} = 0$).
TP['f_weight'] = 1 / (1+abs(y_pred - y))
6. For a trial T, let the inclusion range (as given in the eligibility criteria) for trait $f_{j}$ be denoted by the set $E_{j}(T)$. The notation
implies that
the jth trait of patient $P_{i}$ falls within the corresponding inclusion range of the eligibility criteria. Now, the single-trait GIST score $sGIST_{j}$ for trait $f_{j}$
is computed as,
$$s G I S T_{j}(T)=\frac{\sum_{f_{j}^{i} \in E_{j}(T)} w_{i}}{\sum_{f_{j}^{i}} w_{i}}$$
# Calculate single GIST score
# get_gist function is defined in Elig class
for elig in elig_list:
print(elig.var_name, ': ', elig.get_gist(TP))
7. The multi-dimensional (overall) eligibility criteria of T is the logical combination of the eligibility criteria of the individual traits. Let the volume defined by the multi-dimensional eligibility criteria be denoted by Eall(T) (e.g. $E_{all}(T)$ = $E_{1}(T)$ AND [$E_{2}(T)$ OR $E_{3}(T)$]). A patient $P_{i}$ satisfying the overall eligibility criteria for T is said to belong to this volume i.e. $P_{i}$ ∈ Eall(T). Now, the multiple-trait GIST score mGIST of T is
computed as,
$$
m G I S T(T)=\frac{\sum_{P_{i} \in E_{a l l}(T)} w_{i}}{\sum_{P_{i} \in E P} w_{i}}
$$
# Calculate multi GIST score
mgist_e_list = [e.get_elig(TP) for e in elig_list]
mgist_e_subset = functools.reduce(lambda x, y: x&y, mgist_e_list)
gmulti = sum(TP.loc[mgist_e_subset, 'f_weight']) / sum(TP.f_weight)
print('mGIST: ',gmulti)