Tutorial: GIST 2.0

In this tutorial, we’ll use an example clinical trial(NCTID: NCT00025337) to demonstrate the calculation of GIST 2.0 for this trial.

After reading this tutorial, you will be able to calculate the sGIST for different traits and mGIST for the whole trial.

First, let’s get started with importing libraries:

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.svm import SVR
import functools
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
%matplotlib inline
pd.set_option('display.max_rows', 500)

Import and explore data

Load data as a dataframe:

In [2]:
# Get the absolute path of the file
file_path = os.path.abspath("CRC_AE_FINAL.xlsx")
# Load data as a dataframe
tp_data = pd.read_excel(file_path)
In [3]:
tp_data.shape
Out[3]:
(2772, 13)

Take a look at the first five rows of the dataframe:

In [4]:
tp_data.head()
Out[4]:
AGE_LAST_PX coagulation brain_metastases angina MI HF ureteral_stents pregnant 1920-8 6301-6 Oxaliplatin DIASTOLIC SYSTOLIC
0 74 0 0 0 0 0 0 0 NaN NaN 1 NaN NaN
1 74 0 0 0 0 0 0 0 NaN NaN 1 61.0 143.0
2 61 0 0 0 1 0 0 0 NaN NaN 1 NaN NaN
3 40 0 0 0 0 0 0 0 NaN NaN 0 NaN NaN
4 68 0 0 0 1 0 0 0 NaN NaN 1 73.0 162.0

Data Preprocessing

Replace all the Null value to Nan:

In [5]:
tp_data = tp_data.replace('NULL', np.nan)

Transfer the categorical data columns to category data type:

In [6]:
for col in [
 'coagulation',
 'brain_metastases',
 'angina',
 'MI',
 'HF',
 'ureteral_stents',
 'pregnant',
 'Oxaliplatin']:
    
    tp_data[col] = tp_data[col].astype('category')


Exploring missing values and handling missing values

In [7]:
def check_missing(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percentage*100], axis=1, keys=['# of Missing', 'percentage %'])
    return missing_data
In [8]:
print(check_missing(tp_data))
                  # of Missing  percentage %
6301-6                    2671     96.356421
1920-8                    2667     96.212121
SYSTOLIC                  2397     86.471861
DIASTOLIC                 2397     86.471861
Oxaliplatin                  0      0.000000
pregnant                     0      0.000000
ureteral_stents              0      0.000000
HF                           0      0.000000
MI                           0      0.000000
angina                       0      0.000000
brain_metastases             0      0.000000
coagulation                  0      0.000000
AGE_LAST_PX                  0      0.000000

Note that we have 2772 observations in total.
We can see from the output that column ‘6301-6’, ‘1920-8′,’SYSTOLIC’ and ‘DIASTOLIC’ are missing most of the values, while other columns have no missing values.

Dealing with missing values

In [9]:
imp = SimpleImputer(missing_values=np.nan, strategy='median')
# Filling these missing values with their median
imp_list = ['6301-6','1920-8','SYSTOLIC','DIASTOLIC']
imp.fit(tp_data[imp_list])
tp_data_imputed = tp_data
tp_data_imputed[imp_list] = imp.transform(tp_data[imp_list])

Ceck the missing values again

In [10]:
print(check_missing(tp_data_imputed))
                  # of Missing  percentage %
SYSTOLIC                     0           0.0
DIASTOLIC                    0           0.0
Oxaliplatin                  0           0.0
6301-6                       0           0.0
1920-8                       0           0.0
pregnant                     0           0.0
ureteral_stents              0           0.0
HF                           0           0.0
MI                           0           0.0
angina                       0           0.0
brain_metastases             0           0.0
coagulation                  0           0.0
AGE_LAST_PX                  0           0.0

Check one of the imputed column

In [11]:
tp_data_imputed['1920-8'].head()
Out[11]:
0    23.0
1    23.0
2    23.0
3    23.0
4    23.0
Name: 1920-8, dtype: float64

Calculation

Define a class that is able to calculate part of GIST

In [12]:
# Eligibility class
class Elig:
    def __init__(self,
                 var_name: str = None, # variable name
                 is_range: bool = True, # True means continuous variable
                 low_bound: float = None, # lower bound for eligible creteria
                 upp_bound: float = None, # upper bound for eligible creteria
                 equal_to: float = None): # For categorical variables, 0 for presence, 1 for absence
        self.var_name = var_name
        self.is_range = is_range
        self.low_bound = low_bound
        self.upp_bound = upp_bound
        self.equal_to = equal_to
 
    # Check if a patient meets a certain criterion or not: elig_status return True or False.
    def get_elig(self, data):
        if self.is_range:
            if self.low_bound is None and self.upp_bound is not None:
                elig_status = (data[self.var_name] <= self.upp_bound)
            if self.low_bound is not None and self.upp_bound is None:
                elig_status = (data[self.var_name] >= self.low_bound)
            if self.low_bound is not None and self.upp_bound is not None:
                elig_status = ((data[self.var_name] >= self.low_bound) & (data[self.var_name] <= self.upp_bound))
        else:
            elig_status = (data[self.var_name] == self.equal_to)
        return elig_status
    
    # Calculate the stringency of a certain criterion
    def get_stringency(self, data):
        if self.is_range:
            elig_n = data.loc[(data[self.var_name] >= self.low_bound) & (data[self.var_name] <= self.upp_bound), self.var_name].shape[0]
        else:
            elig_n = data.loc[(data[self.var_name] == self.equal_to), self.var_name].shape[0]
        stringency = 1 - elig_n / data.shape[0]
        return max(stringency, 0.01)
    
    # Calculate sGIST (single trait GIST) for a particular criterion 
    def get_gist(self, data):
        gist = sum(data.loc[self.get_elig(data), 'f_weight']) / sum(data.f_weight)
        return gist

Set up the eligibility criteria used for calculating GIST 2.0

In [13]:
Tle_AGE_LAST_PX = Elig(var_name='AGE_LAST_PX',low_bound=18)

T1e_brain_metastases = Elig(var_name='brain_metastases',is_range=False,equal_to=0)

T1e_coagulation = Elig(var_name='coagulation',is_range=False,equal_to=0)

T1e_angina = Elig(var_name='angina',is_range=False,equal_to=0)

T1e_MI = Elig(var_name='MI',is_range=False,equal_to=0)

T1e_HF = Elig(var_name='HF',is_range=False,equal_to=0)

T1e_ureteral_stents = Elig(var_name='ureteral_stents',is_range=False,equal_to=0)

T1e_pregnant = Elig(var_name='pregnant',is_range=False,equal_to=0)

T1e_Oxaliplatin = Elig(var_name='Oxaliplatin',is_range=False,equal_to=1)

T1e_DIASTOLIC = Elig(var_name='DIASTOLIC',upp_bound=100)

T1e_SYSTOLIC = Elig(var_name='SYSTOLIC',upp_bound=150)

T1e_1920_8 = Elig(var_name='1920-8',upp_bound=125)

T1e_6301_6 = Elig(var_name='6301-6',upp_bound=1.5)

Get the list of eligibility criteria

In [14]:
# Categorical eligible traits list
cate_elig_list = [T1e_brain_metastases,T1e_coagulation,T1e_angina,T1e_MI,T1e_HF,T1e_ureteral_stents,T1e_pregnant]
# Continuous eligible traits list
cont_elig_list = [Tle_AGE_LAST_PX]
# Eligible list
elig_list = cate_elig_list + cont_elig_list

Drop the rows with missing values

In [15]:
TP = tp_data_imputed[[
 'AGE_LAST_PX',
 'brain_metastases',
 'coagulation',
 'angina',
 'MI',
 'HF',
 'ureteral_stents',
 'pregnant']].dropna()

TP.shape
Out[15]:
(2772, 8)

After dropping rows with missing values, we found out that the number of rows remain the same, which means that there were no missing values. Now, let’s take a look at what the dataframe look like.

In [16]:
TP.head()
Out[16]:
AGE_LAST_PX brain_metastases coagulation angina MI HF ureteral_stents pregnant
0 74 0 0 0 0 0 0 0
1 74 0 0 0 0 0 0 0
2 61 0 0 0 1 0 0 0
3 40 0 0 0 0 0 0 0
4 68 0 0 0 1 0 0 0

Now we get data ready and let’s check out how to calculate GIST.

GIST Calculation Algorithm

1. Standardize the summarized continuous traits of every patient using the mean and the standard deviation (SD). For any continuous trait $f_{j}$, let $\mu_{j}$and $\sigma_{j}$ denote its mean and SD over the EP patients. Then the standardized patient is $\tilde{P}_{i}=\left(\tilde{f}_{1}^{i}, \tilde{f}_{2}^{i}, \ldots, \tilde{f}_{n}^{i}\right)$, with $\tilde{f}_{j}^{i}=\frac{f_{j}^{i}-\mu_{j}}{\sigma_{j}}$ , for continuous traits and remaining unchanged $\left(\tilde{f}_{j}^{i}=f_{j}^{i}\right)$.

In [17]:
# normalizing all the continuous features
for col in TP._get_numeric_data().columns:
    TP[col+'_z'] = (TP[col] - np.mean(TP[col]))/np.std(TP[col])

2. Use the stringency-based method described above to calculate the significance $s_{j}$, of each trait. For a patient with standardized traits $\hat{P}_{i}$, compute its significance-scaled form by $\hat{P}_{i}=\left(\tilde{f}_{1}^{i}, \ldots, \hat{f}_{n}^{i}\right)=\left(s_{1} \tilde{f}_{1}^{i}, \ldots, s_{n} \tilde{f}_{n}^{i}\right)$ .

In [18]:
# calculate stringency for each eligibility
# get_stringency function was defined in the Elig class above.
for elig in cont_elig_list:
    s = elig.get_stringency(TP)
In [19]:
# apply stringency to normalized values
for elig in cont_elig_list:
    TP[elig.var_name+'_z_s'] = elig.get_stringency(TP) * TP[elig.var_name+'_z']

3.Use the standardized and significance-scaled patients $\hat{P}_{i}$ to compute a nonlinear regression hyper-surface F with one of the continuous traits (e.g. $\hat{f}_{n}$m) designated as the dependent variable (and all other traits as independent
variables), i.e. $\hat{f}_{n}=F\left(\hat{f}_{1}, \ldots, \hat{f}_{n-1}\right)$

4.For every patient (standardized and significance-scaled) $\hat{P}_{i}$, calculate its residual distance
$d_{i}=/ F\left(\hat{f}_{1}, \ldots, \hat{f}_{n-1}\right)-\hat{f}_{n} /$ from the hyper-surface $F$ . Here we use SVM regression model to achieve step 3 and 4.

In [20]:
# build svm regression model and calculate residual
x = np.repeat(1,TP.shape[0]).reshape((TP.shape[0],1))
y = np.array(TP.AGE_LAST_PX_z_s)
gist_svm = SVR()
gist_svm.fit(x, y)
y_pred = gist_svm.predict(x)
C:\Users\arsla\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

5. Assign a weight $w_{i}$ to every patient $P_{i}$ that is inversely proportional to the patient’s distance di from $F$, i.e. $w_{i}=\frac{1}{1+d_{i}}$ . (Note that $w_{i} = 1$ when $d_{i} = 0$).

In [21]:
TP['f_weight'] = 1 / (1+abs(y_pred - y))

6. For a trial T, let the inclusion range (as given in the eligibility criteria) for trait $f_{j}$ be denoted by the set $E_{j}(T)$. The notation
implies that
the jth trait of patient $P_{i}$ falls within the corresponding inclusion range of the eligibility criteria. Now, the single-trait GIST score $sGIST_{j}$ for trait $f_{j}$
is computed as,
$$s G I S T_{j}(T)=\frac{\sum_{f_{j}^{i} \in E_{j}(T)} w_{i}}{\sum_{f_{j}^{i}} w_{i}}$$

In [22]:
# Calculate single GIST score
# get_gist function is defined in Elig class
for elig in elig_list:
    print(elig.var_name, ': ', elig.get_gist(TP))
brain_metastases :  0.9376441124828709
coagulation :  0.943259521364352
angina :  0.9492021075086515
MI :  0.8755589781496852
HF :  0.8858948696226507
ureteral_stents :  1.0
pregnant :  0.9963814556168069
AGE_LAST_PX :  0.9998932390370622

7. The multi-dimensional (overall) eligibility criteria of T is the logical combination of the eligibility criteria of the individual traits. Let the volume defined by the multi-dimensional eligibility criteria be denoted by Eall(T) (e.g. $E_{all}(T)$ = $E_{1}(T)$ AND [$E_{2}(T)$ OR $E_{3}(T)$]). A patient $P_{i}$ satisfying the overall eligibility criteria for T is said to belong to this volume i.e. $P_{i}$ ∈ Eall(T). Now, the multiple-trait GIST score mGIST of T is
computed as,
$$
m G I S T(T)=\frac{\sum_{P_{i} \in E_{a l l}(T)} w_{i}}{\sum_{P_{i} \in E P} w_{i}}
$$

In [23]:
# Calculate multi GIST score
mgist_e_list = [e.get_elig(TP) for e in elig_list]
mgist_e_subset = functools.reduce(lambda x, y: x&y, mgist_e_list)
gmulti = sum(TP.loc[mgist_e_subset, 'f_weight']) / sum(TP.f_weight)
print('mGIST: ',gmulti)
mGIST:  0.695867038771289