Predictive Modeling on Electronic Health Records (EHR) using PyTorch

Last update: Jan 01, 2023

Overview

Although repositories for vision and NLP models are plentiful, we could find very few that apply deep learning to EHRs. Here we open-source our repo, which implements data preprocessing, data loading, and a zoo of common RNN models. The main goal is to lower the barrier to entry for researchers in this field. We do not claim state-of-the-art performance, though our models are quite competitive (a paper describing our work will be available soon).

Building on existing work (e.g., Dr. AI and RETAIN), we represent electronic health records (EHRs) as pickled multi-level (nested) lists, which contain the histories of patients' diagnoses, medications, and various other events. We integrate all relevant information from a patient's history, which allows easy subsetting.

Currently, this repo includes the following predictive models for analyzing and predicting clinical outcomes: Vanilla RNN, GRU, LSTM, Bidirectional RNN, Bidirectional GRU, Bidirectional LSTM, Dilated RNN, Dilated GRU, Dilated LSTM, QRNN, and T-LSTM. Additionally, we provide tutorials comparing their performance to plain logistic regression (LR) and Random Forest (RF).

Pipeline

Primary Results

Note that these results cover two prediction tasks: heart failure (HF) risk and readmission. We showed that simple gated RNNs (GRUs or LSTMs) consistently beat traditional machine learning methods (logistic regression (LR) and Random Forest (RF)). All methods were tuned by Bayesian optimization. All of this is described in this paper.

Folder Organization

ehr_pytorch: main folder with modularized components:
- EHREmb.py: EHR embeddings
- EHRDataloader.py: a separate module for creating batches of preprocessed data, with options including sorting by visit length and shuffling batches before feeding them to a model
- Models.py: multiple different models
- Utils.py
- main.py: main execution file
- tplstm.py: tplstm package file
Data
- toy.train: pickle file of toy data with the same structure (multi-level lists) as our processed Cerner data; it can be used directly with our models for demonstration purposes
Preprocessing
- data_preprocessing_v1.py: preprocesses the source dataset to build the required multi-level input structure (a clear description of how to run this file is in the header of the file)
Tutorials
- RNN_tutorials_toy.ipynb: Jupyter notebook with examples of how to run our models with visuals and/or use our dataloader as a standalone
- HF prediction for Diabetic Patients.ipynb
- Early Readmission v2.ipynb
trained_models examples:
- hf.trainEHRmodel.log: examples of the output of the model
- hf.trainEHRmodel.pth: actual trained model
- hf.trainEHRmodel.st: state dictionary

Data Structure

We followed the data structure used in RETAIN. Encounters may include pharmacy, clinical and microbiology laboratory, admission, and billing information from affiliated patient care locations. All admissions, medication orders and dispensing, laboratory orders, and specimens are date- and time-stamped, providing a temporal relationship between treatment patterns and clinical information. These clinical data are mapped to the most common standards: for example, diagnoses and procedures are mapped to International Classification of Diseases (ICD) codes, medication information includes National Drug Codes (NDCs), and laboratory tests are linked to their LOINC codes.
Our processed pickle data are multi-level lists. From the outermost level inward (assuming we have loaded them as X):
- Outermost level: patient level, e.g. X[0] holds the records for the patient indexed 0
- 2nd level: patient information; X[0][0], X[0][1], and X[0][2] are the patient id, disease status (1: disease, 0: no disease), and records
- 3rd level: a list whose length is the total number of visits. Each element is a pair of lists (described in the 4th level)
- 4th level: for each element of the 3rd-level list:
  - 1st element, e.g. X[0][2][0][0], is a list holding the visit_time (time since the previous visit)
  - 2nd element, e.g. X[0][2][0][1], is a list of the codes corresponding to a single visit
- 5th level: either a visit_time or a single code
An illustration of the data structure is shown below:

In the implementation, the medical codes are tokenized with a unified dictionary for all patients.
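The nested layout and the shared code dictionary described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the patient ids, code strings, and the helper names `build_vocab` and `tokenize` are made up for the example, and reserving id 0 for padding is our choice here, not something the repo is confirmed to do.

```python
# Toy records in the nested layout described above (all values are invented).
# Each patient is [patient_id, label, visits];
# each visit is [[time_since_previous_visit], [codes]].
X = [
    ["p001", 1, [[[0], ["D_428.0", "M_furosemide"]],
                 [[30], ["D_401.9"]]]],
    ["p002", 0, [[[0], ["D_250.00", "D_401.9"]]]],
]

def build_vocab(records):
    """Map every distinct code to an integer id (hypothetical helper).

    Ids start at 1; 0 is left free for padding (an assumption of this sketch).
    """
    vocab = {}
    for _pid, _label, visits in records:
        for _time, codes in visits:
            for code in codes:
                vocab.setdefault(code, len(vocab) + 1)
    return vocab

def tokenize(records, vocab):
    """Return a copy of the records with codes replaced by integer ids."""
    return [
        [pid, label, [[time, [vocab[c] for c in codes]] for time, codes in visits]]
        for pid, label, visits in records
    ]

vocab = build_vocab(X)      # one dictionary shared by all patients
X_tok = tokenize(X, vocab)

patient = X_tok[0]
print(patient[0], patient[1])   # patient id and disease status
print(patient[2][0][0])         # visit_time list of the first visit
print(patient[2][0][1])         # tokenized codes of the first visit
```

Because the dictionary is built once over all patients, the same code always maps to the same id regardless of which patient or visit it appears in.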

Note: as long as your data are in this multi-level list format, you can use our EHRdataloader to generate batches and feed them to your model.
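To give a feel for what batch generation over such multi-level lists involves, here is a minimal pure-Python sketch. It is not the repo's EHRdataloader: the functions `make_batches` and `pad_batch`, the padding id 0, and the toy records are all assumptions made for illustration. It sorts patients by number of visits, cuts them into mini-batches, optionally shuffles the batch order, and pads each batch to a rectangular shape.

```python
import random

def make_batches(records, batch_size, shuffle=True, seed=0):
    """Sort patients by visit count (longest first), cut into mini-batches,
    and optionally shuffle the order of the batches (hypothetical helper)."""
    ordered = sorted(records, key=lambda rec: len(rec[2]), reverse=True)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    if shuffle:
        random.Random(seed).shuffle(batches)
    return batches

def pad_batch(batch, pad_id=0):
    """Pad a batch to a rectangular patients x visits x codes block of ids."""
    max_visits = max(len(rec[2]) for rec in batch)
    max_codes = max(len(codes) for rec in batch for _t, codes in rec[2])
    padded, labels = [], []
    for _pid, label, visits in batch:
        # Pad each visit's code list, then add all-padding visits as needed.
        rows = [codes + [pad_id] * (max_codes - len(codes)) for _t, codes in visits]
        rows += [[pad_id] * max_codes for _ in range(max_visits - len(visits))]
        padded.append(rows)
        labels.append(label)
    return padded, labels

# Toy, already-tokenized records: [patient_id, label, [[[time], [code ids]], ...]]
toy = [
    ["a", 1, [[[0], [1, 2]], [[12], [3]]]],
    ["b", 0, [[[0], [4]]]],
    ["c", 0, [[[0], [2, 3, 5]], [[7], [1]], [[90], [4, 4]]]],
]
batches = make_batches(toy, batch_size=2, shuffle=False)
codes, labels = pad_batch(batches[0])
```

Sorting by visit count before batching keeps sequences of similar length together, which reduces the amount of padding each batch needs.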

Paper Reference

The paper upon which this repo was built.

Versions

This is Version 0.2; more details are in the release notes.

Dependencies

PyTorch 0.4.0 (all models except T-LSTM are compatible with PyTorch 1.4.0; issues that appear with PyTorch 1.5 are resolved in 1.6)
Torchqrnn
Pynvrtc
sklearn
Matplotlib (for visualizations)
tqdm
Python: 3.6+

Usage

For preprocessing, run:

python data_preprocessing.py

The case and control input files are each just a three-column table of the form pt_id | medical_code | visit/event_date
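As a rough sketch of how such three-column tables could be turned into the multi-level list structure, consider the following. It is not the repo's preprocessing script: the function `rows_to_records`, the sample rows, and the choice to measure visit_time in days since the previous visit are assumptions made for illustration, and the actual script may order or encode visits differently.

```python
from collections import defaultdict
from datetime import date

def rows_to_records(rows, label):
    """Group (pt_id, medical_code, visit_date) rows into
    [pt_id, label, [[[days_since_previous_visit], [codes]], ...]] records."""
    by_patient = defaultdict(lambda: defaultdict(list))
    for pt_id, code, visit_date in rows:
        by_patient[pt_id][visit_date].append(code)

    records = []
    for pt_id, visits_by_date in by_patient.items():
        visits, previous = [], None
        for visit_date in sorted(visits_by_date):
            # The first visit has no predecessor, so its gap is set to 0.
            gap = 0 if previous is None else (visit_date - previous).days
            visits.append([[gap], visits_by_date[visit_date]])
            previous = visit_date
        records.append([pt_id, label, visits])
    return records

# Invented sample rows: cases get label 1, controls get label 0.
case_rows = [
    ("p1", "D_428.0", date(2015, 1, 10)),
    ("p1", "M_furosemide", date(2015, 1, 10)),
    ("p1", "D_401.9", date(2015, 2, 9)),
]
control_rows = [("p2", "D_250.00", date(2016, 5, 1))]

X = rows_to_records(case_rows, label=1) + rows_to_records(control_rows, label=0)
```

Codes recorded on the same date are grouped into one visit, so the two rows dated 2015-01-10 become a single visit with two codes.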
To run our models, directly use (you don't need to separately run dataloader, everything can be specified in args here):

python3 main.py -root_dir<'your folder that contains data file(s)'> -files<['filename(train)' 'filename(valid)' 'filename(test)']> -which_model<'RNN'> -optimizer<'adam'> ....(feed as many args as you please)

Example:

python3.7 main.py -root_dir /.../Data/ -files sample.train sample.valid sample.test -input_size 15800 -batch_size 100 -which_model LR -lr 0.01 -eps 1e-06 -L2 1e-04

To use our dataloader on its own to generate data batches, use:

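# Assumed import path: EHRDataloader.py in the ehr_pytorch folder
from EHRDataloader import EHRdataFromPickles, EHRdataLoader

# Load the pickled multi-level list and wrap it in a batch loader
data = EHRdataFromPickles(root_dir='../data/', file=['toy.train'])
loader = EHRdataLoader(data, batch_size=128)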

Note: if you want to split the data, you must specify the ratios in EHRdataFromPickles(); otherwise, create separate loaders for your separate data files. If you want to shuffle batches before using them, add this line:

loader = iter_batch2(loader, len(loader))

Otherwise, iterate over the loader directly:

for i, batch in enumerate(loader):
    # feed the batch to your model here
    pass

Check out this notebook with a step by step guide of how to utilize our package.

Warning

This repo is for research purposes only. Use it at your own risk.
This repo is under GPL-v3 license.

Acknowledgements Hat-tip to:

Comments

kaplan meier

I attended your session at the ACM-BCB conference. Great presentation! I have one question regarding survival analysis: what is the purpose of the Kaplan-Meier plot used in the survival analysis in the ModelTraining file? Is it some kind of baseline for your actual models, or does it show that the survival probability predicted by the best model matches the Kaplan-Meier estimate?

opened by mehak25 2
Getting embedding error when running main.py with toy.train

Hi @ZhiGroup and @lrasmy,

I am very impressed by this work.

I am getting the attached error when trying to retrieve the embeddings in the EmbedPatients_MB(self,mb_t, mtd) method when using the toy.train file. I just wanted to test the repo's code with this sample data. Should I not use this file and just follow the ACM-BCB-Tutorial instead to generate the processed data?

Thank you so much for providing this code and these tutorials; they are very helpful.

Best Regards,

Aaron Reich

opened by agr505 1
Cell_type option

Currently, the user can input any cell_type (e.g. a cell_type of "QRNN" for the EHR_RNN model), which leads to mismatches in handling packPadMode.
=> Restrict the cell_type option to "RNN", "GRU", "LSTM". => Make cell_type "QRNN" and "TLSTM" the defaults for the qrnn and tlstm models.
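One possible way to enforce the restriction proposed above is argparse's `choices` parameter. The sketch below is only an illustration under stated assumptions: the `-cell_type` flag name, the `FIXED_CELL` mapping, the `resolve_cell_type` helper, and the "GRU" fallback are hypothetical and not taken from the repo's main.py.

```python
import argparse

def build_parser():
    """Parser that rejects any cell_type outside RNN/GRU/LSTM (sketch)."""
    parser = argparse.ArgumentParser(description="cell_type validation sketch")
    parser.add_argument("-which_model", default="RNN")
    # Hypothetical flag name; argparse exits with an error on other values.
    parser.add_argument("-cell_type", default=None, choices=["RNN", "GRU", "LSTM"])
    return parser

# Hypothetical mapping: these models always use their own cell type.
FIXED_CELL = {"QRNN": "QRNN", "TLSTM": "TLSTM"}

def resolve_cell_type(args):
    """Pick the cell type: fixed for QRNN/TLSTM, user choice otherwise."""
    if args.which_model in FIXED_CELL:
        return FIXED_CELL[args.which_model]
    return args.cell_type or "GRU"  # fallback value is an assumption

args = build_parser().parse_args(["-which_model", "QRNN"])
print(resolve_cell_type(args))
```

With this arrangement, a user can no longer pass "QRNN" as a cell type for a plain RNN model, while the QRNN and T-LSTM models get their cell type without any user input.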

opened by 2miatran 1
Mia test
MODIFIED PARTS: Main.py

Modify the code to take data with split options (split is True => split into train, test, valid; split is False => keep the file and sort)

Add a model prefix (the hospital name) and suffix (optional: user input) to the output file

Batch_size is used in EHRdataloader => need to pass the batch_size parameter to the dataloader instead of ut.epochs_run()

Results are different due to the embedding => No modification. Laila's suggestion: change the code in EHREmb.py

Eps (not required for the current optimizer, Adagrad, but might be needed later for other optimizers)

n_layer default to 1

args = parser.parse_args([])

Utils:

Remove batch_size in all functions

Add prefix, suffix to the epochs_run function

Note: mia_test_1 is first created for testing purpose, please ignore this file.
opened by 2miatran 1
Random results with each run even with setting Random seed

Testing GPU performance:

GPU 0 Run 1: Epoch 1 Train_auc : 0.8716401835745263 , Valid_auc : 0.8244826612068169 ,& Test_auc : 0.8398872287083271 Avg Loss: 0.2813216602802277 Train Time (0m 38s) Eval Time (0m 53s)

Epoch 2 Train_auc : 0.8938440516209567 , Valid_auc : 0.8162852367127903 ,& Test_auc : 0.836586122995983 Avg Loss: 0.26535209695498146 Train Time (0m 38s) Eval Time (0m 53s)

Epoch 3 Train_auc : 0.9090785000429356 , Valid_auc : 0.8268489421541162 ,& Test_auc : 0.8355234191881434 Avg Loss: 0.25156350443760556 Train Time (0m 38s) Eval Time (0m 53s) (edited)

lrasmy [3:27 PM]

GPU0 Run 2: Epoch 1 Train_auc : 0.870730593956147 , Valid_auc : 0.8267809126014227 ,& Test_auc : 0.8407658238915342 Avg Loss: 0.28322121808926265 Train Time (0m 39s) Eval Time (0m 53s)

Epoch 2 Train_auc : 0.8918280081196787 , Valid_auc : 0.814092171574357 ,& Test_auc : 0.8360580004715573 Avg Loss: 0.26621529906988145 Train Time (0m 39s) Eval Time (0m 53s)

Epoch 3 Train_auc : 0.9128840712381358 , Valid_auc : 0.8237124792427901 ,& Test_auc : 0.839372227662688 Avg Loss: 0.2513388389100631 Train Time (0m 39s) Eval Time (0m 54s)

lrasmy [3:43 PM]

GPU0 Run 3: Epoch 1 Train_auc : 0.8719306438569514 , Valid_auc : 0.8290540285789691 ,& Test_auc : 0.8416333372040562 Avg Loss: 0.28306034040947753 Train Time (0m 40s) Eval Time (0m 55s)

Epoch 2 Train_auc : 0.8962238893571299 , Valid_auc : 0.812984847168468 ,& Test_auc : 0.8358539036875299 Avg Loss: 0.26579822269578773 Train Time (0m 39s) Eval Time (0m 54s)

Epoch 3 Train_auc : 0.9131959085864382 , Valid_auc : 0.824907504397332 ,& Test_auc : 0.8411787765451596 Avg Loss: 0.24994653667012848 Train Time (0m 40s) Eval Time (0m 54s)

opened by lrasmy 1

Releases(v0.2-Feb20)

v0.2-Feb20(Feb 21, 2020)
This release offers faster and more memory-efficient code than the previously released version.

Key Changes:

Moving paddings and mini-batches related tensors creation to the EHR_dataloader

Creating the mini-batches list once before running the epochs

Adding RETAIN to the models list

