Context-Sensitive Misspelling Correction of Clinical Text via Conditional Independence, CHIL 2022

Overview

cim-misspelling

Pytorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.

image

This model (CIM) corrects misspellings with a char-based language model and a corruption model (edit distance). The model is being pre-trained and evaluated on clinical corpus and datasets. Please see the paper for more detailed explanation.

Requirements

How to Run

Clone the repo

$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git

Data preparing

  1. Download the MIMIC-III dataset from PhysioNet, especially NOTEEVENTS.csv and put under data/mimic3

  2. Download LRWD and prevariants of the SPECIALIST Lexicon from the LSG website (2018AB version) and put under data/umls.

  3. Download the English dictionary english.txt from here (commit 7cb484d) and put under data/english_words.

  4. Run scripts/build_vocab_corpus.ipynb to build the dictionary and split the MIMIC-III notes into files.

  5. Run the Jupyter notebook for the dataset that you want to download/pre-process:

    • MIMIC-III misspelling dataset, or ClinSpell (Fivez et al., 2017): scripts/preprocess_clinspell.ipynb
    • CSpell dataset (Lu et al., 2019): scripts/preprocess_cspell.ipynb
    • Synthetic misspelling dataset from the MIMIC-III: scripts/synthetic_dataset.ipynb
  6. Download the BlueBERT model from here under bert/ncbi_bert_{base|large}.

    • For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III"
    • For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III"

Pre-training the char-based LM on MIMIC-III

Please run pretrain_cim_base.sh (CIM-Base) or pretrain_cim_large.sh(CIM-Large) and to pretrain the character langauge model of CIM. The pre-training will evaluate the LM periodically by correcting synthetic misspells generated from the MIMIC-III data. You may need 2~4 GPUs (XXGB+ GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with the batch size 256. There are several options you may want to configure:

  • num_gpus: number of GPUs
  • batch_size: batch size
  • training_step: total number of steps to train
  • init_ckpt/init_step: the checkpoint file/steps to resume pretraining
  • num_beams: beam search width for evaluation
  • mimic_csv_dir: directory of the MIMIC-III csv splits
  • bert_dir: directory of the BlueBERT files

You can also download the pre-trained LMs and put under model/:

Misspelling Correction with CIM

Please specify the dataset dir and the file to evaluate in the evaluation script (eval_cim_base.sh or eval_cim_large.sh), and run the script.
You may want to set init_step to specify the checkpoint you want to load

Cite this work

@InProceedings{juyong2022context,
  title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
  author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages = {234--247},
  year = {2022},
  volume = {174},
  series = {Proceedings of Machine Learning Research},
  month = {07--08 Apr},
  publisher = {PMLR}
}
Owner
Juyong Kim
Juyong Kim
Simple converter for deploying Stable-Baselines3 model to TFLite and/or Coral

Running SB3 developed agents on TFLite or Coral Introduction I've been using Stable-Baselines3 to train agents against some custom Gyms, some of which

Gary Briggs 16 Oct 11, 2022
Torchyolo - Yolov3 ve Yolov4 modellerin Pytorch uygulamasıdır

TORCHYOLO : Yolo Modellerin Pytorch Uygulaması Yapılacaklar: Yolov3 model.py ve

Kadir Nar 3 Aug 22, 2022
A script helps the user to update Linux and Mac systems through the terminal

Description This script helps the user to update Linux and Mac systems through the terminal. All the user has to install some requirements and then ru

Roxcoder 2 Jan 23, 2022
Greedy Gaussian Segmentation

GGS Greedy Gaussian Segmentation (GGS) is a Python solver for efficiently segmenting multivariate time series data. For implementation details, please

Stanford University Convex Optimization Group 72 Dec 07, 2022
some academic posters as references. May we have in-person poster session soon!

some academic posters as references. May we have in-person poster session soon!

Bolei Zhou 472 Jan 06, 2023
Code for CVPR2021 paper 'Where and What? Examining Interpretable Disentangled Representations'.

PS-SC GAN This repository contains the main code for training a PS-SC GAN (a GAN implemented with the Perceptual Simplicity and Spatial Constriction c

Xinqi/Steven Zhu 40 Dec 16, 2022
We propose a new method for effective shadow removal by regarding it as an exposure fusion problem.

Auto-exposure fusion for single-image shadow removal We propose a new method for effective shadow removal by regarding it as an exposure fusion proble

Qing Guo 146 Dec 31, 2022
This repository contains code to run experiments in the paper "Signal Strength and Noise Drive Feature Preference in CNN Image Classifiers."

Signal Strength and Noise Drive Feature Preference in CNN Image Classifiers This repository contains code to run experiments in the paper "Signal Stre

0 Jan 19, 2022
Nb workflows - A workflow platform which allows you to run parameterized notebooks programmatically

NB Workflows Description If SQL is a lingua franca for querying data, Jupyter sh

Xavier Petit 6 Aug 18, 2022
This is the replication package for paper submission: Towards Training Reproducible Deep Learning Models.

This is the replication package for paper submission: Towards Training Reproducible Deep Learning Models.

0 Feb 02, 2022
Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations, CVPR 2019 (Oral)

Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations The code of: Weakly Supervised Learning of Instance Segmentation with I

Jiwoon Ahn 472 Dec 29, 2022
Simple ONNX operation generator. Simple Operation Generator for ONNX.

sog4onnx Simple ONNX operation generator. Simple Operation Generator for ONNX. https://github.com/PINTO0309/simple-onnx-processing-tools Key concept V

Katsuya Hyodo 6 May 15, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.datasets: The raw text iterators for common NLP datasets torchtext.data: Some basic NLP building bloc

3.2k Jan 08, 2023
基于tensorflow 2.x的图片识别工具集

Classification.tf2 基于tensorflow 2.x的图片识别工具集 功能 粗粒度场景图片分类 细粒度场景图片分类 其他场景图片分类 模型部署 tensorflow serving本地推理和docker部署 tensorRT onnx ... 数据集 https://hyper.a

Wei Qi 1 Nov 03, 2021
Python scripts for performing 3D human pose estimation using the Mobile Human Pose model in ONNX.

Python scripts for performing 3D human pose estimation using the Mobile Human Pose model in ONNX.

Ibai Gorordo 99 Dec 31, 2022
Public repo for the ICCV2021-CVAMD paper "Is it Time to Replace CNNs with Transformers for Medical Images?"

Is it Time to Replace CNNs with Transformers for Medical Images? Accepted at ICCV-2021: Workshop on Computer Vision for Automated Medical Diagnosis (C

Christos Matsoukas 80 Dec 27, 2022
Semantic Image Synthesis with SPADE

Semantic Image Synthesis with SPADE New implementation available at imaginaire repository We have a reimplementation of the SPADE method that is more

NVIDIA Research Projects 7.3k Jan 07, 2023
Spectral normalization (SN) is a widely-used technique for improving the stability and sample quality of Generative Adversarial Networks (GANs)

Why Spectral Normalization Stabilizes GANs: Analysis and Improvements [paper (NeurIPS 2021)] [paper (arXiv)] [code] Authors: Zinan Lin, Vyas Sekar, Gi

Zinan Lin 32 Dec 16, 2022
Weakly Supervised Segmentation by Tensorflow.

Weakly Supervised Segmentation by Tensorflow. Implements semantic segmentation in Simple Does It: Weakly Supervised Instance and Semantic Segmentation, by Khoreva et al. (CVPR 2017).

CHENG-YOU LU 52 Dec 27, 2022
(CVPR 2022 - oral) Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry Official implementation of the paper Multi-View Depth Est

Bae, Gwangbin 138 Dec 28, 2022