Tandem Mass Spectrum Prediction with Graph Transformers

Overview

MassFormer

This is the original implementation of MassFormer, a graph transformer for small molecule MS/MS prediction. Check out the preprint on arxiv.

Setting Up Environment

We recommend using conda. Three conda yml files are provided in the env/ directory (cpu.yml, cu101.yml, cu102.yml), providing different pytorch installation options (CPU-only, CUDA 10.1, CUDA 10.2). They can be trivially modified to support other versions of CUDA.

To set up an environment, run the command conda env create -f ${CONDA_YAML}, where ${CONDA_YAML} is the path to the desired yaml file.

Downloading NIST Data

Note: this step requires a Windows System or Virtual Machine

The NIST 2020 LC-MS/MS dataset can be purchased from an authorized distributor. The spectra and associated compounds can be exported to MSP/MOL format using the included lib2nist software. There is a single MSP file which contains all of the mass spectra, and multiple MOL files which include the molecular structure information for each spectrum (linked by ID). We've included a screenshot describing the lib2nist export settings.

Alt text

There is a minor bug in the export software that sometimes results in errors when parsing the MOL files. To fix this bug, run the script python mol_fix.py ${MOL_DIR}, where ${MOL_DIR} is a path to the NIST export directory with MOL files.

Downloading Massbank Data

The MassBank of North America (MB-NA) data is in MSP format, with the chemical information provided in the form of a SMILES string (as opposed to a MOL file). It can be downloaded from the MassBank website, under the tab "LS-MS/MS Spectra".

Exporting and Preparing Data

We recommend creating a directory called data/ and placing the downloaded and uncompressed data into a folder data/raw/.

To parse both of the datasets, run parse_and_export.py. Then, to prepare the data for model training, run prepare_data.py. By default the processed data will end up in data/proc/.

Setting Up Weights and Biases

Our implementation uses Weights and Biases (W&B) for logging and visualization. For full functionality, you must set up a free W&B account.

Training Models

A default config file is provided in "config/template.yml". This trains a MassFormer model on the NIST HCD spectra. Our experiments used systems with 32GB RAM, 1 Nvidia RTX 2080 (11GB VRAM), and 6 CPU cores.

The config/ directory has a template config file template.yml and 8 files corresponding to the experiments from the paper. The template config can be modified to train models of your choosing.

To train a template model without W&B with only CPU, run python runner.py -w False -d -1

To train a template model with W&B on CUDA device 0, run python runner.py -w True -d 0

Reproducing Tables

To reproduce a model from one of the experiments in Table 2 or Table 3 from the paper, run python runner.py -w True -d 0 -c ${CONFIG_YAML} -n 5 -i ${RUN_ID}, where ${CONFIG_YAML} refers to a specific yaml file in the config/ directory and ${RUN_ID} refers to an arbitrary but unique integer ID.

Reproducing Visualizations

The explain.py script can be used to reproduce the visualizations in the paper, but requires a trained model saved on W&B (i.e. by running a script from the previous section).

To reproduce a visualization from Figures 2,3,4,5, run python explain.py ${WANDB_RUN_ID} --wandb_mode=online, where ${WANDB_RUN_ID} is the unique W&B run id of the desired model's completed training script. The figues will be uploaded as PNG files to W&B.

Reproducing Sweeps

The W&B sweep config files that were used to select model hyperparameters can be found in the sweeps/ directory. They can be initialized using wandb sweep ${PATH_TO_SWEEP}.

Owner
Röst Lab
Röst lab at U of T -- join us at https://gitter.im/Roestlab/Lobby
Röst Lab
🐞 📊 Ladybug extension to generate 2D charts

ladybug-charts Ladybug extension to generate 2D charts. Installation pip install ladybug-charts QuickStart import ladybug_charts API Documentation Loc

Ladybug Tools 3 Dec 30, 2022
TensorDebugger (TDB) is a visual debugger for deep learning. It extends TensorFlow with breakpoints + real-time visualization of the data flowing through the computational graph

TensorDebugger (TDB) is a visual debugger for deep learning. It extends TensorFlow (Google's Deep Learning framework) with breakpoints + real-time visualization of the data flowing through the comput

Eric Jang 1.4k Dec 15, 2022
clock_plot provides a simple way to visualize timeseries data, mapping 24 hours onto the 360 degrees of a polar plot

clock_plot clock_plot provides a simple way to visualize timeseries data mapping 24 hours onto the 360 degrees of a polar plot. For usage, please see

12 Aug 24, 2022
Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

Petrel Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python. NOTE: The base Storm package provides storm.py, which

AirSage 247 Dec 18, 2021
A tool to plot and execute Rossmos's Formula, that helps to catch serial criminals using mathematics

Rossmo Plotter A tool to plot and execute Rossmos's Formula using python, that helps to catch serial criminals using mathematics Author: Amlan Saha Ku

Amlan Saha Kundu 3 Aug 29, 2022
Generate SVG (dark/light) images visualizing (private/public) GitHub repo statistics for profile/website.

Generate daily updated visualizations of GitHub user and repository statistics from the GitHub API using GitHub Actions for any combination of private and public repositories, whether owned or contri

Adam Ross 2 Dec 16, 2022
Rick and Morty Data Visualization with python

Rick and Morty Data Visualization For this project I looked at data for the TV show Rick and Morty Number of Episodes at a Certain Location Here is th

7 Aug 29, 2022
Shaded 😎 quantile plots

shadyquant 😎 This python package allows you to quantile and plot lines where you have multiple samples, typically for visualizing uncertainty. Your d

Mehrad Ansari 13 Sep 29, 2022
A custom qq-plot for two sample data comparision

QQ-Plot 2 Sample Just a gist to include the custom code to draw a qq-plot in python when dealing with a "two sample problem". This means when u try to

1 Dec 20, 2021
Cryptocurrency Centralized Exchange Visualization

This is a simple one that uses Grafina to visualize cryptocurrency from the Bitkub exchange. This service will make a request to the Bitkub API from your wallet and save the response to Postgresql. G

Popboon Mahachanawong 1 Nov 24, 2021
OpenStats is a library built on top of streamlit that extracts data from the Github API and shows the main KPIs

Open Stats Discover and share the KPIs of your OpenSource project. OpenStats is a library built on top of streamlit that extracts data from the Github

Pere Miquel Brull 4 Apr 03, 2022
Exploratory analysis and data visualization of aircraft accidents and incidents in Brazil.

Exploring aircraft accidents in Brazil Occurrencies with aircraft in Brazil are investigated by the Center for Investigation and Prevention of Aircraf

Augusto Herrmann 5 Dec 14, 2021
This is a Web scraping project using BeautifulSoup and Python to scrape basic information of all the Test matches played till Jan 2022.

Scraping-test-matches-data This is a Web scraping project using BeautifulSoup and Python to scrape basic information of all the Test matches played ti

Souradeep Banerjee 4 Oct 10, 2022
EPViz is a tool to aid researchers in developing, validating, and reporting their predictive modeling outputs.

EPViz (EEG Prediction Visualizer) EPViz is a tool to aid researchers in developing, validating, and reporting their predictive modeling outputs. A lig

Jeff 2 Oct 19, 2022
Automatization of BoxPlot graph usin Python MatPlotLib and Excel

BoxPlotGraphAutomation Automatization of BoxPlot graph usin Python / Excel. This file is an automation of BoxPlot-Graph using python graph library mat

EricAugustin 1 Feb 07, 2022
Data science project for exploratory analysis on the kcse grades dataset (Kamilimu Data Science Track)

Kcse-Data-Analysis Data science project for exploratory analysis on the kcse grades dataset (Kamilimu Data Science Track) Findings The performance of

MUGO BRIAN 1 Feb 23, 2022
GitHub English Top Charts

Help you discover excellent English projects and get rid of the interference of other spoken language.

kon9chunkit 529 Jan 02, 2023
Rockstar - Makes you a Rockstar C++ Programmer in 2 minutes

Rockstar Rockstar is one amazing library, which will make you a Rockstar Programmer in just 2 minutes. In last decade, people learned C++ in 21 days.

4k Jan 05, 2023
A high-level plotting API for pandas, dask, xarray, and networkx built on HoloViews

hvPlot A high-level plotting API for the PyData ecosystem built on HoloViews. Build Status Coverage Latest dev release Latest release Docs What is it?

HoloViz 694 Jan 04, 2023
A set of three functions, useful in geographical calculations of different sorts

GreatCircle A set of three functions, useful in geographical calculations of different sorts. Available for PHP, Python, Javascript and Ruby. Live dem

72 Sep 30, 2022