Associated Repository for "Translation between Molecules and Natural Language"

Related tags

Text Data & NLPMolT5
Overview

MolT5: Translation between Molecules and Natural Language

Associated repository for "Translation between Molecules and Natural Language".

Table of Contents

HuggingFace model checkpoints

All of our HuggingFace checkpoints are located here.

Pretrained MolT5-based checkpoints include:

You can also easily find our fine-tuned caption2smiles and smiles2caption models. For example, molt5-large-smiles2caption is a molt5-large model that has been further fine-tuned for the task of molecule captioning (i.e., smiles2caption).

Example usage for molecule captioning (i.e., smiles2caption):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-smiles2caption", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-large-smiles2caption')

input_text = 'C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example usage for molecule generation (i.e., caption2smiles):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-caption2smiles", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-large-caption2smiles')

input_text = 'The molecule is a monomethoxybenzene that is 2-methoxyphenol substituted by a hydroxymethyl group at position 4. It has a role as a plant metabolite. It is a member of guaiacols and a member of benzyl alcohols.'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

T5X-based model checkpoints

Pretraining (MolT5-based models)

We used the open-sourced t5x framework for pretraining MolT5-based models.

For pre-training MolT5-based models, please first go over this document. In our work, our pretraining task is a mixture of c4_v220_span_corruption and also our own task called zinc_span_corruption. The pretraining mixture is called zinc_and_c4_mix. The code snippet below illustrates how to define zinc_and_c4_mix (e.g., you can just add this code snippet to tasks.py). Our Gin config files for pretraining are located in configs/pretrain. Data files can be downloaded from here.

...
import tensorflow.compat.v2 as tf
...
seqio.TaskRegistry.add(
    'zinc_span_corruption',
    source=seqio.TFExampleDataSource(
        split_to_filepattern={
            'test': # Path to zinc_smiles_test.tfrecords,
            'validation': # Path to zinc_smiles_val.tfrecords,
            'train': # Path to zinc_smiles_train.tfrecords,
        },
        feature_description={
            'text': tf.io.FixedLenFeature([], dtype=tf.string),
        }),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                'inputs': None,
                'targets': 'text'
            }),
        seqio.preprocessors.tokenize,
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])

seqio.MixtureRegistry.add('zinc_and_c4_mix', [('zinc_span_corruption', 1),
                                              ('c4_v220_span_corruption', 1)])
)

Finetuning (MolT5-based models)

We also used the t5x framework for finetuning MolT5-based models. Please first go over this document. Our Gin config files for finetuning are located in configs/finetune. For each of the Gin file, you need to set the INITIAL_CHECKPOINT_PATH variables (please use one of the checkpoints mentioned in this section). Note that there are two new tasks, which are named caption2smiles and smiles2caption. The code snippet below illustrates how to define the tasks. Data files can be downloaded from here.

...
# Metrics
_TASK_EVAL_METRICS_FNS = [
    metrics.bleu,
    metrics.rouge,
    metrics.sequence_accuracy
]

# Data Source
DATA_SOURCE = seqio.TFExampleDataSource(
    split_to_filepattern={
        'train': # Path to chebi_20_train.tfrecords,
        'validation': # Path to chebi_20_dev.tfrecords,
        'test': # Path to chebi_20_test.tfrecords
    },
    feature_description={
        'caption': tf.io.FixedLenFeature([], dtype=tf.string),
        'smiles': tf.io.FixedLenFeature([], dtype=tf.string),
        'cid': tf.io.FixedLenFeature([], dtype=tf.string),
    }
)

# Molecular Captioning (smiles2caption)
seqio.TaskRegistry.add(
    'smiles2caption',
    source=DATA_SOURCE,
    preprocessors=[
        functools.partial(
            preprocessors.rekey,
            key_map={
                'inputs': 'smiles',
                'targets': 'caption'
            }),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=_TASK_EVAL_METRICS_FNS,
)

# Molecular Captioning (caption2smiles)
seqio.TaskRegistry.add(
    'caption2smiles',
    source=DATA_SOURCE,
    preprocessors=[
        functools.partial(
            preprocessors.rekey,
            key_map={
                'inputs': 'caption',
                'targets': 'smiles'
            }),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=_TASK_EVAL_METRICS_FNS,
)

Datasets

Citation

If you found our work useful, please cite:

@article{edwards2022translation,
  title={Translation between Molecules and Natural Language},
  author={Edwards, Carl and Lai, Tuan and Ros, Kevin and Honke, Garrett and Ji, Heng},
  journal={arXiv preprint arXiv:2204.11817},
  year={2022}
}
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 06, 2023
Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

HUSEIN ZOLKEPLI 151 Jan 05, 2023
Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention

Sinkhorn Transformer This is a reproduction of the work outlined in Sparse Sinkhorn Attention, with additional enhancements. It includes a parameteriz

Phil Wang 217 Nov 25, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk.

Simple-Vosk A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk. Check out the official Vosk G

2 Jun 19, 2022
An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

FantasyBert English | 中文 Introduction An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations. You can imp

Fan 137 Oct 26, 2022
A benchmark for evaluation and comparison of various NLP tasks in Persian language.

Persian NLP Benchmark The repository aims to track existing natural language processing models and evaluate their performance on well-known datasets.

Mofid AI 68 Dec 19, 2022
中文生成式预训练模型

T5 PEGASUS 中文生成式预训练模型,以mT5为基础架构和初始权重,通过类似PEGASUS的方式进行预训练。 详情可见:https://kexue.fm/archives/8209 Tokenizer 我们将T5 PEGASUS的Tokenizer换成了BERT的Tokenizer,它对中文更

410 Jan 03, 2023
This project deals with a simplified version of a more general problem of Aspect Based Sentiment Analysis.

Aspect_Based_Sentiment_Extraction Created on: 5th Jan, 2022. This project deals with an important field of Natural Lnaguage Processing - Aspect Based

Naman Rastogi 4 Jan 01, 2023
Final Project Bootcamp Zero

The Quest (Pygame) Descripción Este es el repositorio de código The-Quest para el proyecto final Bootcamp Zero de KeepCoding. El juego consiste en la

Seven-z01 1 Mar 02, 2022
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
Code release for "COTR: Correspondence Transformer for Matching Across Images"

COTR: Correspondence Transformer for Matching Across Images This repository contains the inference code for COTR. We plan to release the training code

UBC Computer Vision Group 358 Dec 24, 2022
Python library for interactive topic model visualization. Port of the R LDAvis package.

pyLDAvis Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley. pyLDA

Ben Mabey 1.7k Dec 20, 2022
LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

LightSeq is a high performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT2, Transform

Bytedance Inc. 2.5k Jan 03, 2023
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
Torchrecipes provides a set of reproduci-able, re-usable, ready-to-run RECIPES for training different types of models, across multiple domains, on PyTorch Lightning.

Recipes are a standard, well supported set of blueprints for machine learning engineers to rapidly train models using the latest research techniques without significant engineering overhead.Specifica

Meta Research 193 Dec 28, 2022
This repository contains the codes for LipGAN. LipGAN was published as a part of the paper titled "Towards Automatic Face-to-Face Translation".

LipGAN Generate realistic talking faces for any human speech and face identity. [Paper] | [Project Page] | [Demonstration Video] Important Update: A n

Rudrabha Mukhopadhyay 438 Dec 31, 2022
Search with BERT vectors in Solr and Elasticsearch

Search with BERT vectors in Solr and Elasticsearch

Dmitry Kan 123 Dec 29, 2022
An implementation of the Pay Attention when Required transformer

Pay Attention when Required (PAR) Transformer-XL An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pd

7 Aug 11, 2022