Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

Overview

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi

CodeT5 demo

Updates

Sep 24, 2021

CodeT5 is now in hugginface!

You can simply load the model (CodeT5-small and CodeT5-base) and do the inference:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate one code span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints "{user.username}"

Introduction

This repo provides the code for reproducing the experiments in CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on 8.35M functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on 14 sub-tasks in a code intelligence benchmark - CodeXGLUE.

Paper link: https://arxiv.org/abs/2109.00859

Blog link: https://blog.einstein.ai/codet5/

The code currently includes two pre-trained checkpoints (CodeT5-small and CodeT5-base) and scripts to fine-tine them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE.

In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:

  • Text-to-code generation: generate code based on the natural language description.
  • Code autocompletion: complete the whole function of code given the target function name.
  • Code summarization: generate the summary of a function in natural language description.

Table of Contents

  1. Citation
  2. License
  3. Dependency
  4. Download
  5. Fine-tuning
  6. Get Involved

Citation

If you find this code to be useful for your research, please consider citing.

@inproceedings{
    wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
    author={Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
    year={2021},
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

violence, hate, and division,

environmental destruction,

abuse of human rights, or

the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing [email protected], and to use appropriate documentation when developing high-stakes applications of this model.

Dependency

  • Pytorch 1.7.1
  • tensorboard 2.4.1
  • transformers 4.6.1
  • tree-sitter 0.2.2

Download

Instructions to download:

pip install gsutil

gsutil -m cp -r "gs://sfr-codet5-data-research/data/" .

mkdir pretrained_models; cd pretrained_models
gsutil -m cp -r \
  "gs://sfr-codet5-data-research/pretrained_models/codet5_small" \
  "gs://sfr-codet5-data-research/pretrained_models/codet5_base" \
  .

The repository structure will look like the following after the download:

├── CODE_OF_CONDUCT.md
├── README.md
├── SECURITY.md
├── codet5.gif
├── configs.py
├── models.py
├── run_clone.py
├── run_gen.py
├── utils.py
├── _utils.py
├── LICENSE.txt
├── data
│   ├── clone
│   ├── concode
│   ├── defect
│   ├── refine
│   │   ├── medium
│   │   └── small
│   ├── summarize
│   │   ├── go
│   │   ├── java
│   │   ├── javascript
│   │   ├── php
│   │   ├── python
│   │   └── ruby
│   └── translate
├── evaluator
│   ├── bleu.py
│   ├── smooth_bleu.py
│   └── CodeBLEU
├── pretrained_models
│   ├── codet5_base
│   └── codet5_small
├── sh
│   ├── exp_with_args.sh
│   ├── run_exp.py
│   ├── results
│   ├── saved_models
│   └── tensorboard
└── tokenizer
    └── salesforce
        ├── codet5-merges.txt
        └── codet5-vocab.json    

Fine-tuning

Go to sh folder, set the WORKDIR in exp_with_args.sh to be your downloaded CodeT5 repository path.

You can use run_exp.py to run a broad set of experiments by simply passing the model_tag, task, and sub_task arguments. In total, we support four models (i.e., ['roberta', 'codebert', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use the sub_task to specify which specific datasets to fine-tine on.

For example, if you want to run CodeT5-base model on the code summarization task for Ruby, you can simply run:

python run_exp.py --model_tag codet5_base --task summarize --sub_task ruby

Besides, you can specify:

model_dir: where to save fine-tuning checkpoints
res_dir: where to save the performance results 
summary_dir: where to save the training curves
data_num: how many data instances to use, the default -1 is for using the full data
gpu: the index of the GPU to use in the cluster

You can also revise the suggested arguments here and refer to the argument flags in configs.py for the full available options. The saved training curves in summary_dir can be visualized using tensorboard.

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Salesforce
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 09, 2023
Interpretable Models for NLP using PyTorch

This repo is deprecated. Please find the updated package here. https://github.com/EdGENetworks/anuvada Anuvada: Interpretable Models for NLP using PyT

Sandeep Tammu 19 Dec 17, 2022
EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

Pre-train or Annotate? Domain Adaptation with a Constrained Budget This repo contains code and data associated with EMNLP 2021 paper "Pre-train or Ann

Fan Bai 8 Dec 17, 2021
GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates

GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates Vibhor Agarwal, Sagar Joglekar, Anthony P. Young an

Vibhor Agarwal 2 Jun 30, 2022
Entity Disambiguation as text extraction (ACL 2022)

ExtEnD: Extractive Entity Disambiguation This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Dis

Sapienza NLP group 121 Jan 03, 2023
Open-source offline translation library written in Python. Uses OpenNMT for translations

Open source neural machine translation in Python. Designed to be used either as a Python library or desktop application. Uses OpenNMT for translations and PyQt for GUI.

Argos Open Tech 1.6k Jan 01, 2023
Making text a first-class citizen in TensorFlow.

TensorFlow Text - Text processing in Tensorflow IMPORTANT: When installing TF Text with pip install, please note the version of TensorFlow you are run

1k Dec 26, 2022
基于pytorch+bert的中文事件抽取

pytorch_bert_event_extraction 基于pytorch+bert的中文事件抽取,主要思想是QA(问答)。 要预先下载好chinese-roberta-wwm-ext模型,并在运行时指定模型的位置。

西西嘛呦 31 Nov 30, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Prithivida 690 Jan 04, 2023
A website which allows you to play with the GPT-2 transformer

transformers A website which allows you to play with the GPT-2 model Built with ❤️ by raphtlw Table of contents Model Setup About Contributors Model T

raphtlw 2 Jan 27, 2022
Pipeline for chemical image-to-text competition

BMS-Molecular-Translation Introduction This is a pipeline for Bristol-Myers Squibb – Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got

Maksim Zhdanov 7 Sep 20, 2022
The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

THUHCSI 138 Oct 28, 2022
Rank-One Model Editing for Locating and Editing Factual Knowledge in GPT

Rank-One Model Editing (ROME) This repository provides an implementation of Rank-One Model Editing (ROME) on auto-regressive transformers (GPU-only).

Kevin Meng 130 Dec 21, 2022
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

Matheus Alves 2 Jan 06, 2022
CoNLL-English NER Task (NER in English)

CoNLL-English NER Task en | ch Motivation Course Project review the pytorch framework and sequence-labeling task practice using the transformers of Hu

Kevin 2 Jan 14, 2022
This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

GPT-2 Catalan playground and scripts to train a GPT-2 model either from scrath or from another pretrained model.

Laura 1 Jan 28, 2022
Snips Python library to extract meaning from text

Snips NLU Snips NLU (Natural Language Understanding) is a Python library that allows to extract structured information from sentences written in natur

Snips 3.7k Dec 30, 2022
Pangu-Alpha for Transformers

Pangu-Alpha for Transformers Usage Download MindSpore FP32 weights for GPU from here to data/Pangu-alpha_2.6B.ckpt Activate MindSpore environment and

One 5 Oct 01, 2022
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (S

InstaDeep Ltd 72 Dec 09, 2022