🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted at EMNLP 2021 🌴

Overview

PAUSE: Positive and Annealed Unlabeled Sentence Embedding

Sentence embedding refers to a set of effective and versatile techniques for converting raw text into numerical vector representations that can be used in a wide range of natural language processing (NLP) applications. Most of these techniques are either supervised or unsupervised. Compared to unsupervised methods, supervised ones make fewer assumptions about the optimization objective and usually achieve better results, but their training requires a large amount of labeled sentence pairs, which are not available in many industrial scenarios. To that end, we propose PAUSE (Positive and Annealed Unlabeled Sentence Embedding), a generic, end-to-end approach that learns high-quality sentence embeddings from partially labeled (positive-unlabeled, PU) datasets by jointly optimizing a supervised loss and a PU loss. The main highlights of PAUSE include:

  • good sentence embeddings can be learned from datasets with only a few positive labels;
  • it can be trained in an end-to-end fashion;
  • it can be directly applied to any dual-encoder model architecture;
  • it is extended to scenarios with an arbitrary number of classes;
  • polynomial annealing of the PU loss is proposed to stabilize the training (see the sketch after this list);
  • our experiments (reproduction steps are illustrated below) show that PAUSE consistently outperforms baseline methods.
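
The exact objective and annealing schedule are defined in the paper; the sketch below is only a rough illustration of the idea, combining a supervised loss with the generic non-negative PU risk estimator (Kiryo et al., 2017) weighted by a polynomial annealing factor. All names and the sigmoid surrogate loss are assumptions for illustration, not the implementation in this repo.

import tensorflow as tf

def nn_pu_loss(scores_pos, scores_unl, prior):
    # Generic non-negative PU risk estimator with a sigmoid surrogate loss:
    # l(s, +1) = sigmoid(-s) and l(s, -1) = sigmoid(s); `prior` is the expected
    # ratio of positive samples (cf. the --prior flag below).
    risk_pos = prior * tf.reduce_mean(tf.sigmoid(-scores_pos))
    risk_neg = tf.reduce_mean(tf.sigmoid(scores_unl)) - prior * tf.reduce_mean(tf.sigmoid(scores_pos))
    # Clipping the negative part at zero keeps the estimator non-negative.
    return risk_pos + tf.maximum(0.0, risk_neg)

def annealing_weight(step, total_steps, power=2.0):
    # Polynomial factor in [0, 1]; here it simply grows with training progress.
    return min(step / total_steps, 1.0) ** power

# Hypothetical wiring of the joint objective:
# loss = supervised_loss + annealing_weight(step, total_steps) * nn_pu_loss(pos_scores, unl_scores, prior)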

This repository contains the TensorFlow implementation of PAUSE for reproducing the experimental results. If you use this repo in your work, please cite:

@inproceedings{cao2021pause,
  title={PAUSE: Positive and Annealed Unlabeled Sentence Embedding},
  author={Cao, Lele and Larsson, Emil and von Ehrenheim, Vilhelm and Cavalcanti Rocha, Dhiana Deva and Martin, Anna and Horn, Sonja},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2021},
  url={https://arxiv.org/abs/2109.03155}
}

Prerequisites

Set up a virtual environment first to avoid breaking your native environment. If you use Anaconda, do

conda update conda
conda create --name py37-pause python=3.7
conda activate py37-pause

Then install the dependent libraries:

pip install -r requirements.txt
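
Optionally, run a quick sanity check of the environment. This assumes TensorFlow and tensorflow_hub are among the dependencies in requirements.txt (the --model flag used below points to TF Hub encoders):

import tensorflow as tf
import tensorflow_hub as hub

print("TensorFlow:", tf.__version__)
print("TF Hub:", hub.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))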

Unsupervised STS

Models are trained on a combination of the SNLI and Multi-Genre NLI datasets, which contain one million sentence pairs annotated with three labels: entailment, contradiction and neutral. The trained model is tested on the STS 2012-2016, STS benchmark, and SICK-Relatedness (SICK-R) datasets, which have labels between 0 and 5 indicating the semantic relatedness of sentence pairs.
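
In the PU setting, only a fraction of the positive pairs keeps its label and everything else is treated as unlabeled. As a rough illustration only (assuming, as in common PU setups for NLI, that entailment pairs play the role of positives; the actual sampling is controlled by --pos_sample_prec), a minimal sketch:

import random

def to_pu_dataset(nli_pairs, pos_sample_perc=0.05, seed=42):
    # Illustrative only: keep a small random fraction of entailment pairs as labeled
    # positives; everything else (including the remaining entailment pairs) is unlabeled.
    rng = random.Random(seed)
    positives, unlabeled = [], []
    for premise, hypothesis, label in nli_pairs:
        if label == "entailment" and rng.random() < pos_sample_perc:
            positives.append((premise, hypothesis))
        else:
            unlabeled.append((premise, hypothesis))
    return positives, unlabeled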

Training

Example 1: train PAUSE-small using 5% labels for 10 epochs

python train_nli.py \
  --batch_size=1024 \
  --train_epochs=10 \
  --model=small \
  --pos_sample_prec=5

Example 2: train PAUSE-base using 30% labels for 20 epochs

python train_nli.py \
  --batch_size=1024 \
  --train_epochs=20 \
  --model=base \
  --pos_sample_prec=30

To check the parameters, run

python train_nli.py --help

which will print the usage as follows.

usage: train_nli.py [-h] [--model MODEL]
                    [--pretrained_weights PRETRAINED_WEIGHTS]
                    [--train_epochs TRAIN_EPOCHS] [--batch_size BATCH_SIZE]
                    [--train_steps_per_epoch TRAIN_STEPS_PER_EPOCH]
                    [--max_seq_len MAX_SEQ_LEN] [--prior PRIOR]
                    [--train_lr TRAIN_LR] [--pos_sample_prec POS_SAMPLE_PREC]
                    [--log_dir LOG_DIR] [--model_dir MODEL_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         The tfhub link for the base embedding model
  --pretrained_weights PRETRAINED_WEIGHTS
                        The pretrained model if any
  --train_epochs TRAIN_EPOCHS
                        The max number of training epoch
  --batch_size BATCH_SIZE
                        Training mini-batch size
  --train_steps_per_epoch TRAIN_STEPS_PER_EPOCH
                        Step interval of evaluation during training
  --max_seq_len MAX_SEQ_LEN
                        The max number of tokens in the input
  --prior PRIOR         Expected ratio of positive samples
  --train_lr TRAIN_LR   The maximum learning rate
  --pos_sample_prec POS_SAMPLE_PREC
                        The percentage of sampled positive examples used in
                        training; should be one of 1, 10, 30, 50, 70
  --log_dir LOG_DIR     The path where the logs are stored
  --model_dir MODEL_DIR
                        The path where models and weights are stored

Testing

After the model is trained, the path where it is saved will be printed, e.g. ./artifacts/model/20210517-131724, where the directory name (20210517-131724) is the model ID. To test the model with that ID, run

python test_sts.py --model=20210517-131724

The test results on the STS datasets are printed to the console and also saved in the file ./artifacts/test/sts_20210517-131724.txt.
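
For reference, STS evaluation conventionally reports the Spearman correlation between the cosine similarity of the two sentence embeddings and the human relatedness score. A minimal sketch of that protocol (embed_fn is a hypothetical stand-in for the trained encoder; test_sts.py may differ in details):

import numpy as np
from scipy.stats import spearmanr

def sts_spearman(embed_fn, sents1, sents2, gold_scores):
    # embed_fn maps a list of sentences to an (n, dim) array of embeddings (hypothetical).
    e1, e2 = np.asarray(embed_fn(sents1)), np.asarray(embed_fn(sents2))
    cos = np.sum(e1 * e2, axis=1) / (np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1))
    return spearmanr(cos, gold_scores).correlation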

Supervised STS

Train

You can continue to finetune a pretrained model on the supervised STS benchmark (STSb). For example, assume we have trained a PAUSE model based on small BERT (say, located at ./artifacts/model/20210517-131725); to finetune it on STSb for 2 epochs, run

python ft_stsb.py \
  --model=small \
  --train_epochs=2 \
  --pretrained_weights=./artifacts/model/20210517-131725

Note that it is important to match the model size (--model) with the pretrained model size (--pretrained_weights).
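
A common recipe for supervised STSb finetuning, and only an assumption about what ft_stsb.py does internally, is to regress the cosine similarity of the two sentence embeddings onto the gold score rescaled to [0, 1]:

import tensorflow as tf

def stsb_regression_loss(emb1, emb2, gold_scores):
    # Cosine similarity between L2-normalized embeddings, regressed onto the
    # gold relatedness score rescaled from [0, 5] to [0, 1] (one common convention).
    cos = tf.reduce_sum(tf.math.l2_normalize(emb1, axis=1) *
                        tf.math.l2_normalize(emb2, axis=1), axis=1)
    return tf.reduce_mean(tf.square(cos - gold_scores / 5.0))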

Testing

After the model is finetuned, the path where it is saved will be printed, e.g. ./artifacts/model/20210517-131726, where the directory name (20210517-131726) is the model ID. To test the model with that ID, run

python ft_stsb_test.py --model=20210517-131726

SentEval evaluation

To evaluate the PAUSE embeddings using SentEval (preferably on a GPU), you need to download the data first:

cd ./data/downstream
./get_transfer_data.bash
cd ../..

Then, run the sent_eval.py script:

python sent_eval.py \
  --data_path=./data \
  --model=20210328-212801

where the --model parameter specifies the ID of the model you want to evaluate. By default, the model is expected to exist in the folder ./artifacts/model/embed. If you want to evaluate a trained model from our public GCS bucket (gs://motherbrain-pause/model/...), run, e.g. for PAUSE-NLI-base-50%:

python sent_eval.py \
  --data_path=./data \
  --model_location=gcs \
  --model=20210329-065047
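
For orientation, SentEval is driven through a batcher callback that returns one embedding per sentence; sent_eval.py handles this wiring for you. A minimal sketch of the interface (encode_fn is a hypothetical stand-in for the trained PAUSE encoder, and the senteval package from facebookresearch/SentEval is assumed to be importable):

import numpy as np
import senteval

def prepare(params, samples):
    # No task-specific preprocessing is needed for a fixed sentence encoder.
    return

def batcher(params, batch):
    sentences = [" ".join(tokens) if tokens else "." for tokens in batch]
    return np.asarray(encode_fn(sentences))  # encode_fn: hypothetical (batch, dim) encoder

params = {"task_path": "./data", "usepytorch": False, "kfold": 5}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(["STSBenchmark", "SICKRelatedness"])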

We provide the following models for demonstration purposes:

Model                  Model ID
PAUSE-NLI-base-100%    20210414-162525
PAUSE-NLI-base-70%     20210328-212801
PAUSE-NLI-base-50%     20210329-065047
PAUSE-NLI-base-30%     20210329-133137
PAUSE-NLI-base-10%     20210329-180000
PAUSE-NLI-base-5%      20210329-205354
PAUSE-NLI-base-1%      20210329-195024