Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Overview

This is the official repository for the EMNLP 2021 long paper Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration. We provide code for training and evaluating Phrase-BERT in addition to the datasets used in the paper.

Update: the model is now also available on Hugging Face, thanks to help from whaleloops and nreimers!
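With a recent version of sentence-transformers, the model can be loaded directly from the Hugging Face hub instead of the manual download below. A minimal sketch, assuming the model id is whaleloops/phrase-bert (and a sentence-transformers version with hub support, newer than the 0.3.3 pinned below):

from sentence_transformers import SentenceTransformer

# Assumed Hugging Face model id for Phrase-BERT; requires a
# sentence-transformers version that can load models from the hub.
model = SentenceTransformer('whaleloops/phrase-bert')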

Setup

This repository depends on sentence-BERT version 0.3.3, which you can install from source using:

>>> git clone https://github.com/UKPLab/sentence-transformers.git --branch v0.3.3
>>> cd sentence-transformers/
>>> pip install -e .

Alternatively, you can install sentence-BERT with pip:

>>> pip install sentence-transformers==0.3.3
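To confirm that the expected version is installed, a quick sanity check (our addition, not part of the original setup instructions):

import sentence_transformers

# Phrase-BERT was developed against sentence-BERT 0.3.3.
print(sentence_transformers.__version__)  # expect '0.3.3'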

Quick Start

The following example shows how to use a trained Phrase-BERT model to embed phrases into dense vectors.

First download and unzip our model.

>>> cd <path-to-model-directory>
>>> wget https://storage.googleapis.com/phrase-bert/phrase-bert/phrase-bert-model.zip
>>> unzip phrase-bert-model.zip -d phrase-bert-model/
>>> rm phrase-bert-model.zip

Then load the Phrase-BERT model through the sentence-BERT interface:

from sentence_transformers import SentenceTransformer
model_path = '<path-to-model-directory>/phrase-bert-model'
model = SentenceTransformer(model_path)

You can compute phrase embeddings using Phrase-BERT as follows:

phrase_list = ['play an active role', 'participate actively', 'active lifestyle']
phrase_embs = model.encode(phrase_list)
[p1, p2, p3] = phrase_embs

As in sentence-BERT, the default output is a list of numpy arrays:

for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")

An example of computing the dot product of phrase embeddings:

import numpy as np
print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
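Note that these dot products are much larger than 1 because the embeddings are not unit-norm; normalizing each vector first makes the dot product coincide with the cosine similarities computed below. A small illustrative check (our addition):

import numpy as np

def unit(v):
    # Scale a vector to unit length so its dot product equals cosine similarity.
    return v / np.linalg.norm(v)

print(f'Normalized dot product of phrases 1 and 2: {np.dot(unit(p1), unit(p2))}')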

An example of computing cosine similarity of phrase embeddings:

import torch
from torch import nn
cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim(torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim(torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim(torch.tensor(p2), torch.tensor(p3))}')

The output should look like:

The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
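A typical corpus-exploration use of these embeddings is ranking the phrases in a corpus by similarity to a query phrase. The sketch below reuses the model loaded above; the corpus and query phrases are made-up examples of our own:

import numpy as np

corpus_phrases = ['healthy diet', 'machine learning', 'regular exercise', 'deep neural network']
query_phrase = 'physical fitness'

# Encode and L2-normalize, so a dot product is a cosine similarity.
corpus_embs = np.array(model.encode(corpus_phrases))
corpus_embs /= np.linalg.norm(corpus_embs, axis=1, keepdims=True)
query_emb = np.array(model.encode([query_phrase])[0])
query_emb /= np.linalg.norm(query_emb)

# Rank corpus phrases by similarity to the query, highest first.
scores = corpus_embs @ query_emb
for idx in np.argsort(-scores):
    print(f'{corpus_phrases[idx]}: {scores[idx]:.3f}')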

Evaluation

Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, described further in our paper: Turney, BiRD, PPDB, PPDB-filtered, and PAWS-short.

First, update config/model_path.py with the model path according to your directories. Then:

  • For evaluation on Turney, run python eval_turney.py

  • For evaluation on BiRD, run python eval_bird.py

  • For evaluation on PPDB / PPDB-filtered / PAWS-short, run eval_ppdb_paws.py as shown below:

    nohup python -u eval_ppdb_paws.py \
        --full_run_mode \
        --task <task-name> \
        --data_dir <data-directory> \
        --result_dir <result-directory> \
        > ./output.txt 2>&1 &
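For intuition, a similarity-style evaluation such as BiRD essentially correlates model similarities with human ratings over phrase pairs. A minimal sketch of that computation (the pairs and gold scores here are hypothetical stand-ins; the actual scripts handle data loading and task-specific details):

import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('<path-to-model-directory>/phrase-bert-model')

# Hypothetical phrase pairs with made-up human relatedness ratings.
pairs = [('heavy rain', 'downpour'), ('heavy rain', 'light breeze'), ('fast car', 'quick vehicle')]
gold_scores = [0.9, 0.2, 0.8]

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Correlate model similarity with the human ratings.
pred_scores = [cosine(*model.encode(list(pair))) for pair in pairs]
print('Spearman correlation:', spearmanr(gold_scores, pred_scores).correlation)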

Train your own Phrase-BERT

If you would like to go beyond using the pre-trained Phrase-BERT model, you can train your own Phrase-BERT on data from the domain you are interested in; please refer to phrase-bert/phrase_bert_finetune.py.

The datasets we used to fine-tune Phrase-BERT are provided as a training data CSV file and a validation data CSV file.

To reproduce the trained Phrase-BERT, run:

export INPUT_DATA_PATH=<directory-containing-the-data-files>
export TRAIN_DATA_FILE=<training-data-csv-filename>
export VALID_DATA_FILE=<validation-data-csv-filename>
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=<directory-to-save-the-trained-model>

python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH
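Once training completes, the fine-tuned model loads through the same sentence-BERT interface as the pre-trained one. A sketch, assuming the placeholder mirrors OUTPUT_MODEL_PATH above (the exact output directory layout depends on the training script):

from sentence_transformers import SentenceTransformer

# Point at the directory passed as OUTPUT_MODEL_PATH during training.
model = SentenceTransformer('<directory-to-save-the-trained-model>')
print(model.encode(['your domain-specific phrase'])[0].shape)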

Citation

Please cite us if you find this useful:

@inproceedings{phrasebertwang2021,
    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2021},
    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
}