PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Overview

Feature_CRF_AE

Feature_CRF_AE provides an implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging":

@inproceedings{zhou-etal-2022-Bridging,
  title     = {Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging},
  author    = {Zhou, Houquan and Li, Yang and Li, Zhenghua and Zhang, Min},
  booktitle = {Findings of ACL},
  year      = {2022},
  url       = {?},
  pages     = {?--?}
}

Please contact Jacob_Zhou \at outlook.com if you have any questions.

Contents

  • Installation
  • Performance
  • Usage

Installation

Feature_CRF_AE can be installed from source:

$ git clone https://github.com/Jacob-Zhou/FeatureCRFAE && cd FeatureCRFAE
$ bash scripts/setup.sh

The following requirements will be installed in scripts/setup.sh:

  • python: 3.7
  • allennlp: 1.2.2
  • pytorch: 1.6.0
  • transformers: 3.5.1
  • h5py: 3.1.0
  • matplotlib: 3.3.1
  • nltk: 3.5
  • numpy: 1.19.1
  • overrides: 3.1.0
  • scikit_learn: 1.0.2
  • seaborn: 0.11.0
  • tqdm: 4.49.0

For WSJ data, we use the ELMo representations of elmo_2x4096_512_2048cnn_2xhighway_5.5B from AllenNLP. For UD data, we use the ELMo representations released by HIT-SCIR.

The corresponding data and ELMo models can be downloaded as follows:

# 1) UD data and ELMo models:
$ bash scripts/prepare_data.sh
# 2) UD data, ELMo models as well as WSJ data 
#    [please replace ~/treebank3/parsed/mrg/wsj/ with your path to LDC99T42]
$ bash scripts/prepare_data.sh ~/treebank3/parsed/mrg/wsj/
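
Once downloaded, the AllenNLP ELMo weights can be loaded directly from Python. Below is a minimal sketch, assuming options/weights file names derived from the model prefix above (the exact names placed by scripts/prepare_data.sh may differ):

# Minimal sketch: loading the AllenNLP ELMo weights and computing
# representations for one sentence. File names are assumptions.
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

prefix = "elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B"
options_file = f"{prefix}_options.json"   # assumed file name
weight_file = f"{prefix}_weights.hdf5"    # assumed file name

# one output representation, no dropout at inference time
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "dog", "barked", "."]]
character_ids = batch_to_ids(sentences)           # (batch, words, chars)
with torch.no_grad():
    outputs = elmo(character_ids)
embeddings = outputs["elmo_representations"][0]   # (1, 4, 1024)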

Performance

M-1, 1-1, and VM denote many-to-one accuracy, one-to-one accuracy, and V-Measure, respectively; each model is run with five random seeds.

WSJ-All

Seed   M-1     1-1     VM
0      84.29   70.03   78.43
1      82.34   64.42   77.27
2      84.68   62.78   77.83
3      82.55   65.00   77.35
4      82.20   66.69   77.33
Avg.   83.21   65.78   77.64
Std.    1.18    2.75    0.49

WSJ-Test

Seed   M-1     1-1     VM
0      81.99   64.84   76.86
1      82.52   61.46   76.13
2      82.33   61.15   75.13
3      78.11   58.80   72.94
4      82.05   61.68   76.21
Avg.   81.40   61.59   75.45
Std.    1.85    2.15    1.54
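
For reference, here is a hedged sketch of how these three metrics are usually defined in the unsupervised POS tagging literature; many_to_one and one_to_one are illustrative helpers written for this sketch, not functions from this repository, and the repository's own evaluation code may differ in details:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import v_measure_score

def many_to_one(gold, pred):
    # map every induced cluster to its most frequent gold tag
    gold, pred = np.asarray(gold), np.asarray(pred)
    correct = sum(np.bincount(gold[pred == c]).max() for c in np.unique(pred))
    return correct / len(gold)

def one_to_one(gold, pred):
    # optimal 1-1 cluster-to-tag mapping via the Hungarian algorithm
    gold, pred = np.asarray(gold), np.asarray(pred)
    n = max(gold.max(), pred.max()) + 1
    overlap = np.zeros((n, n), dtype=int)
    np.add.at(overlap, (pred, gold), 1)   # co-occurrence counts
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    return overlap[rows, cols].sum() / len(gold)

gold = [0, 0, 1, 1, 2]  # gold tag ids
pred = [1, 1, 0, 0, 0]  # induced cluster ids
print(many_to_one(gold, pred))      # 0.8
print(one_to_one(gold, pred))       # 0.8
print(v_measure_score(gold, pred))  # V-Measure

Here linear_sum_assignment maximizes the total overlap between clusters and gold tags, which is the standard way to score the stricter 1-1 setting.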

Usage

We give some examples in scripts/examples.sh. Before running the code, activate the virtual environment by:

$ . scripts/set_environment.sh

Training

To train a model from scratch, it is preferable to use the command-line options, which are more flexible and customizable. Here are two training examples, one for WSJ and one for German UD:

$ python -u -m tagger.cmds.crf_ae train \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --train data/wsj/total.conll \
    --evaluate data/wsj/total.conll \
    --path save/crf_ae_wsj
$ python -u -m tagger.cmds.crf_ae train \
    --conf configs/crf_ae.ini \
    --ud-mode \
    --ud-feature \
    --ignore-capitalized \
    --language-specific-strip \
    --feat-min-freq 14 \
    --language de \
    --encoder elmo \
    --plm elmo_models/de \
    --train data/ud/de/total.conll \
    --evaluate data/ud/de/total.conll \
    --path save/crf_ae_de

For more instructions on training, please type python -m tagger.cmds.[crf_ae|feature_hmm] train -h.

Alternatively, we provide equivalent command entry points registered in setup.py: crf-ae and feature-hmm.

$ crf-ae train \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --train data/wsj/total.conll \
    --evaluate data/wsj/total.conll \
    --path save/crf_ae

Evaluation

To evaluate a trained model on a dataset:

$ python -u -m tagger.cmds.crf_ae evaluate \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --data data/wsj/total.conll \
    --path save/crf_ae

Predict

To predict tags with a trained model and save the results:

$ python -u -m tagger.cmds.crf_ae predict \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --data data/wsj/total.conll \
    --path save/crf_ae \
    --pred save/crf_ae/pred.conll
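
The predicted file can then be inspected with a few lines of Python. This is a minimal sketch assuming the usual CoNLL layout of one token per line with tab-separated columns and blank lines between sentences; the exact columns written by the tagger may differ:

# Minimal sketch: reading sentences back from the predicted CoNLL file.
def read_conll(path):
    sentences, sent = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    sentences.append(sent)
                sent = []
            else:
                sent.append(line.split("\t"))
    if sent:
        sentences.append(sent)
    return sentences

for token in read_conll("save/crf_ae/pred.conll")[0]:
    print(token)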