PyTorch implementations of BERT-based Spelling Error Correction Models.

Last update: Dec 30, 2022

Overview

BertBasedCorrectionModels

BERT-based text correction models, implemented in PyTorch.

Data preparation

Download the SIGHAN datasets from http://nlp.ee.ncu.edu.tw/resource/csc.html
Extract the datasets and copy every '.sgml' file in the extracted folders to the datasets/csc/ directory
Copy 'SIGHAN15_CSC_TestInput.txt' and 'SIGHAN15_CSC_TestTruth.txt' to the datasets/csc/ directory
Download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml to the datasets/csc directory

Make sure the following files are present in datasets/csc:

train.sgml
B1_training.sgml
C1_training.sgml  
SIGHAN15_CSC_A2_Training.sgml  
SIGHAN15_CSC_B2_Training.sgml  
SIGHAN15_CSC_TestInput.txt
SIGHAN15_CSC_TestTruth.txt
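
Before training, it can help to confirm that the data directory is complete. The following is a minimal sketch and not part of the project: the helper name missing_files is hypothetical, and the file list simply mirrors the one above.

```python
import tempfile
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = [
    "train.sgml",
    "B1_training.sgml",
    "C1_training.sgml",
    "SIGHAN15_CSC_A2_Training.sgml",
    "SIGHAN15_CSC_B2_Training.sgml",
    "SIGHAN15_CSC_TestInput.txt",
    "SIGHAN15_CSC_TestTruth.txt",
]


def missing_files(data_dir="datasets/csc"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from the project root so that the relative path datasets/csc resolves correctly.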

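Environment setup

Use an existing Python environment, or create a new one with conda create -n <env_name> python=3.7 (recommended)
Clone this project and enter the project root directory
Install the dependencies: pip install -r requirements.txt
If you get an error saying the GLIBC version is too low (upgrading GLIBC is risky and not recommended), install a lower version of openCC instead (for example 1.1.0)
Add the project root to the Python path in the current terminal: export PYTHONPATH=.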

Training

Run the following command to train a model. The first run preprocesses the data automatically.

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

Choose a different config file to train a different model. The following config files are currently supported:

train_bert4csc.yml
train_macbert4csc.yml
train_SoftMaskedBert.yml

If you have other requirements, adjust the parameters in the config file as needed.
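
To train the supported models one after another, the training command can be assembled programmatically. This is a minimal sketch under stated assumptions: build_train_command is a hypothetical helper, and the csc/ prefix and the --config_file flag are taken from the command shown above.

```python
import sys

# Config file names taken from the list above.
SUPPORTED_CONFIGS = (
    "train_bert4csc.yml",
    "train_macbert4csc.yml",
    "train_SoftMaskedBert.yml",
)


def build_train_command(config_name):
    """Build the argument list for tools/train_csc.py with the given config."""
    if config_name not in SUPPORTED_CONFIGS:
        raise ValueError(f"unsupported config file: {config_name}")
    return [
        sys.executable,
        "tools/train_csc.py",
        "--config_file",
        f"csc/{config_name}",
    ]


if __name__ == "__main__":
    for name in SUPPORTED_CONFIGS:
        # Printed only; pass the list to subprocess.run(..., check=True)
        # from the project root to actually start training.
        print(" ".join(build_train_command(name)))
```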

Results

SoftMaskedBert

Component	Sentence-level acc	P	R	F
Detection	0.5045	0.8252	0.8416	0.8333
Correction	0.8055	0.9395	0.8748	0.9060

BERT-style models

char level

Model	P	R	F
BERT4CSC	0.9269	0.8651	0.8949
MACBERT4CSC	0.9380	0.8736	0.9047

sentence level

Model	Acc	P	R	F
BERT4CSC	0.7990	0.8482	0.7214	0.7797
MACBERT4CSC	0.8027	0.8525	0.7251	0.7836
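
The f column in the tables above is consistent with the harmonic mean of p and r, that is, the usual F1 score. The source does not state the formula, so treat this as an assumption; the sketch below checks it against the reported numbers, with a small tolerance because p and r are themselves rounded to four decimals.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (row label, p, r, reported f), copied from the tables above.
ROWS = [
    ("SoftMaskedBert Detection", 0.8252, 0.8416, 0.8333),
    ("SoftMaskedBert Correction", 0.9395, 0.8748, 0.9060),
    ("BERT4CSC char level", 0.9269, 0.8651, 0.8949),
    ("MACBERT4CSC char level", 0.9380, 0.8736, 0.9047),
    ("BERT4CSC sentence level", 0.8482, 0.7214, 0.7797),
    ("MACBERT4CSC sentence level", 0.8525, 0.7251, 0.7836),
]

# Allow for rounding of the inputs to four decimal places.
TOLERANCE = 2e-4

if __name__ == "__main__":
    for label, p, r, reported in ROWS:
        computed = f_score(p, r)
        status = "ok" if abs(computed - reported) < TOLERANCE else "mismatch"
        print(f"{label}: computed {computed:.4f}, reported {reported:.4f} ({status})")
```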

Inference

Method 1: use the inference script:

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# Or pass the path of a text file with one sentence per line
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

where /ml/data/text.txt contains:

我今天很高心
你这个辣鸡模型只能做错别字纠正
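
The input file can also be produced from a list of sentences. The following is a minimal sketch, assuming only the one-sentence-per-line format and the --ckpt_fn and --text_file flags shown above; write_text_file and build_inference_command are hypothetical helpers, not part of the project.

```python
import tempfile
from pathlib import Path


def write_text_file(path, sentences):
    """Write one sentence per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")


def build_inference_command(ckpt_fn, text_file):
    """Build the argument list for inference.py using the flags shown above."""
    return [
        "python",
        "inference.py",
        "--ckpt_fn",
        ckpt_fn,
        "--text_file",
        str(text_file),
    ]


if __name__ == "__main__":
    sentences = ["我今天很高心", "你这个辣鸡模型只能做错别字纠正"]
    with tempfile.TemporaryDirectory() as tmp_dir:
        text_file = Path(tmp_dir) / "text.txt"
        write_text_file(text_file, sentences)
        # Printed only; run the command from the project root to get predictions.
        print(" ".join(build_inference_command("epoch=0-val_loss=0.03.ckpt", text_file)))
```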

Method 2: call the model directly

# `model` is an already loaded, trained correction model
texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)
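
To see what the model changed, the input can be compared with the corrected text character by character. This is a minimal sketch: it assumes the prediction is a corrected string of the same length as its input (the source does not specify the return format of model.predict), and diff_corrections is a hypothetical helper.

```python
def diff_corrections(source, corrected):
    """Return (index, original_char, corrected_char) for each differing position."""
    if len(source) != len(corrected):
        raise ValueError("expected source and corrected text of equal length")
    return [
        (i, s, c)
        for i, (s, c) in enumerate(zip(source, corrected))
        if s != c
    ]


if __name__ == "__main__":
    # Hypothetical model output, written by hand for illustration only.
    source = "今天我很高心"
    corrected = "今天我很高兴"
    print(diff_corrections(source, corrected))
```

For the hand-written pair above, the only difference is at index 5, where '心' becomes '兴'.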

Method 3: export the BERT weights and load them with transformers or pycorrector

Export the BERT weights with convert_to_pure_state_dict.py
For the remaining steps, see https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md

Citation

If you use this project in your research, please cite it as follows:

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

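The source code is licensed under the Apache License 2.0 and may be used commercially free of charge. Please include a link to this project and the license in your product documentation. This project is protected by copyright law, and infringement will be pursued.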

Owner

Heng Cai
