A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited in size and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in comparable corpora. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model:

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
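
Where wget and tar are not available, the same download-and-unpack step can be sketched with the Python standard library alone. This is a minimal sketch, not part of Yalign: the helper names archive_name, unpack and fetch_and_unpack are made up for illustration, and it assumes the archive is a trusted gzip-compressed tarball.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the wget command above.
MODEL_URL = (
    "https://raw.githubusercontent.com/machinalis/yalign/"
    "develop/data/models/0.1/en-es.tar.gz"
)


def archive_name(url: str) -> str:
    """Return the file name component of a download URL."""
    return url.rsplit("/", 1)[-1]


def unpack(archive: Path, dest: Path) -> None:
    """Extract a gzip-compressed tarball into dest (like `tar -xzf`)."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Only do this for archives you trust.
        tar.extractall(dest)


def fetch_and_unpack(url: str = MODEL_URL, dest: Path = Path(".")) -> Path:
    """Download the archive into dest, extract it there, return its path."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / archive_name(url)
    urllib.request.urlretrieve(url, archive)
    unpack(archive, dest)
    return archive


if __name__ == "__main__":
    fetch_and_unpack()
```

Run as a script, this downloads en-es.tar.gz into the current directory and extracts it in place, mirroring the two shell commands.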

Now use the yalign-align script with the English-to-Spanish model to align two web pages:

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
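
To call the script from a Python program, one option is to wrap the command line shown above with subprocess. This is a hedged sketch: build_align_command and run_align are hypothetical helper names, not part of Yalign, and the function returns the script's standard output as raw text without assuming anything about its format.

```python
import shutil
import subprocess
from typing import List


def build_align_command(model_dir: str, source_a: str, source_b: str) -> List[str]:
    """Build the argument list for: yalign-align <model> <source_a> <source_b>."""
    return ["yalign-align", model_dir, source_a, source_b]


def run_align(model_dir: str, source_a: str, source_b: str) -> str:
    """Run yalign-align and return whatever it writes to standard output."""
    if shutil.which("yalign-align") is None:
        raise RuntimeError("yalign-align not found on PATH; install yalign first")
    result = subprocess.run(
        build_align_command(model_dir, source_a, source_b),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


if __name__ == "__main__":
    print(run_align(
        "en-es",
        "http://en.wikipedia.org/wiki/Antiparticle",
        "http://es.wikipedia.org/wiki/Antipart%C3%ADcula",
    ))
```

Passing the arguments as a list avoids shell quoting problems with the percent-encoded characters in the Spanish URL.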

Yalign is not limited to any one language pair. By creating your own models, you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany
