MILES is a multilingual text simplifier inspired by LSBert, a BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking.

Last update: Oct 19, 2022

Overview

MILES

Multilingual Lexical Simplifier

About The Project

MILES is a multilingual text simplifier inspired by LSBert, a BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking. MILES currently supports 22 languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Indonesian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Ukrainian.

Because MILES uses no language-specific resources (WordNets, POS taggers, parallel corpora, etc.), its substitutions are not always synonyms of the complex words they replace. The selected substitutions are almost always simpler than the original, but they may alter the meaning of the text. Please keep this in mind, and feel free to download and tailor MILES to a language of your choosing!
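To make the three stages (CWI, candidate generation, candidate ranking) more concrete, here is a minimal sketch. It is not MILES's actual code: the frequency-threshold heuristic, the function names, the `min_count` value, and the toy frequency table are all assumptions made for illustration. The masked-language-model step is passed in as a plain function so the example runs without downloading a model.

```python
from typing import Callable, Dict, List

# Hypothetical type: takes a sentence containing one "[MASK]" token and
# returns candidate replacement words, best first.
MaskPredictor = Callable[[str], List[str]]


def is_complex(word: str, freq: Dict[str, int], min_count: int = 1000) -> bool:
    """Flag a word as complex if it is rare in a frequency list (assumed heuristic)."""
    return freq.get(word.lower(), 0) < min_count


def simplify_sentence(sentence: str, predict: MaskPredictor,
                      freq: Dict[str, int], min_count: int = 1000) -> str:
    """Replace each rare word with the most frequent suitable candidate."""
    words = sentence.split()
    for i, word in enumerate(words):
        # Tokens with punctuation or digits are left untouched in this sketch.
        if not word.isalpha() or not is_complex(word, freq, min_count):
            continue
        masked = " ".join(words[:i] + ["[MASK]"] + words[i + 1:])
        original_count = freq.get(word.lower(), 0)
        # Keep whole-word candidates that differ from the original word
        # and are more frequent than it.
        candidates = [
            c for c in predict(masked)
            if c.isalpha() and c.lower() != word.lower()
            and freq.get(c.lower(), 0) > original_count
        ]
        if candidates:
            words[i] = max(candidates, key=lambda c: freq.get(c.lower(), 0))
    return " ".join(words)


# Toy data standing in for a real frequency list and a real masked LM.
toy_freq = {"the": 90000, "cat": 5000, "sat": 4000,
            "perched": 120, "on": 80000, "mat": 3000}


def toy_predict(masked_sentence: str) -> List[str]:
    return ["sat", "##ing", "perched", "slept"]


print(simplify_sentence("the cat perched on the mat", toy_predict, toy_freq))
# prints: the cat sat on the mat
```

In a real setup, `predict` could wrap the Hugging Face `fill-mask` pipeline loaded with bert-base-multilingual-uncased. Note that nothing in this sketch checks whether the replacement preserves meaning, which is the limitation described above.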

Prerequisites

FastText Embeddings

Downloading fastText embeddings for your target language(s) is recommended; MILES uses them to make notably more accurate simplifications. To install fastText embeddings for MILES, download the .vec embeddings for your target language. Once done, place the .vec file in simplifier/embeddings/, then run the key vector generation script with the ISO 639-1 code for the selected language:

python simplifier/embeddings/gen_keyed_vectors.py <ISO 639-1 code>
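The internals of gen_keyed_vectors.py are not shown here, so the following is only a sketch of what a .vec file contains and how word vectors can help judge whether a candidate is close in meaning to the original word. A fastText .vec file is plain text: a header line with the vocabulary size and vector dimension, then one word per line followed by its vector components. The function names and the `limit` parameter below are hypothetical.

```python
import io
import math
from typing import Dict, List, Sequence, TextIO


def load_vec(handle: TextIO, limit: int = 50000) -> Dict[str, List[float]]:
    """Read up to `limit` word vectors from a fastText .vec text stream."""
    # The header line holds the vocabulary size and the vector dimension.
    _count, dim = map(int, handle.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# A tiny in-memory stand-in for a real .vec file (2-dimensional vectors).
sample = io.StringIO("3 2\nbig 1.0 0.0\nlarge 0.9 0.1\ncold 0.0 1.0\n")
vecs = load_vec(sample)

# "large" points in nearly the same direction as "big"; "cold" does not.
print(cosine(vecs["big"], vecs["large"]) > cosine(vecs["big"], vecs["cold"]))
# prints: True
```

With a real file, you would open it with `open(path, encoding="utf-8")` and pass the handle to `load_vec`. Libraries such as gensim offer a ready-made loader (`KeyedVectors.load_word2vec_format`) for the same format; whether MILES uses it is an assumption based only on the script's name.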

Usage

Flask App

You can run MILES simplifications either through the simple Flask app provided or from the command line. To start the Flask app, run app.py with the ISO 639-1 code of your target language:

python app.py -l <ISO 639-1 code>

Once running, open 127.0.0.1 in your browser and start simplifying!

Command Line

If you would prefer to use the command line, there are a couple of options available:

Simplifying sentences:

python simplify.py -t "<sentence>" -l <ISO 639-1 code>

Simplifying text files:

python simplify.py -f <text_file> -l <ISO 639-1 code>

Note: If no language code is provided, text will be simplified assuming it's English. The default language can be changed in simplifier/config.py.
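For reference, the options above can be modelled with a small argparse sketch. This is not the parser from simplify.py: only the short flags -t, -f and -l and the English default come from this README. The long option names, the mutually exclusive group, the `DEFAULT_LANGUAGE` constant, and the code table are assumptions. The table pairs each supported language with its standard ISO 639-1 code (Norwegian is listed as "no"; MILES may expect a different code).

```python
import argparse

# ISO 639-1 codes for the 22 languages listed in this README.
SUPPORTED = {
    "ar": "Arabic", "bg": "Bulgarian", "ca": "Catalan", "cs": "Czech",
    "da": "Danish", "nl": "Dutch", "en": "English", "fi": "Finnish",
    "fr": "French", "de": "German", "hu": "Hungarian", "id": "Indonesian",
    "it": "Italian", "no": "Norwegian", "pl": "Polish", "pt": "Portuguese",
    "ro": "Romanian", "ru": "Russian", "es": "Spanish", "sv": "Swedish",
    "tr": "Turkish", "uk": "Ukrainian",
}

# Assumed to mirror the default language set in simplifier/config.py.
DEFAULT_LANGUAGE = "en"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts either a sentence (-t) or a text file (-f)."""
    parser = argparse.ArgumentParser(description="Simplify text (sketch).")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-t", "--text", help="sentence to simplify")
    source.add_argument("-f", "--file", help="path to a text file to simplify")
    parser.add_argument("-l", "--language", default=DEFAULT_LANGUAGE,
                        choices=sorted(SUPPORTED),
                        help="ISO 639-1 code of the input language")
    return parser


# Without -l, the language falls back to English.
args = build_parser().parse_args(["-t", "An arduous journey"])
print(args.language, SUPPORTED[args.language])
# prints: en English
```

Using `choices` makes the parser reject unsupported codes up front instead of failing later when a model or embedding file is missing.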

Framework

Roadmap

See the open issues for a list of proposed features (and known issues).

Contact

If you have any questions or concerns, message me on LinkedIn or email me at [email protected].


Owner

Kane
