Blazing fast language detection using fastText model

Last update: Dec 20, 2022

Overview

Luga

Blazing fast language detection using fastText's language models

Luga is a Swahili word for language. fastText provides blazing fast language detection, but downloading and loading its models is a bit awkward, and its API is not pretty. This is why luga was born.

Installation

python -m pip install -U luga

Usage:

Note: The first call downloads the model for you. This happens only once.

from luga import language

print(language("the world has ended yesterday"))

Coming soon ...

TODO:

refactor artifacts.py
auto checkers with pre-commit | invoke
write more tests
write github actions
create a smart data checker (a fast List[str], what to do with non-string values)
make it faster with Cython
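
The data-checker item above is still open. As a purely hypothetical sketch (the name `check_texts`, the `placeholder` parameter, and the choice to replace non-string entries rather than drop them are all assumptions, not luga's behaviour), such a checker might normalise a batch before it reaches the model:

```python
from typing import Any, Iterable, List


def check_texts(texts: Iterable[Any], placeholder: str = "") -> List[str]:
    """Return one single-line string per input item.

    Hypothetical helper, not part of luga: non-string items (None, numbers,
    ...) are replaced by `placeholder`, and runs of whitespace, including
    newlines, are collapsed to single spaces.
    """
    cleaned = []
    for item in texts:
        if isinstance(item, str):
            cleaned.append(" ".join(item.split()))
        else:
            cleaned.append(placeholder)
    return cleaned


print(check_texts(["hello\nworld", None, 42, "  hej  "]))
# ['hello world', '', '', 'hej']
```

Replacing rather than dropping keeps the output aligned with the input, which matters if the results are written back to a table column.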

Comments

fix: Fix invalid pytest dependency version
poetry does not want to accept flake8 as a valid version. Fixes issue #13

fix: Fix invalid pytest dependency version

fix: Use fasttext-wheel instead of fasttext
opened by saevarb 1
Installation fails with recent poetry due to `fasttext` issues

Hey!

Trying to install fasttext with a recent poetry version fails, as explained in this issue: https://github.com/python-poetry/poetry/issues/6113
This is because fasttext does some really funky things and tries to run a global pip during install. So building luga, or using any package that depends on it, doesn't work. :/

This means that columbus doesn't build either, since it depends on luga. However, as is outlined in the issue there is a solution: using fasttext-wheel.

I pulled down luga and columbus and updated luga to use fasttext-wheel instead, and managed to get it to install, which also allowed me to build a new version of columbus using the new luga build.

opened by saevarb 1

SSL WRONG_VERSION_NUMBER

Solution from httpx

import httpx
import ssl

ssl_context = httpx.create_ssl_context()
ssl_context.options ^= ssl.OP_NO_TLSv1  # Enable TLS 1.0 back
resp = httpx.get(..., verify=ssl_context)

opened by Proteusiq 0

Return array for compatibility with pandas

This fails since pandas expects an array and luga returns a list

texts.loc[languages(texts["texts"].to_list(), only_language=True) == "da"]

But this works

texts.loc[np.array(languages(texts["texts"].to_list(), only_language=True)) == "da"]
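
The first form fails because comparing a Python list to a string with `==` yields a single `False`, not an element-wise mask. A minimal sketch of the difference, using a made-up `labels` list as a stand-in for what `languages(..., only_language=True)` returns:

```python
labels = ["da", "en", "da"]  # stand-in for luga's output; values are made up

# list == str compares the whole list to the string: one bool, not a mask
print(labels == "da")  # False

# an element-wise mask has to be built explicitly,
# either like this or with np.array(labels) == "da"
mask = [label == "da" for label in labels]
print(mask)  # [True, False, True]
```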

opened by nthomsencph 0

Releases(v0.2.7)

v0.2.7 (Dec 18, 2022): luga-0.2.7-py3-none-any.whl (5.55 KB), luga-0.2.7.tar.gz (5.34 KB)
v0.2.6 (Sep 28, 2022): luga-0.2.6-py3-none-any.whl (5.51 KB), luga-0.2.6.tar.gz (5.32 KB)
v0.2.5 (Apr 19, 2022): luga-0.2.5-py3-none-any.whl (5.50 KB), luga-0.2.5.tar.gz (5.39 KB)
v0.2.4 (Dec 23, 2021): luga-0.2.4-py3-none-any.whl (4.60 KB), luga-0.2.4.tar.gz (4.52 KB)
v0.2.3 (Dec 22, 2021): luga-0.2.3-py3-none-any.whl (4.56 KB), luga-0.2.3.tar.gz (4.46 KB)
v0.2.2 (Dec 3, 2021): luga-0.2.2-py3-none-any.whl (4.42 KB), luga-0.2.2.tar.gz (4.28 KB)
v0.2.1 (Nov 26, 2021): luga-0.2.1-py3-none-any.whl (4.07 KB), luga-0.2.1.tar.gz (3.95 KB)
v0.2.0 (Nov 26, 2021): luga-0.2.0-py3-none-any.whl (4.07 KB), luga-0.2.0.tar.gz (3.95 KB)
v0.1.8 (Nov 20, 2021): luga-0.1.8-py3-none-any.whl (3.88 KB), luga-0.1.8.tar.gz (3.76 KB)
v0.1.7 (Nov 17, 2021): luga-0.1.7-py3-none-any.whl (3.81 KB), luga-0.1.7.tar.gz (3.66 KB)

Owner

Prayson Wilfred Daniel

🍺 Data Scientist | 🍺 Automating Data Mining & Analysis With Python
