Fast, DB-backed pretrained word embeddings for natural language processing.

Last update: Nov 21, 2022

Overview

Embeddings

Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file to look up embeddings, the package queries a database, so it is fast to load and fast to query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop
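
To illustrate why a database-backed lookup avoids the cost of loading a whole embedding file, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and float packing are assumptions made for illustration only; they are not necessarily the schema this package uses.

```python
import sqlite3
import struct


def build_demo_db(vectors):
    """Create an in-memory SQLite table mapping each word to a packed float vector.

    The schema (table `embeddings`, columns `word` and `emb`) is hypothetical.
    """
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE embeddings (word TEXT PRIMARY KEY, emb BLOB)')
    rows = [(w, struct.pack('<{}f'.format(len(v)), *v)) for w, v in vectors.items()]
    conn.executemany('INSERT INTO embeddings VALUES (?, ?)', rows)
    conn.commit()
    return conn


def lookup(conn, word, default=None):
    """Fetch one word's vector; only that row is read, not the whole table."""
    row = conn.execute('SELECT emb FROM embeddings WHERE word = ?', (word,)).fetchone()
    if row is None:
        return default
    blob = row[0]
    # Each float32 occupies 4 bytes in the packed blob.
    return list(struct.unpack('<{}f'.format(len(blob) // 4), blob))


conn = build_demo_db({'canada': [0.5, -1.0, 2.0], 'toronto': [1.5, 0.25, -0.75]})
print(lookup(conn, 'canada'))    # [0.5, -1.0, 2.0]
print(lookup(conn, 'atlantis'))  # None
```

Because the word column is the primary key, each lookup is an indexed read of a single row, which is the general property that keeps per-word queries cheap regardless of vocabulary size.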

Installation

pip install embeddings  # from PyPI
pip install git+https://github.com/vzhong/embeddings.git  # from GitHub

Usage

On first use, the embeddings are downloaded to disk as a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups are queried directly against the database. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). The default location is probably undesirable if your home directory is on NFS, because it would slow down database queries significantly.
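
If your home directory is on NFS, one option is to point $EMBEDDINGS_ROOT at a local disk from within Python. The sketch below is a minimal example; the directory it uses is a hypothetical placeholder (the system temp directory is chosen only so the snippet runs anywhere), and setting the variable before importing the package is a conservative ordering rather than a documented requirement.

```python
import os
import tempfile

# Hypothetical local-disk location; substitute a persistent path on your machine.
local_root = os.path.join(tempfile.gettempdir(), 'embeddings')
os.makedirs(local_root, exist_ok=True)

# Set the variable before importing or constructing any embedding objects.
os.environ['EMBEDDINGS_ROOT'] = local_root
print(os.environ['EMBEDDINGS_ROOT'])
```

Exporting the variable in your shell before launching Python achieves the same thing.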

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
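
As a rough picture of what combining embeddings looks like, the sketch below concatenates the vectors returned by several toy embedding objects. ToyEmbedding and ToyConcat are hypothetical stand-ins written for this example; they assume that each emb call returns a list of floats and that the combined vector is the per-source vectors joined in order, which may differ in detail from the package's ConcatEmbedding.

```python
class ToyEmbedding:
    """Hypothetical stand-in for a pretrained embedding with a fixed dimension."""

    def __init__(self, table, d_emb):
        self.table = table
        self.d_emb = d_emb

    def emb(self, word):
        # Unknown words fall back to a zero vector in this toy version.
        return self.table.get(word, [0.0] * self.d_emb)


class ToyConcat:
    """Hypothetical stand-in that joins the vectors of several embeddings."""

    def __init__(self, embeddings):
        self.embeddings = embeddings

    def emb(self, word):
        out = []
        for e in self.embeddings:
            out.extend(e.emb(word))
        return out


a = ToyEmbedding({'canada': [1.0, 2.0, 3.0]}, d_emb=3)
b = ToyEmbedding({'canada': [4.0, 5.0]}, d_emb=2)
combined = ToyConcat([a, b])
print(combined.emb('canada'))   # [1.0, 2.0, 3.0, 4.0, 5.0]
print(combined.emb('toronto'))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The combined dimension is the sum of the individual dimensions, so downstream model input sizes need to account for every embedding in the list.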

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character n-gram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!

Owner

Victor Zhong
