Search with BERT vectors in Solr and Elasticsearch

Overview

BERT models with Solr and Elasticsearch

Demo videos:

  • streamlit-search_demo_solr-2021-05-13-10-05-91.mp4
  • streamlit-search_demo_elasticsearch-2021-05-14-22-05-55.mp4

This code is described in the following Medium stories, taking one step at a time:

Neural Search with BERT and Solr (August 18, 2020)

Fun with Apache Lucene and BERT Embeddings (November 15, 2020)

Speeding up BERT Search in Elasticsearch (March 15, 2021)

Ask Me Anything about Vector Search (June 20, 2021) This blog post gives the answers to the 3 most interesting questions asked during the AMA session at Berlin Buzzwords 2021. The video recording is available here: https://www.youtube.com/watch?v=blFe2yOD1WA

(Images: Bert in a Solr hat; Bert with an Elasticsearch burger)


Tech stack:

  • bert-as-service
  • Hugging Face
  • solr / elasticsearch
  • streamlit
  • Python 3.7

Code for dealing with Solr has been copied from the great (and highly recommended) https://github.com/o19s/hello-ltr project.

Install tensorflow

pip install tensorflow==1.15.3

If you try to install tensorflow 2.3, the bert service will fail to start; there is an existing issue about it.

If you encounter issues with the above installation, consider installing the full list of packages:

pip install -r requirements_freeze.txt

Let's install bert-as-service components

pip install bert-serving-server

pip install bert-serving-client

Download a pre-trained BERT model

Download the model into the bert-model/ directory in this project. I have chosen uncased_L-12_H-768_A-12.zip for this experiment; unzip it there.

Now let's start the BERT service

bash start_bert_server.sh
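
The script essentially wraps a single bert-serving-start call; a minimal equivalent, assuming the model downloaded above was unzipped into bert-model/ (the exact flags in start_bert_server.sh may differ), looks like:

    bert-serving-start -model_dir bert-model/uncased_L-12_H-768_A-12 -num_worker=1 -max_seq_len=500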

Run a sample bert client

python src/bert_client.py

to compute vectors for 3 sample sentences:

    Bert vectors for sentences ['First do it', 'then do it right', 'then do it better'] : [[ 0.13186474  0.32404128 -0.82704437 ... -0.3711958  -0.39250174
      -0.31721866]
     [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355
      -0.11345179]
     [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366  -0.39310536
       0.07640187]]
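
For reference, the client side is only a few lines; a minimal sketch of roughly what src/bert_client.py does (assuming the BERT server started above is reachable on localhost):

    from bert_serving.client import BertClient

    # connect to the locally running bert-as-service server
    bc = BertClient()

    sentences = ['First do it', 'then do it right', 'then do it better']
    # encode() returns a numpy array of shape (3, 768) for the chosen model
    vectors = bc.encode(sentences)
    print('Bert vectors for sentences', sentences, ':', vectors)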

This sets the stage for our further experiments with Solr.

Dataset

This is by far the key ingredient of every experiment. You want to find an interesting collection of texts that is suitable for semantic-level search. Well, maybe all texts are. I have chosen a collection of abstracts from DBPedia, which I downloaded from here: https://wiki.dbpedia.org/dbpedia-version-2016-04 and placed into the data/dbpedia directory in bz2 format. You don't need to extract this file onto disk: the provided code will read directly from the compressed file.
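
Reading straight from the compressed dump is straightforward with Python's bz2 module; a minimal sketch (the parsing of the N-Triples lines is omitted here):

    import bz2

    # iterate over the DBPedia abstracts dump without unpacking it to disk
    with bz2.open('data/dbpedia/long_abstracts_en.ttl.bz2', 'rt', encoding='utf-8') as f:
        for line in f:
            if line.startswith('#'):
                continue  # skip N-Triples header/comment lines
            # each remaining line holds one <subject> <predicate> "abstract"@en . triple
            print(line.strip())
            break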

Preprocessing and Indexing: Solr

Before running preprocessing / indexing, you need to configure the vector plugin, which allows indexing and querying vector data. You can find the plugin for Solr 8.x here: https://github.com/DmitryKey/solr-vector-scoring

After the plugin's jar has been added, register its query parser in solrconfig.xml. The ready-made solrconfig linked below contains the exact snippet; it boils down to something like:

    <queryParser name="vp" class="com.github.saaay71.solr.VectorQParserPlugin" />


The schema also requires an addition: a field of type VectorField is needed in order to index the vector data. The ready-made schema linked below contains the full definitions; the general shape is a payload-carrying text field along these lines:

    <fieldType name="VectorField" class="solr.TextField" indexed="true" stored="true" termVectors="true" termPositions="true" termPayloads="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
      </analyzer>
    </fieldType>

    <field name="vector" type="VectorField" indexed="true" stored="true"/>


Find ready-made schema and solrconfig here: https://github.com/DmitryKey/bert-solr-search/tree/master/solr_conf

Let's preprocess the downloaded abstracts and index them in Solr. First, execute the following command to start Solr:

bin/solr start -m 2g
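
For orientation, here is a stripped-down sketch of what src/index_dbpedia_abstracts_solr.py does (the collection and field names, the batch size and the payload-style vector serialization are assumptions; the encoding the plugin actually expects is described in its README):

    import bz2
    import requests
    from bert_serving.client import BertClient

    SOLR_UPDATE_URL = 'http://localhost:8983/solr/dbpedia/update?commit=true'
    bc = BertClient(check_length=False)

    def to_payload_string(vector):
        # serialize the embedding as "position|value" tokens for the payload field
        return ' '.join(f'{i}|{v:.6f}' for i, v in enumerate(vector))

    docs = []
    with bz2.open('data/dbpedia/long_abstracts_en.ttl.bz2', 'rt', encoding='utf-8') as f:
        for doc_id, line in enumerate(f):
            abstract = line.strip()  # the real script parses the N-Triples line first
            embedding = bc.encode([abstract])[0]
            docs.append({'id': doc_id, 'abstract': abstract, 'vector': to_payload_string(embedding)})
            if len(docs) == 100:
                requests.post(SOLR_UPDATE_URL, json=docs)  # "Flushing 100 docs"
                docs = []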

If during processing you notice:

<...>/bert-solr-search/venv/lib/python3.7/site-packages/bert_serving/client/__init__.py:299: UserWarning: some of your sentences have more tokens than "max_seq_len=500" set on the server, as consequence you may get less-accurate or truncated embeddings.
here is what you can do:
- disable the length-check by create a new "BertClient(check_length=False)" when you do not want to display this warning
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)

The index_dbpedia_abstracts_solr.py script will output statistics:

Maximum tokens observed per abstract: 697
Flushing 100 docs
Committing changes
All done. Took: 82.46466588973999 seconds

We know how many abstracts there are:

bzcat data/dbpedia/long_abstracts_en.ttl.bz2 | wc -l
5045733

Preprocessing and Indexing: Elasticsearch

This project implements several ways to index vector data:

  • src/index_dbpedia_abstracts_elastic.py: vanilla Elasticsearch, using the dense_vector data type
  • src/index_dbpedia_abstracts_elastiknn.py: Elastiknn plugin, which implements its own data type; I used elastiknn_dense_float_vector
  • src/index_dbpedia_abstracts_opendistro.py: OpenDistro for Elasticsearch, which uses nmslib to build Hierarchical Navigable Small World (HNSW) graphs during indexing

Each indexer relies on a ready-made Elasticsearch mapping file, which can be found in the es_conf/ directory.
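
For the vanilla variant, the mapping is just a standard Elasticsearch dense_vector field; a minimal sketch of creating such an index with the Python client (the index and field names here are illustrative, the real mappings live in es_conf/):

    from elasticsearch import Elasticsearch

    es = Elasticsearch('http://localhost:9200')

    mapping = {
        "mappings": {
            "properties": {
                "abstract": {"type": "text"},
                # 768 matches the dimensionality of the BERT embeddings used in this project
                "vector": {"type": "dense_vector", "dims": 768}
            }
        }
    }
    es.indices.create(index='dbpedia', body=mapping)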

Preprocessing and Indexing: GSI APU

In order to use the GSI APU solution, a user needs to produce two files: a numpy 2D array with vectors of the desired dimension (768 in my case), and a pickle file with document ids matching the document ids of the said vectors in Elasticsearch.

After these data files are uploaded to the GSI server, the same data gets indexed in Elasticsearch. The APU-powered search is performed on up to 3 Leda-G PCIe APU boards. Since I ran into indexing performance issues with the bert-as-service solution, I decided to take the SBERT approach from Hugging Face to prepare the numpy and pickle array files. This allowed me to index into Elasticsearch freely at any time, without waiting for days. You can use the script below to do this on DBPedia data; it allows choosing between:

EmbeddingModel.HUGGING_FACE_SENTENCE (SBERT)
EmbeddingModel.BERT_UNCASED_768 (bert-as-service)

To generate the numpy and pickle files, use the following script: src/create_gsi_files.py. This script produces two files:

data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors.npy
data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors_docids.pkl

Both files are perfectly suitable for indexing with Solr and Elasticsearch.
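
For reference, a minimal sketch of how such a pair of files can be produced with SBERT (the model name, file paths and the toy list of texts are illustrative; src/create_gsi_files.py does this for the real DBPedia abstracts):

    import pickle
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # encode texts with an SBERT model that outputs 768-dimensional vectors
    model = SentenceTransformer('bert-base-nli-mean-tokens')
    abstracts = ['First do it', 'then do it right', 'then do it better']
    doc_ids = list(range(len(abstracts)))

    vectors = model.encode(abstracts)  # numpy array of shape (len(abstracts), 768)

    np.save('data/vectors.npy', vectors)
    with open('data/vectors_docids.pkl', 'wb') as f:
        pickle.dump(doc_ids, f)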

To test the GSI plugin, you will need to upload these files to the GSI server, where they are loaded both into Elasticsearch and onto the APU.

Running the BERT search demo

There are two streamlit demos for running BERT search against Solr and Elasticsearch. Each demo compares BERT-based search with BM25-based search side by side. The following assumes that you have bert-as-service up and running (if not, launch it with bash start_bert_server.sh) and either Elasticsearch or Solr running with an index containing a field with embeddings.

To run a demo, execute the following on the command line from the project root:

# for experiments with Elasticsearch
streamlit run src/search_demo_elasticsearch.py

# for experiments with Solr
streamlit run src/search_demo_solr.py
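
On the Elasticsearch side, the vector half of the comparison can be served by a standard script_score query over the dense_vector field; a minimal sketch of such a query (index and field names are assumptions and must match your mapping):

    from bert_serving.client import BertClient
    from elasticsearch import Elasticsearch

    bc = BertClient(check_length=False)
    es = Elasticsearch('http://localhost:9200')

    # embed the user query with the same model used at indexing time
    query_vector = bc.encode(['who invented the telephone'])[0].tolist()

    body = {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # cosineSimilarity may be negative, so shift by +1.0 to keep scores non-negative
                    "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }

    response = es.search(index='dbpedia', body=body, size=10)
    for hit in response['hits']['hits']:
        print(hit['_score'], hit['_source'].get('abstract', '')[:80])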
Owner
Dmitry Kan
I build search engines. Host of the Vector Podcast: https://www.youtube.com/channel/UCCIMPfR7TXyDvlDRXjVhP1g