An Indexer that works out-of-the-box when you have less than 100K stored Documents

Last update: Mar 15, 2022

U100KIndexer

An Indexer that works out-of-the-box when you have less than 100K stored Documents. U100K means "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or 20 to 120 QPS for batch queries. Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as the storage backend and .match() to conduct nearest-neighbour search. It returns the full Documents as-is, so there is no need to chain it with another key-value indexer to retrieve Documents.
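As a rough illustration of what exhaustive nearest-neighbour search means, the sketch below scores a query against every stored embedding and keeps the closest ones. It is plain Python rather than the indexer's actual code: the names `cosine_distance`, `exhaustive_search`, and `stored_embeddings` are hypothetical, and cosine distance is an assumed metric chosen only for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],
    limit: int = 3,
) -> List[Tuple[str, float]]:
    """Score the query against every stored embedding and keep the closest."""
    scored = [(doc_id, cosine_distance(query, emb)) for doc_id, emb in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:limit]


# Hypothetical toy data: three stored embeddings keyed by Document id.
stored_embeddings = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}
matches = exhaustive_search([1.0, 0.0, 0.0], stored_embeddings, limit=2)
print(matches)
```

Because every stored embedding is visited, no true neighbour can be missed, which is why recall is highest; the cost is that query time grows with the number of stored Documents.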

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

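Slow query time beyond 100K stored Documents (about 0.66 s per single query at 200K and 1.29 s at 400K in the benchmark below)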

Performance

Indexing and query performance on 768-dim embeddings is as follows (all times in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
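The QPS range quoted in the overview can be recovered from the table above by dividing the query batch size by the time the batch took. The snippet below is a small sketch of that arithmetic using the table's numbers; `query_seconds`, `queries_per_second`, and `throughput` are names introduced here for illustration and are not part of benchmark.py.

```python
# Query times in seconds, copied from the table above:
# stored Documents -> {query batch size: seconds per batch}
query_seconds = {
    10_000: {1: 0.019, 8: 0.029, 64: 0.086},
    50_000: {1: 0.147, 8: 0.177, 64: 0.314},
    100_000: {1: 0.297, 8: 0.332, 64: 0.536},
    200_000: {1: 0.656, 8: 0.744, 64: 1.050},
    400_000: {1: 1.289, 8: 1.536, 64: 2.793},
}


def queries_per_second(batch_size: int, seconds: float) -> float:
    """Throughput of one batch: number of queries divided by elapsed time."""
    return batch_size / seconds


# Throughput for every table cell, rounded to one decimal place.
throughput = {
    stored: {
        size: round(queries_per_second(size, secs), 1)
        for size, secs in timings.items()
    }
    for stored, timings in query_seconds.items()
}

for stored, qps in throughput.items():
    print(f"{stored:>7} stored Documents: {qps}")
```

At 100K stored Documents this gives roughly 24 QPS for batches of 8 and roughly 119 QPS for batches of 64, which matches the 20 to 120 QPS range stated earlier.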

Tips

To change the workspace, pass it through metas:

U100KIndexer(metas={'workspace': './my'})

Or use .add(..., uses_metas={'workspace': './my'}) when you use it in a Flow.


Owner

Jina AI
