Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Roaster - this gui app + program bundle roasts.

Roaster - this gui app + program bundle roasts.

Harsh ADV) 1 Jan 04, 2022
A youtube videos or channels tag finder python module

A youtube videos or channels tag finder python module

Fayas Noushad 4 Dec 03, 2021
Sunflower-farmers-automated-bot - Sunflower Farmers NFT Game automated bot.IT IS NOT a cheat or hack bot

Sunflower-farmers-auto-bot Sunflower Farmers NFT Game automated bot.IT IS NOT a

Arthur Alves 17 Nov 09, 2022
A simple Python TDLib wrapper

Telegram Forwarder App Description pywtdlib (Python Wrapper TDLib) is a simple synchronous Python wrapper that makes you easy to create new Python Tel

Álvaro Fernández 2 Jan 04, 2023
WikipediaBot from mohirdev.uz

wiki-bot WikipediaBot from mohirdev.uz Requirements wikipedia aiogram Installing wiki/aiogram pip install wikipedia pip install aiogram

Muhammad Ali 5 Sep 28, 2022
💻 A fully functional local AWS cloud stack. Develop and test your cloud & Serverless apps offline!

LocalStack - A fully functional local AWS cloud stack LocalStack provides an easy-to-use test/mocking framework for developing Cloud applications. Cur

LocalStack 45.3k Jan 02, 2023
A Telegram Message Manager Bot by @AbirHasan2005

Messages-Manager-Bot A Telegram Message Manager Bot by @AbirHasan2005. This Bot can delete specific type of messages from Group. I specially use for @

Abir Hasan 32 Nov 12, 2022
Fun telegram bot =)

Recolor Bot About Fun telegram bot, that can change your hair color. Preparations Update package lists sudo apt-get update; Make sure Git and docker-c

Just Koala 4 Jul 09, 2022
Select random winners for a Twitter giveaway

twitter_picker Select random winners for a Twitter giveaway Once the Twitter giveaway (or airdrop) is closed, assign a number to each participant. The

Michael Rawner 1 Dec 11, 2021
Yuichixspam - TLEEGRAM SPAM BOT For Python

𝒀𝑼𝑰𝑪𝑯𝑰 ✘ 𝑺𝑷𝑨𝑴 𝑩𝑶𝑻ノ 🚀 Deploy on Heroku (https://heroku.com/deploy?t

MOHIT X PANDIT 6 Jan 30, 2022
Cedric Owens 16 Sep 27, 2022
A Python wrapper for the Yelp API v2

python-yelp-v2 A Python wrapper for the Yelp API v2. The structure for this was inspired by the python-twitter library, and some internal methods are

Matthew Conlen 12 Oct 24, 2017
Tiktok-bot - A tiktok bot with python

Install the requirements pip install selenium pip install pyfiglet==0.7.5 How ca

Ukis 5 Aug 23, 2022
Rust UserBot, Telegram istifadəsini asanlaşdıran bir proyektdir.

RUST USERBOT 🇦🇿 Rust UserBot, Telegram istifadəsini asanlaşdıran bir proyektdir. Qurulum Heroku Serverə qurulum git clone https://github.com/rustres

1 Oct 25, 2021
Powerful Telegram userbot to turn your PROFILE PICTURE & LAST NAME into a real time clock & to change your BIO automatically.

DATE_TIME_USERBOT-TeLeTiPs Powerful Telegram userbot to turn your PROFILE PICTURE & LAST NAME into a real time clock & to change your BIO automaticall

53 Jan 05, 2023
An EmbedBuilder in Python for discord.py embeds. Pip Module.

Discord.py-MaxEmbeds An EmbedBuilder for Discord bots in Python. You need discord.py to use this module. Installation Step 1 First you have to install

Max Tischberger 6 Jan 13, 2022
A quick-and-dirty script to scrape the daily menu of Leipzig University Mensa and send it to a telegram channel.

Feed me Mensa UL A quick-and-dirty script to scrape the daily menu of Leipzig University Mensa and send it to a telegram channel. For food and cat lov

3 Apr 08, 2022
This Telegram bot allows you to create direct links with pre-filled text to WhatsApp Chats

WhatsApp API Bot Telegram bot to create direct links with pre-filled text for WhatsApp Chats You can check our bot here. The bot is based on the API p

RobotTrick • רובוטריק 17 Aug 20, 2022
💻 Discord-Auto-Translate-Bot - If you type in the chat room, it automatically translates.

💻 Discord-Auto-Translate-Bot - If you type in the chat room, it automatically translates.

LeeSooHyung 2 Jan 20, 2022
Materials to reproduce our findings in our stories, "Amazon Puts Its Own 'Brands' First Above Better-Rated Products" and "When Amazon Takes the Buy Box, it Doesn’t Give it up"

Amazon Brands and Exclusives This repository contains code to reproduce the findings featured in our story "Amazon Puts Its Own 'Brands' First Above B

The Markup 60 Nov 11, 2022