Improving Representations via Similarities

Last update: Jan 08, 2023

Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, **sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there's an opportunity with regard to NaN y-values. We can simply choose which values to translate and which to ignore.
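One way to read this observation: in a multi-output setting each column of y is a separate label, and a NaN entry marks a label that is unknown for that row. The sketch below uses plain NumPy and a hypothetical helper, `valid_label_mask`, which is not part of embetter. It only illustrates how known labels could be selected per output while missing ones are skipped.

```python
import numpy as np


def valid_label_mask(y):
    """Return a boolean array that is True wherever a label is present.

    Hypothetical helper for illustration; not part of embetter.
    """
    y = np.asarray(y, dtype=float)
    return ~np.isnan(y)


# Three rows, two outputs; NaN marks a label that was never annotated.
y = np.array([
    [1.0, np.nan],
    [0.0, 1.0],
    [np.nan, 0.0],
])
mask = valid_label_mask(y)

# For each output column, keep only the rows that carry a usable label.
rows_per_output = [np.flatnonzero(mask[:, j]) for j in range(y.shape[1])]
```

Each output can then be trained on its own subset of rows, so a partially labelled row still contributes to the outputs it does have labels for.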

Comments

[WIP] Feature/progress bar
Fixes issue #20

[x] Adds progress bar to all text and image embedders.

[x] Tests for SentenceEncoder.

[ ] Use perfplot for progress bar?

[ ] Can we ensure fast NumPy vectorization while using a progress bar?
opened by CarloLepelaars 5
[BUG] `device` should be attribute on `SentenceEncoder`
The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.
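To see why a missing attribute surfaces as an AttributeError, here is a dependency-free sketch. The `get_params` function is a simplified imitation of the lookup described in the quote, not scikit-learn's actual code, and `BrokenEncoder` and `FixedEncoder` are hypothetical stand-ins for the encoder.

```python
import inspect


def get_params(estimator):
    """Collect constructor arguments by reading same-named attributes.

    Simplified imitation of the convention; not scikit-learn's code.
    """
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(estimator, name) for name in names}


class BrokenEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name  # `device` is accepted but never stored


class FixedEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device  # every argument now has a matching attribute


try:
    get_params(BrokenEncoder())
    broken_error = None
except AttributeError as exc:
    broken_error = str(exc)

fixed_params = get_params(FixedEncoder(device="cpu"))
```

Anything that introspects parameters this way, such as printing a pipeline or cloning an estimator, would hit the same failure on the broken variant.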

Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

Reproduction: Python 3.8 with embetter = "^0.2.2"

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    ...
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
opened by CarloLepelaars 4
Color Histograms - Additional Tricks

This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/
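As a rough idea of what such a feature could look like, the following sketch builds a per-channel color histogram with NumPy only. The function name `color_histogram`, the bin count, and the normalisation step are assumptions for illustration and do not come from the linked posts or from embetter.

```python
import numpy as np


def color_histogram(image, bins=8):
    """Concatenate per-channel histograms of an (H, W, C) uint8 image."""
    image = np.asarray(image)
    parts = []
    for channel in range(image.shape[-1]):
        counts, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        parts.append(counts)
    features = np.concatenate(parts).astype(float)
    # Normalise so images of different sizes give comparable vectors.
    return features / features.sum()


# Synthetic 4x4 image made of pure red pixels.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[..., 0] = 255
features = color_histogram(image, bins=4)
```

The result is a fixed-length vector (bins times channels), which is the shape a downstream scikit-learn model expects.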

opened by koaning 4
Support for word embeddings
Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

A filename to a local embedding file (e.g., glove.6b.100d.txt)

Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).

A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.
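A minimal sketch of the three ingredients listed above, under stated assumptions: `embed_text` and `POOLERS` are hypothetical names, the `vectors` dict stands in for an embedding file that has already been loaded, and the zero-vector fallback for unknown text is a choice made here. The default regex mirrors the default token pattern of scikit-learn's TfidfVectorizer.

```python
import re

import numpy as np

# Names of pooling functions mapped to NumPy reductions over the token axis.
POOLERS = {"mean": np.mean, "max": np.max, "sum": np.sum}


def embed_text(text, vectors, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Pool the vectors of known tokens into one fixed-size vector.

    Hypothetical sketch: `vectors` maps each word to a 1-D array.
    """
    dim = len(next(iter(vectors.values())))
    tokens = re.findall(token_pattern, text.lower())
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known tokens: fall back to a zero vector (an assumption).
        return np.zeros(dim)
    return POOLERS[pooling](np.stack(found), axis=0)


toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 2.0]),
}
mean_vec = embed_text("Good movie!", toy_vectors)
sum_vec = embed_text("Good movie!", toy_vectors, pooling="sum")
empty_vec = embed_text("???", toy_vectors)
```

Accepting a pooling name rather than a function keeps the parameter easy to set from a grid search, at the cost of a fixed menu of reductions.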

Stéphan
opened by stephantul 3
[FEATURE] SpaCyEmbedder
I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

SpaCy Docs on vector: https://spacy.io/api/doc#vector

Example code for single string:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector
opened by CarloLepelaars 2
`get_feature_names_out` for encoders

I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).
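What "just adding a new method" might look like is sketched below. `ToyEncoder`, its `n_features` and `prefix` arguments, and the generated naming scheme are all hypothetical. Only the method name, the optional `input_features` argument, and the array-of-strings return value follow the scikit-learn convention.

```python
import numpy as np


class ToyEncoder:
    """Stand-in for an embedder with a fixed output width (hypothetical)."""

    def __init__(self, n_features=4, prefix="emb"):
        self.n_features = n_features
        self.prefix = prefix

    def transform(self, X):
        # Placeholder: a real encoder would compute embeddings here.
        return np.zeros((len(X), self.n_features))

    def get_feature_names_out(self, input_features=None):
        # One generated name per output column; input names are ignored
        # because embedding dimensions do not map back to input columns.
        names = [f"{self.prefix}_{i}" for i in range(self.n_features)]
        return np.asarray(names, dtype=object)


names = ToyEncoder(n_features=3).get_feature_names_out()
```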

opened by CarloLepelaars 1
Remove the classification layer in timm models

I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

opened by kacperlukawski 1
xception mobilenet

https://keras.io/api/applications/

https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2

https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

opened by koaning 0

'SentenceEncoder' object has no attribute 'device'

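import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the text column and turns it into embeddings.
text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)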

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})

X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])

This code gives this error: 'SentenceEncoder' object has no attribute 'device'

opened by nicholas-dinicola 6

Releases(0.2.2)

0.2.2(Dec 20, 2022)

Adds GPU support for Sentence Encoders.
0.2.1(Dec 5, 2022)

Fixed some error messages related to installing extra dependencies.
0.2.0(Oct 10, 2022)

Fixes a bug related to the Timm vision models.
0.1.0(Sep 19, 2022)

The first original release. Should have enough components to be interesting.

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
