Vectorizers for a range of different data types

Overview

Vectorizers

There are a large number of machine learning tools for effectively exploring and working with data that is given as vectors (ideally with a defined notion of distance as well). There is also a large volume of data that does not come neatly packaged as vectors: it could be text data, variable-length sequence data (either numeric or categorical), dataframes of mixed data types, sets of point clouds, or more. Usually, one way or another, such data can be wrangled into vectors in a way that preserves some relevant properties of the original data. This library seeks to provide a suite of general purpose techniques for such wrangling, making it easier and faster for users to get various kinds of unstructured sequence data into vector formats for exploration and machine learning.

Why use Vectorizers?

Data wrangling can be tedious and error-prone, and it tends to be fragile when integrated into production pipelines. The vectorizers library aims to provide a set of easy to use tools for turning various kinds of unstructured sequence data into vectors. By following the scikit-learn transformer API we ensure that any of the vectorizer classes can be trivially integrated into existing sklearn workflows or pipelines. By keeping the vectorization approaches as general as possible (as opposed to specialising on very specific data types), we aim to ensure that a very broad range of data can be handled efficiently. Finally, we aim to provide robust techniques with sound mathematical foundations, favouring transparency in data processing and transformation over potentially more powerful but black-box approaches.
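
As a quick illustration of that compatibility, a vectorizer can drop straight into an sklearn pipeline. The sketch below is illustrative only: NgramVectorizer appears in the issues further down, but the pipeline, parameters, and toy data here are assumptions rather than canonical usage.

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from vectorizers import NgramVectorizer

# Toy data: a collection of token sequences.
token_sequences = [
    ["a", "b", "a", "c"],
    ["b", "c", "c", "a"],
    ["c", "a", "b", "b"],
]

pipeline = make_pipeline(
    NgramVectorizer(),             # token sequences -> sparse count vectors
    TruncatedSVD(n_components=2),  # counts -> dense low-dimensional vectors
)
embedding = pipeline.fit_transform(token_sequences)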

How to use Vectorizers

Quick start examples to be added soon ...

For further examples of using this library on text data, we recommend checking out the documentation written up in the EasyData reproducible data science framework by some of our colleagues over at: https://github.com/hackalog/vectorizers_playground

Installing

Vectorizers is designed to be easy to install: it is a pure Python module with relatively light requirements:

  • numpy
  • scipy
  • scikit-learn >= 0.22
  • numba >= 0.51

In the near future the package should be pip installable -- check back for updates:

pip install vectorizers

To manually install this package:

wget https://github.com/TutteInstitute/vectorizers/archive/master.zip
unzip master.zip
rm master.zip
cd vectorizers-master
python setup.py install
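
Alternatively, pip can install straight from the repository (assuming git is available); this should be equivalent to the manual steps above:

pip install git+https://github.com/TutteInstitute/vectorizers.git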

Help and Support

This project is still young and the documentation is still growing. In the meantime, please open an issue and we will try to provide any help and guidance that we can. Please also check the docstrings in the code, which provide some description of the parameters.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. We would greatly appreciate the contribution of tutorial notebooks applying vectorizer tools to diverse or interesting datasets. If you find vectorizers useful for your data, please consider contributing an example showing how it can apply to the kind of data you work with!

To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

License

The vectorizers package is 2-clause BSD licensed.

Comments
  • Added SignatureVectorizer (iisignature)

    I've implemented SignatureVectorizer, which returns the path signatures for a collection of paths.

    This vectorizer essentially wraps the iisignature package such that it fits into the standard sklearn style fit_transform pipeline. While it does require iisignature, the imports are written such that the rest of the library can still be used if the user does not have iisignature installed.
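
    The guarded-import pattern described above looks roughly like the following. This is a sketch of the idea only, not the actual module code; the class body and error message are illustrative.

    from sklearn.base import BaseEstimator, TransformerMixin

    try:
        import iisignature
    except ImportError:
        iisignature = None  # the rest of the library remains importable

    class SignatureVectorizer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            # Fail only when the vectorizer is actually used.
            if iisignature is None:
                raise ImportError(
                    "SignatureVectorizer requires the iisignature package"
                )
            return self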

    For more details on the path signature technique, I've found this paper quite instructive: A Primer on the Signature Method in Machine Learning (Chevyrev & Kormilitzin)

    opened by jh83775 6
  • Add Compression Vectorizers

    Add Lempel-Ziv and Byte Pair Encoding based vectorizers, allowing for vectorization of non-tokenized strings.

    Also includes a basic outline of a distribution vectorizer, but this may require something more powerful than pomegranate.

    opened by lmcinnes 3
  • SlidingWindowTransformer for working with time-series like data

    This is essentially just a Takens embedding, but packaged with tools for working with classical time-series data. I tried to allow a reasonable range of options/flexibility in how you sample windows, but would welcome any suggestions for further options. In principle it might be possible to "learn" good window parameters from the input data (right now fit does nothing beyond verifying parameters), but I don't quite know what the right way to do that would be.
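
    The underlying idea (not the transformer's actual implementation) is a plain delay embedding: each window of a series becomes one vector. A minimal numpy sketch:

    import numpy as np

    def sliding_windows(series, width, stride=1):
        # Each row is one window, i.e. one point in the embedding space.
        n_windows = (len(series) - width) // stride + 1
        return np.stack(
            [series[i * stride : i * stride + width] for i in range(n_windows)]
        )

    sliding_windows(np.arange(8), width=4, stride=2)
    # array([[0, 1, 2, 3],
    #        [2, 3, 4, 5],
    #        [4, 5, 6, 7]])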

    opened by lmcinnes 3
  • Count feature compressor

    It turns out that our data-prep tricks before and after SVDs for count-based data are generically useful. I tried applying them to an info-weighted bag-of-words on 20-newsgroups instead of just a straight SVD and ...

    [results image]

    I decided this is going to be too generically useful not to turn into a standard transformer. In due course we can potentially use this as part of a pipeline for word vectors instead of the reduce_dimension method we have now.

    opened by lmcinnes 3
  • [Question] Vectorizing Terabyte-order data

    Hello and thank you for a great package!

    I was wondering whether (and how) ApproximateWassersteinVectorizer would be able to scale up to terabyte-order data or whether you had any pointers for dealing with data of that scale.

    opened by cakiki 2
  • Add more tests; start testing distances

    Started testing most of the vectorizers at a very basic level. Began a testing module for the distances. I would welcome help writing extra tests for those.

    opened by lmcinnes 2
  • The big cooccurrence refactor

    The big refactoring is basically complete. The only drawback is that your kernel functions (if you are doing multiple window sizes) need to all be the same. This is an issue with numba lists of functions. It will take some work to add this back in, I think... it can be another PR.

    opened by cjweir 1
  • Ensure contiguous arrays in optimal transport vectorizers

    If a user passes in 'F'-layout (column-major) 2D arrays of vectors, it can cause an error from numba that is hard for a user to decode. Remedy this by simply ensuring all the arrays being added are 'C' (row-major) layout.
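
    The fix amounts to something like the following (a sketch; raw_vectors is a placeholder name):

    import numpy as np

    raw_vectors = np.asarray([[1.0, 2.0], [3.0, 4.0]], order="F")  # 'F' layout
    vectors = np.ascontiguousarray(raw_vectors)  # copies into 'C' layout if needed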

    opened by lmcinnes 1
  • added EdgeListVectorizer

    Added a new class for vectorizing data in the form of (row_name, column_name, count) triples.
    Added some unit tests. Documentation and expanded functionality to come at a later date.
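
    The kind of input this targets is a list of such triples; a hypothetical example (the names are illustrative, not the class's documented API):

    # (row_name, column_name, count) triples
    edge_list = [
        ("user_1", "item_a", 3),
        ("user_1", "item_b", 1),
        ("user_2", "item_a", 2),
    ]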

    opened by jc-healy 1
  • Add named function kernels for sliding windows

    For now just a couple of simple test kernels: one for count time-series anomaly detection, and another for time-interval time-series anomaly detection, which is similar. Enough basics to test that the named-function code path works.

    opened by lmcinnes 1
  • SequentialDifferenceTransformer, function kernels

    Add a SequentialDifferenceTransformer as a simple-to-use approach to generating inter-arrival times or similar.

    Allow the SlidingWindowTransformer to take function kernels (numba-jitted, obviously) for much greater flexibility in how kernels perform.
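
    A function kernel in this style might look like the following sketch (the exact signature the transformer expects is not documented here, so this is only illustrative):

    import numba
    import numpy as np

    @numba.njit()
    def mean_kernel(window):
        # Summarize each sliding window by its single mean value.
        return np.mean(window)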

    opened by lmcinnes 1
  • InformationWeightTransform hates empty columns

    We seem to crash kernels when we pass a sparse matrix with an empty column to InformationWeightTransformer's fit_transform.

    It comes up when we've got a fixed token dictionary but our training data is missing some of our vocabulary:

    vect = NgramVectorizer(token_dictionary=token_dictionary).fit_transform(eventid_list)
    vect1 = InformationWeightTransformer().fit_transform(vect)
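
    A hypothetical workaround until this is fixed is to drop the all-zero columns before transforming, with vect being the sparse matrix above (note this breaks the column correspondence with the fixed token dictionary):

    import numpy as np

    nonempty = np.flatnonzero(np.asarray(vect.sum(axis=0)).ravel())
    vect1 = InformationWeightTransformer().fit_transform(vect[:, nonempty])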
    
    opened by jc-healy 0
  • [BUG] Can't print TokenCooccurrenceVectorizer object

    Hello!

    Not sure this is actually a bug, but it seemed a bit odd so I figured I'd report it. I just trained a vectorizer using the following:

    %%time
    word_vectorizer = vectorizers.TokenCooccurrenceVectorizer(
        min_document_occurrences=2,
        window_radii=20,          
        window_functions='variable',
        kernel_functions='geometric',            
        n_iter = 3,
        normalize_windows=True,
    ).fit(subset['tokenized'])
    

    When I try to display or print it in a Jupyter Notebook cell I get the following error (actually repeated a few times):

    AttributeError: 'TokenCooccurrenceVectorizer' object has no attribute 'coo_initial_memory'
    
    bug 
    opened by cakiki 3
  • Contributions

    We would greatly appreciate the contribution of tutorial notebooks applying vectorizer tools to diverse or interesting datasets

    Hi! I was wondering what sort of contributions/tutorials you were looking for, or whether you had more concrete contribution wishes.

    opened by cakiki 2
  • Needed to downgrade `numpy` and reinstall `llvmlite` & `numba`

    I tried installing vectorizers on a fresh pyenv and I got this error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "vectorizers-master/vectorizers/__init__.py", line 1, in <module>
        from .token_cooccurrence_vectorizer import TokenCooccurrenceVectorizer
      File "vectorizers/token_cooccurrence_vectorizer.py", line 1, in <module>
        from .ngram_vectorizer import ngrams_of
      File "vectorizers/ngram_vectorizer.py", line 2, in <module>
        import numba
      File "numba-0.54.1-py3.9-macosx-11.3-x86_64.egg/numba/__init__.py", line 19, in <module>
        from numba.core import config
      File "numba-0.54.1-py3.9-macosx-11.3-x86_64.egg/numba/core/config.py", line 16, in <module>
        import llvmlite.binding as ll
      File "llvmlite-0.37.0-py3.9.egg/llvmlite/binding/__init__.py", line 4, in <module>
        from .dylib import *
      File "llvmlite-0.37.0-py3.9.egg/llvmlite/binding/dylib.py", line 3, in <module>
        from llvmlite.binding import ffi
      File "llvmlite-0.37.0-py3.9.egg/llvmlite/binding/ffi.py", line 191, in <module>
        raise OSError("Could not load shared object file: {}".format(_lib_name))
    OSError: Could not load shared object file: libllvmlite.dylib
    

    This was resolvable for me by uninstalling and reinstalling llvmlite & numba after ensuring numpy<1.21, e.g.

    pip install numpy==1.20
    pip uninstall llvmlite
    pip uninstall numba
    pip install llvmlite
    pip install numba
    

    This is obviously not a big deal, but in case others bump into this, maybe a ref in the README can save them a google. 🤷

    opened by BBischof 2
  • Set up ``__getstate__`` and ``__setstate__`` and add serialization tests

    Most things should just pickle; however, numba-based attributes (functions, and types such as numba.typed.List) need special handling, with __getstate__ and __setstate__ methods to handle conversion of the numba bits and pieces.
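
    A minimal sketch of the pattern (the attribute name here is hypothetical):

    import numba.typed

    class ExampleVectorizer:
        def __getstate__(self):
            state = self.__dict__.copy()
            # numba.typed.List does not pickle; store a plain list instead.
            state["_kernel_args"] = list(state["_kernel_args"])
            return state

        def __setstate__(self, state):
            # Rebuild the typed list from the plain list on unpickling.
            typed_args = numba.typed.List()
            for item in state["_kernel_args"]:
                typed_args.append(item)
            state["_kernel_args"] = typed_args
            self.__dict__.update(state)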

    To aid in this we should have tests that all classes can be pickled and unpickled successfully. This will help highlight any shortcomings.

    See also #62

    enhancement 
    opened by lmcinnes 0
  • Serialization

    Have you thought about adding serialization/saving to disk? At the very least, would you point out what to store for, say, the ngram vectorizer?

    Thank you very much for the examples on training word/document vectors using this and comparing to USE! Giving it a try now on my data and it looks pretty good! Thank you!

    opened by stevemarin 3