Automatically Build Multiple ML Models with a Single Line of Code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.

Overview

Auto-ViML

banner

Downloads Downloads Downloads standard-readme compliant Python Versions PyPI Version PyPI License

Automatically Build Variant Interpretable ML models fast! Auto_ViML is pronounced "auto vimal" (autovimal logo created by Sanket Ghanmare)

NEW FEATURES in this version are:
1. SMOTE -> now we use SMOTE for imbalanced data. Just set Imbalanced_Flag = True in input below
2. Auto_NLP: It automatically detects Text variables and does NLP processing on those columns
3. Date Time Variables: It automatically detects date time variables and adds extra features
4. Feature Engineering: Now you can perform feature engineering with the available featuretools library.

To upgrade to the best, most stable and full-featured version (anything over > 0.1.600), do one of the following:
Use $ pip install autoviml --upgrade --ignore-installed
or pip install git+https://github.com/AutoViML/Auto_ViML.git

Table of Contents

Background

Read this Medium article to learn how to use Auto_ViML.

Auto_ViML was designed for building High Performance Interpretable Models with the fewest variables. The "V" in Auto_ViML stands for Variable because it tries multiple models with multiple features to find you the best performing model for your dataset. The "i" in Auto_ViML stands for "interpretable" since Auto_ViML selects the least number of features necessary to build a simpler, more interpretable model. In most cases, Auto_ViML builds models with 20-99% fewer features than a similar performing model with all included features (this is based on my trials. Your experience may vary).

Auto_ViML is every Data Scientist's model assistant that:

  1. Helps you with data cleaning: you can send in your entire dataframe as is and Auto_ViML will suggest changes to help with missing values, formatting variables, adding variables, etc. It loves dirty data. The dirtier the better!
  2. Assists you with variable classification: Auto_ViML classifies variables automatically. This is very helpful when you have hundreds if not thousands of variables since it can readily identify which of those are numeric vs categorical vs NLP text vs date-time variables and so on.
  3. Performs feature reduction automatically. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. Auto_ViML uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
  4. Produces model performance results as graphs automatically. Just set verbose = 1 (or) 2
  5. Handles text, date-time, structs (lists, dictionaries), numeric, boolean, factor and categorical variables all in one model using one straight process.
  6. Allows you to use the featuretools library to do Feature Engineering.
    See example below.
    Let's say you have a few numeric features in your data called "preds". You can 'add','subtract','multiply' or 'divide' these features among themselves using this module. You can optionally send an ID column in the data so that the index ordering is preserved.
    
    from autoviml.feature_engineering import feature_engineering
    print(df[preds].shape)
    dfmod = feature_engineering(df[preds],['add'],'ID')
    print(dfmod.shape)
Auto_ViML is built using scikit-learn, Nnumpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "XGBoost", "Imbalanced-Learn", "CatBoost", and "featuretools" library. We use "SHAP" library for interpretability.
But if you don't have these libraries, Auto_ViML will install those for you automatically.

Install

Prerequsites:

To clone Auto_ViML, it is better to create a new environment, and install the required dependencies:

To install from PyPi:

conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
pip install autoviml
or
pip install git+https://github.com/AutoViML/Auto_ViML.git

To install from source:

cd <AutoVIML_Destination>
git clone [email protected]:AutoViML/Auto_ViML.git
# or download and unzip https://github.com/AutoViML/Auto_ViML/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
cd Auto_ViML
pip install -r requirements.txt

Usage

In the same directory, open a Jupyter Notebook and use this line to import the .py file:

from autoviml.Auto_ViML import Auto_ViML

Load a data set (any CSV or text file) into a Pandas dataframe and split it into Train and Test dataframes. If you don't have a test dataframe, you can simple assign the test variable below to '' (empty string):

model, features, trainm, testm = Auto_ViML(
    train,
    target,
    test,
    sample_submission,
    hyper_param="GS",
    feature_reduction=True,
    scoring_parameter="weighted-f1",
    KMeans_Featurizer=False,
    Boosting_Flag=False,
    Binning_Flag=False,
    Add_Poly=False,
    Stacking_Flag=False,
    Imbalanced_Flag=False,
    verbose=0,
)

Finally, it writes your submission file to disk in the current directory called mysubmission.csv. This submission file is ready for you to show it clients or submit it to competitions. If no submission file was given, but as long as you give it a test file name, it will create a submission file for you named mySubmission.csv. Auto_ViML works on any Multi-Class, Multi-Label Data Set. So you can have many target labels. You don't have to tell Auto_ViML whether it is a Regression or Classification problem.

Tips for using Auto_ViML:

  1. For Classification problems and imbalanced classes, choose scoring_parameter="balanced_accuracy". It works better.
  2. For Imbalanced Classes (<5% samples in rare class), choose "Imbalanced_Flag"=True. You can also set this flag to True for Regression problems where the target variable might have skewed distributions.
  3. For Multi-Label dataset, the target input target variable can be sent in as a list of variables.
  4. It is recommended that you first set Boosting_Flag=None to get a Linear model. Once you understand that, then you can try to set Boosting_Flag=False to get a Random Forest model. Finally, try Boosting_Flag=True to get an XGBoost model. This is the order that we recommend in order to use Auto_ViML.
  5. Finally try Boosting_Flag="CatBoost" to get a complex but high performing model.
  6. Binning_Flag=True improves a CatBoost model since it adds to the list of categorical vars in data
  7. KMeans_featurizer=True works well in NLP and CatBoost models since it creates cluster variables
  8. Add_Poly=3 improves certain models where there is date-time or categorical and numeric variables
  9. feature_reduction=True is the default and works best. But when you have <10 features in data, set it to False
  10. Do not use Stacking_Flag=True with Linear models since your results may not look great.
  11. Use Stacking_Flag=True only for complex models and as a last step with Boosting_Flag=True or CatBoost
  12. Always set hyper_param ="RS" as input since it runs faster than GridSearchCV and gives better results!
  13. KMeans_Featurizer=True does not work well for small data sets. Use it for data sets > 10,000 rows.
  14. Finally Auto_ViML is meant to be a baseline or challenger solution to your data set. So use it for making quick models that you can compare against or in Hackathons. It is not meant for production!

API

Arguments

  • train: could be a datapath+filename or a dataframe. It will detect which is which and load it.
  • test: could be a datapath+filename or a dataframe. If you don't have any, just leave it as "".
  • submission: must be a datapath+filename. If you don't have any, just leave it as empty string.
  • target: name of the target variable in the data set.
  • sep: if you have a spearator in the file such as "," or "\t" mention it here. Default is ",".
  • scoring_parameter: if you want your own scoring parameter such as "f1" give it here. If not, it will assume the appropriate scoring param for the problem and it will build the model.
  • hyper_param: Tuning options are GridSearch ('GS') and RandomizedSearch ('RS'). Default is 'RS'.
  • feature_reduction: Default = 'True' but it can be set to False if you don't want automatic feature_reduction since in Image data sets like digits and MNIST, you get better results when you don't reduce features automatically. You can always try both and see.
  • KMeans_Featurizer
    • True: Adds a cluster label to features based on KMeans. Use for Linear.
    • False (default) For Random Forests or XGB models, leave it False since it may overfit.
  • Boosting Flag: you have 4 possible choices (default is False):
    • None This will build a Linear Model
    • False This will build a Random Forest or Extra Trees model (also known as Bagging)
    • True This will build an XGBoost model
    • CatBoost This will build a CatBoost model (provided you have CatBoost installed)
  • Add_Poly: Default is 0 which means do-nothing. But it has three interesting settings:
    • 1 Add interaction variables only such as x1x2, x2x3,...x9*10 etc.
    • 2 Add Interactions and Squared variables such as x12, x22, etc.
    • 3 Adds both Interactions and Squared variables such as x1x2, x1**2,x2x3, x2**2, etc.
  • Stacking_Flag: Default is False. If set to True, it will add an additional feature which is derived from predictions of another model. This is used in some cases but may result in overfitting. So be careful turning this flag "on".
  • Binning_Flag: Default is False. It set to True, it will convert the top numeric variables into binned variables through a technique known as "Entropy" binning. This is very helpful for certain datasets (especially hard to build models).
  • Imbalanced_Flag: Default is False. If set to True, it will use SMOTE from Imbalanced-Learn to oversample the "Rare Class" in an imbalanced dataset and make the classes balanced (50-50 for example in a binary classification). This also works for Regression problems where you have highly skewed distributions in the target variable. Auto_ViML creates additional samples using SMOTE for Highly Imbalanced data.
  • verbose: This has 3 possible states:
    • 0 limited output. Great for running this silently and getting fast results.
    • 1 more charts. Great for knowing how results were and making changes to flags in input.
    • 2 lots of charts and output. Great for reproducing what Auto_ViML does on your own.

Return values

  • model: It will return your trained model
  • features: the fewest number of features in your model to make it perform well
  • train_modified: this is the modified train dataframe after removing and adding features
  • test_modified: this is the modified test dataframe with the same transformations as train

Maintainers

Contributing

See the contributing file!

PRs accepted.

License

Apache License 2.0 © 2020 Ram Seshadri

DISCLAIMER

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.

Owner
AutoViz and Auto_ViML
Automated Machine Learning: Build Variant Interpretable Machine Learning models. Project Created by Ram Seshadri.
AutoViz and Auto_ViML
High-Resolution 3D Human Digitization from A Single Image.

PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization (CVPR 2020) News: [2020/06/15] Demo with Google Colab (i

Meta Research 8.4k Dec 29, 2022
Deep Latent Force Models

Deep Latent Force Models This repository contains a PyTorch implementation of the deep latent force model (DLFM), presented in the paper, Compositiona

Tom McDonald 5 Oct 26, 2022
Deduplicating Training Data Makes Language Models Better

Deduplicating Training Data Makes Language Models Better This repository contains code to deduplicate language model datasets as descrbed in the paper

Google Research 431 Dec 27, 2022
UFPR-ADMR-v2 Dataset

UFPR-ADMR-v2 Dataset The UFPR-ADMRv2 dataset contains 5,000 dial meter images obtained on-site by employees of the Energy Company of Paraná (Copel), w

Gabriel Salomon 8 Sep 29, 2022
This is a classifier which basically predicts whether there is a gun law in a state or not, depending on various things like murder rates etc.

Gun-Laws-Classifier This is a classifier which basically predicts whether there is a gun law in a state or not, depending on various things like murde

Awais Saleem 1 Jan 20, 2022
Fashion Landmark Estimation with HRNet

HRNet for Fashion Landmark Estimation (Modified from deep-high-resolution-net.pytorch) Introduction This code applies the HRNet (Deep High-Resolution

SVIP Lab 91 Dec 26, 2022
A toy project using OpenCV and PyMunk

A toy project using OpenCV, PyMunk and Mediapipe the source code for my LindkedIn post It's just a toy project and I didn't write a documentation yet,

Amirabbas Asadi 82 Oct 28, 2022
A multilingual version of MS MARCO passage ranking dataset

mMARCO A multilingual version of MS MARCO passage ranking dataset This repository presents a neural machine translation-based method for translating t

75 Dec 27, 2022
一个多语言支持、易使用的 OCR 项目。An easy-to-use OCR project with multilingual support.

AgentOCR 简介 AgentOCR 是一个基于 PaddleOCR 和 ONNXRuntime 项目开发的一个使用简单、调用方便的 OCR 项目 本项目目前包含 Python Package 【AgentOCR】 和 OCR 标注软件 【AgentOCRLabeling】 使用指南 Pytho

AgentMaker 98 Nov 10, 2022
abess: Fast Best-Subset Selection in Python and R

abess: Fast Best-Subset Selection in Python and R Overview abess (Adaptive BEst Subset Selection) library aims to solve general best subset selection,

297 Dec 21, 2022
Unconstrained Text Detection with Box Supervisionand Dynamic Self-Training

SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervisionand Dynamic Self-Training Introduction This is a PyTorch implementation of "

weijiawu 34 Nov 09, 2022
DGN pymarl - Implementation of DGN on Pymarl, which could be trained by VDN or QMIX

This is the implementation of DGN on Pymarl, which could be trained by VDN or QM

4 Nov 23, 2022
This repository provides an efficient PyTorch-based library for training deep models.

s3sec Test AWS S3 buckets for read/write/delete access This tool was developed to quickly test a list of s3 buckets for public read, write and delete

Bytedance Inc. 123 Jan 05, 2023
Generate vibrant and detailed images using only text.

CLIP Guided Diffusion From RiversHaveWings. Generate vibrant and detailed images using only text. See captions and more generations in the Gallery See

Clay M. 401 Dec 28, 2022
Automatically erase objects in the video, such as logo, text, etc.

Video-Auto-Wipe Read English Introduction:Here   本人不定期的基于生成技术制作一些好玩有趣的算法模型,这次带来的作品是“视频擦除”方向的应用模型,它实现的功能是自动感知到视频中我们不想看见的部分(譬如广告、水印、字幕、图标等等)然后进行擦除。由于图标擦

seeprettyface.com 141 Dec 26, 2022
An OpenAI-Gym Package for Training and Testing Reinforcement Learning algorithms with OpenSim Models

Authors: Utkarsh A. Mishra and Dr. Dimitar Stanev Advisors: Dr. Dimitar Stanev and Prof. Auke Ijspeert, Biorobotics Laboratory (BioRob), EPFL Video Pl

Utkarsh Mishra 16 Dec 13, 2022
HyperDict - Self linked dictionary in Python

Hyper Dictionary Advanced python dictionary(hash-table), which can link it-self

8 Feb 06, 2022
This is the official repository of XVFI (eXtreme Video Frame Interpolation)

XVFI This is the official repository of XVFI (eXtreme Video Frame Interpolation), https://arxiv.org/abs/2103.16206 Last Update: 20210607 We provide th

Jihyong Oh 195 Dec 29, 2022
NeurIPS-2021: Neural Auto-Curricula in Two-Player Zero-Sum Games.

NAC Official PyTorch implementation of NAC from the paper: Neural Auto-Curricula in Two-Player Zero-Sum Games. We release code for: Gradient based ora

Xidong Feng 19 Nov 11, 2022
Tooling for converting STAC metadata to ODC data model

手语识别 0、使用到的模型 (1). openpose,作者:CMU-Perceptual-Computing-Lab https://github.com/CMU-Perceptual-Computing-Lab/openpose (2). 图像分类classification,作者:Bubbl

Open Data Cube 65 Dec 20, 2022