A fast, flexible, and performant feature selection package for python.

Overview

linselect

A fast, flexible, and performant feature selection package for python.

Package in a nutshell

It's built on stepwise linear regression

When passed data, the underlying algorithm seeks minimal variable subsets that produce good linear fits to the targets. This approach to feature selection strikes a competitive balance between performance, speed, and memory efficiency.

It has a simple API

A simple API makes it easy to quickly rank a data set's features in terms of their added value to a given fit. This is demoed below, where we learn that we can drop column 1 of X and still obtain a fit to y that captures 97.37% of its variance.

from linselect import FwdSelect
import numpy as np

X = np.array([[1,2,4], [1,1,2], [3,2,1], [10,2,2]])
y = np.array([[1], [-1], [-1], [1]])

selector = FwdSelect()
selector.fit(X, y)

print selector.ordered_features
print selector.ordered_cods
# [2, 0, 1]
# [0.47368422, 0.97368419, 1.0]

X_compressed = X[:, selector.ordered_features[:2]]

It's fast

A full sweep on a 1000 feature count data set runs in 10s on my laptop -- about one million times faster (seriously) than standard stepwise algorithms, which are effectively too slow to run at this scale. A 100 count feature set runs in 0.07s.

from linselect import FwdSelect
import numpy as np
import time

X = np.random.randn(5000, 1000)
y = np.random.randn(5000, 1)

selector = FwdSelect()

t1 = time.time()
selector.fit(X, y)
t2 = time.time()
print t2 - t1
# 9.87492

Its scores reveal your effective feature count

By plotting fitted CODs against ranked feature count, one often learns that seemingly high-dimensional problems can actually be understood using only a minority of the available features. The plot below demonstrates this: A fit to one year of AAPL's stock fluctuations -- using just 3 selected stocks as predictors -- nearly matches the performance of a 49-feature fit. The 3-feature fit arguably provides more insight and is certainly easier to reason about (cf. tutorials for details).

apple stock plot

It's flexible

linselect exposes multiple applications of the underlying algorithm. These allow for:

  • Forward, reverse, and general forward-reverse stepwise regression strategies.
  • Supervised applications aimed at a single target variable or simultaneous prediction of multiple target variables.
  • Unsupervised applications. The algorithm can be applied to identify minimal, representative subsets of an available column set. This provides a feature selection analog of PCA -- importantly, one that retains interpretability.

Under the hood

Feature selection algorithms are used to seek minimal column / feature subsets that capture the majority of the useful information contained within a data set. Removal of a selected subset's complement -- the relatively uninformative or redundant features -- can often result in a significant data compression and improved interpretability.

Stepwise selection algorithms work by iteratively updating a model feature set, one at a time [1]. For example, in a given step of a forward process, one considers all of the features that have not yet been added to the model, and then identifies that which would improve the model the most. This is added, and the process is then repeated until all features have been selected. The features that are added first in this way tend to be those that are predictive and also not redundant with those already included in the predictor set. Retaining only these first selected features therefore provides a convenient method for identifying minimal, informative feature subsets.

In general, identifying the optimal feature to add to a model in a given step requires building and scoring each possible updated model variant. This results in a slow process: If there are n features, O(n^2) models must be built to carry out a full ranking. However, the process can be dramatically sped up in the case of linear regression -- thanks to some linear algebra identities that allow one to efficiently update these models as features are either added or removed from their predictor sets [2,3]. Using these update rules, a full feature ranking can be carried out in roughly the same amount of time that is needed to fit only a single model. For n=1000, this means we get an O(n^2) = O(10^6) speed up! linselect makes use of these update rules -- first identified in [2] -- allowing for fast feature selection sweeps.

[1] Introduction to Statistical Learning by G. James, et al -- cf. chapter 6.

[2] M. Efroymson. Multiple regression analysis. Mathematical methods for digital computers, 1:191–203, 1960.

[3] J. Landy. Stepwise regression for unsupervised learning, 2017. arxiv.1706.03265.

Classes, documentation, tests, license

linselect contains three classes: FwdSelect, RevSelect, and GenSelect. As the names imply, these support efficient forward, reverse, and general forward-reverse search protocols, respectively. Each can be used for both supervised and unsupervised analyses.

Docstrings and basic call examples are illustrated for each class in the ./docs folder.

An FAQ and a running list of tutorials are available at efavdb.com/linselect.

Tests: From the root directory,

python setup.py test

This project is licensed under the terms of the MIT license.

Installation

The package can be installed using pip, from pypi

pip install linselect

or from github

pip install git+git://github.com/efavdb/linselect.git

Author

Jonathan Landy - EFavDB

Acknowledgments: Special thanks to P. Callier, P. Spanoudes, and R. Zhou for providing helpful feedback.

Weather analysis with Python, SQLite, SQLAlchemy, and Flask

Surf's Up Weather analysis with Python, SQLite, SQLAlchemy, and Flask Overview The purpose of this analysis was to examine weather trends (precipitati

Art Tucker 1 Sep 05, 2021
This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

1 Dec 28, 2021
Advanced Pandas Vault — Utilities, Functions and Snippets (by @firmai).

PandasVault ⁠— Advanced Pandas Functions and Code Snippets The only Pandas utility package you would ever need. It has no exotic external dependencies

Derek Snow 374 Jan 07, 2023
CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner.

CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner. It is aimed to integrate this tool with several more features including providing a U

Ravi Prakash 3 Jun 27, 2021
The Dash Enterprise App Gallery "Oil & Gas Wells" example

This app is based on the Dash Enterprise App Gallery "Oil & Gas Wells" example. For more information and more apps see: Dash App Gallery See the Dash

Austin Caudill 1 Nov 08, 2021
Monitor the stability of a pandas or spark dataframe ⚙︎

Population Shift Monitoring popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

ING Bank 403 Dec 07, 2022
Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
Exploring the Top ML and DL GitHub Repositories

This repository contains my work related to my project where I scraped data on the most popular machine learning and deep learning GitHub repositories in order to further visualize and analyze it.

Nico Van den Hooff 17 Aug 21, 2022
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

The leading use-case for the staircase package is for the creation and analysis of step functions. Pretty exciting huh. But don't hit the close button

48 Dec 21, 2022
Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

2 Nov 20, 2021
Useful tool for inserting DataFrames into the Excel sheet.

PyCellFrame Insert Pandas DataFrames into the Excel sheet with a bunch of conditions Install pip install pycellframe Usage Examples Let's suppose that

Luka Sosiashvili 1 Feb 16, 2022
Wafer Fault Detection - Wafer circleci with python

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

Avnish Yadav 14 Nov 21, 2022
Random dataframe and database table generator

Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien

Tirthajyoti Sarkar 249 Jan 08, 2023
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
Toolchest provides APIs for scientific and bioinformatic data analysis.

Toolchest Python Client Toolchest provides APIs for scientific and bioinformatic data analysis. It allows you to abstract away the costliness of runni

Toolchest 11 Jun 30, 2022
A set of functions and analysis classes for solvation structure analysis

SolvationAnalysis The macroscopic behavior of a liquid is determined by its microscopic structure. For ionic systems, like batteries and many enzymes,

MDAnalysis 19 Nov 24, 2022
The micro-framework to create dataframes from functions.

The micro-framework to create dataframes from functions.

Stitch Fix Technology 762 Jan 07, 2023
High Dimensional Portfolio Selection with Cardinality Constraints

High-Dimensional Portfolio Selecton with Cardinality Constraints This repo contains code for perform proximal gradient descent to solve sample average

Du Jinhong 2 Mar 22, 2022
Geospatial data-science analysis on reasons behind delay in Grab ride-share services

Grab x Pulis Detailed analysis done to investigate possible reasons for delay in Grab services for NUS Data Analytics Competition 2022, to be found in

Keng Hwee 6 Jun 07, 2022