distfit - Probability density fitting

Overview

distfit - Probability density fitting

Python PyPI Version License Github Forks GitHub Open Issues Project Status Downloads Downloads Sphinx Open In Colab

Star it if you like it!

Background

distfit is a python package for probability density fitting across 89 univariate distributions to non-censored data by residual sum of squares (RSS), and hypothesis testing. Probability density fitting is the fitting of a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. distfit scores each of the 89 different distributions for the fit wih the empirical distribution and return the best scoring distribution.

Functionalities

The distfit library is created with classes to ensure simplicity in usage.

# Import library
from distfit import distfit

dist = distfit()        # Specify desired parameters
dist.fit_transform(X)   # Fit distributions on empirical data X
dist.predict(y)         # Predict the probability of the resonse variables
dist.plot()             # Plot the best fitted distribution (y is included if prediction is made)

Installation

Install distfit from PyPI (recommended). distfit is compatible with Python 3.6+ and runs on Linux, MacOS X and Windows.

Install from PyPi

pip install distfit

Install directly from github source (beta version)

pip install git+https://github.com/erdogant/distfit#egg=master

Install by cloning (beta version)

git clone https://github.com/erdogant/distfit.git
cd distfit
pip install -U .

Check version number

import distfit
print(distfit.__version__)

Examples

Import distfit library

from distfit import distfit

Create Some random data and model using default parameters:

import numpy as np
X = np.random.normal(0, 2, [100,10])
y = [-8,-6,0,1,2,3,4,5,6]

Specify distfit parameters. In this example nothing is specied and that means that all parameters are set to default.

dist = distfit(todf=True)
dist.fit_transform(X)
dist.plot()

# Prints the screen:
# [distfit] >fit..
# [distfit] >transform..
# [distfit] >[norm      ] [RSS: 0.0133619] [loc=-0.059 scale=2.031] 
# [distfit] >[expon     ] [RSS: 0.3911576] [loc=-6.213 scale=6.154] 
# [distfit] >[pareto    ] [RSS: 0.6755185] [loc=-7.965 scale=1.752] 
# [distfit] >[dweibull  ] [RSS: 0.0183543] [loc=-0.053 scale=1.726] 
# [distfit] >[t         ] [RSS: 0.0133619] [loc=-0.059 scale=2.031] 
# [distfit] >[genextreme] [RSS: 0.0115116] [loc=-0.830 scale=1.964] 
# [distfit] >[gamma     ] [RSS: 0.0111372] [loc=-19.843 scale=0.209] 
# [distfit] >[lognorm   ] [RSS: 0.0111236] [loc=-29.689 scale=29.561] 
# [distfit] >[beta      ] [RSS: 0.0113012] [loc=-12.340 scale=41.781] 
# [distfit] >[uniform   ] [RSS: 0.2481737] [loc=-6.213 scale=12.281] 

Note that the best fit should be [normal], as this was also the input data. However, many other distributions can be very similar with specific loc/scale parameters. It is however not unusual to see gamma and beta distribution as these are the "barba-pappas" among the distributions. Lets print the summary of detected distributions with the Residual Sum of Squares.

# All scores of the tested distributions
print(dist.summary)

# Distribution parameters for best fit
dist.model

# Make plot
dist.plot_summary()

After we have a fitted model, we can make some predictions using the theoretical distributions. After making some predictions, we can plot again but now the predictions are automatically included.

dist.predict(y)
dist.plot()
# 
# Prints to screen:
# [distfit] >predict..
# [distfit] >Multiple test correction..[fdr_bh]

The results of the prediction are stored in y_proba and y_pred

# Show the predictions for y
print(dist.results['y_pred'])
# ['down' 'down' 'none' 'none' 'none' 'none' 'up' 'up' 'up']

# Show the probabilities for y that belong with the predictions
print(dist.results['y_proba'])
# [2.75338375e-05 2.74664877e-03 4.74739680e-01 3.28636879e-01 1.99195071e-01 1.06316132e-01 5.05914722e-02 2.18922761e-02 8.89349927e-03]
 
# All predicted information is also stored in a structured dataframe
print(dist.results['df'])
#    y   y_proba y_pred         P
# 0 -8  0.000028   down  0.000003
# 1 -6  0.002747   down  0.000610
# 2  0  0.474740   none  0.474740
# 3  1  0.328637   none  0.292122
# 4  2  0.199195   none  0.154929
# 5  3  0.106316   none  0.070877
# 6  4  0.050591     up  0.028106
# 7  5  0.021892     up  0.009730
# 8  6  0.008893     up  0.002964

Example if you want to test one specific distributions, such as the normal distribution:

The full list of distributions is listed here: https://erdogant.github.io/distfit/pages/html/Parametric.html

dist = distfit(distr='norm')
dist.fit_transform(X)

# [distfit] >fit..
# [distfit] >transform..
# [distfit] >[norm] [RSS: 0.0151267] [loc=0.103 scale=2.028]

dist.plot()

Example if you want to test multiple distributions, such as the normal and t distribution:

The full list of distributions is listed here: https://erdogant.github.io/distfit/pages/html/Parametric.html

dist = distfit(distr=['norm', 't', 'uniform'])
results = dist.fit_transform(X)

# [distfit] >fit..
# [distfit] >transform..
# [distfit] >[norm   ] [0.00 sec] [RSS: 0.0012337] [loc=0.005 scale=1.982]
# [distfit] >[t      ] [0.12 sec] [RSS: 0.0012336] [loc=0.005 scale=1.982]
# [distfit] >[uniform] [0.00 sec] [RSS: 0.2505846] [loc=-6.583 scale=15.076]
# [distfit] >Compute confidence interval [parametric]

Example to fit for discrete distribution:

from scipy.stats import binom
# Generate random numbers

# Set parameters for the test-case
n = 8
p = 0.5

# Generate 10000 samples of the distribution of (n, p)
X = binom(n, p).rvs(10000)
print(X)

# [5 1 4 5 5 6 2 4 6 5 4 4 4 7 3 4 4 2 3 3 4 4 5 1 3 2 7 4 5 2 3 4 3 3 2 3 5
#  4 6 7 6 2 4 3 3 5 3 5 3 4 4 4 7 5 4 5 3 4 3 3 4 3 3 6 3 3 5 4 4 2 3 2 5 7
#  5 4 8 3 4 3 5 4 3 5 5 2 5 6 7 4 5 5 5 4 4 3 4 5 6 2...]

# Initialize distfit for discrete distribution for which the binomial distribution is used. 
dist = distfit(method='discrete')

# Run distfit to and determine whether we can find the parameters from the data.
dist.fit_transform(X)

# [distfit] >fit..
# [distfit] >transform..
# [distfit] >Fit using binomial distribution..
# [distfit] >[binomial] [SSE: 7.79] [n: 8] [p: 0.499959] [chi^2: 1.11]
# [distfit] >Compute confidence interval [discrete]

# Get the model and best fitted parameters.
print(dist.model)

# {'distr': 
   
    ,
   
#  'params': (8, 0.4999585504197037),
#  'name': 'binom',
#  'SSE': 7.786589839641551,
#  'chi2r': 1.1123699770916502,
#  'n': 8,
#  'p': 0.4999585504197037,
#  'CII_min_alpha': 2.0,
#  'CII_max_alpha': 6.0}

# Best fitted n=8 and p=0.4999 which is great because the input was n=8 and p=0.5
dist.model['n']
dist.model['p']

# Make plot
dist.plot()

# With the fitted model we can start making predictions on new unseen data
y = [0, 1, 10, 11, 12]
results = dist.predict(y)
dist.plot()

# Make plot with the results
dist.plot()

df_results = pd.DataFrame(pd.DataFrame(results))

#   y   y_proba    y_pred   P
#   0   0.004886   down     0.003909
#   1   0.035174   down     0.035174
#   10  0.000000     up     0.000000
#   11  0.000000     up     0.000000
#   12  0.000000     up     0.000000

Example to generate samples based on the fitted distribution:

# import library
from distfit import distfit

# Generate random normal distributed data
X = np.random.normal(0, 2, 10000)
dist = distfit()

# Fit
dist.fit_transform(X)

# The fitted distribution can now be used to generate new samples.
# Generate samples
Xgenerate = dist.generate(n=1000)

Citation

Please cite distfit in your publications if this is useful for your research. See right top panel for the citation entry.


### Maintainer
	Erdogan Taskesen, github: [erdogant](https://github.com/erdogant)
	Contributions are welcome.
Comments
  • Fitting distribution for discrete/categorical data

    Fitting distribution for discrete/categorical data

    Hi

    Is it possible to fit a distribution with distfit library for a discrete variable? For example, let's say I have a survey that has 10 questions with possible values that go from 1 (poor) to 5 (excellent), and 100 persons take the survey.

    Best regards

    opened by ogreyesp 5
  • Can I use the best distribution as the true distribution of my data?

    Can I use the best distribution as the true distribution of my data?

    Here I used distfit to get a distribution that is the closest to my data,but not exactly。When I use the kstest from the scipy library to calculate the p-value to see if I can trust the distribution, the p-value is not ideal.Can I still use distfit to get a distribution to describe my data ?

    opened by yuanfuqiang456 3
  • in plot api, pass fig and ax to give more control to the user's code

    in plot api, pass fig and ax to give more control to the user's code

    Thanks for this great library.

    Purpose of this modification: I have been using it with a multivariate time series dataset. Each dimension gets its own plot and wanted to make use of subplots to see all the dimensions at the same time (in a grid for e.g.)

    Notes: a) I have added fig as the parameter to the plotting API as well. Generally, it is not required. I have done it so as to not create a situation where the number of return values is 1. This way your function always return 2 values (the tuple).

    b) Instead of using plt.xlim and plt.ylim, I am using ax.set_xlim & ax.set_ylim. This should work for previous version and for this modification as well.

    c) For now if the method is 'discrete' then passed fig and axes are ignored since the plot_binom function creates subplots internally.

    opened by ksachdeva 3
  • Add loggamma

    Add loggamma

    I have a problem where loggamma fits best. I ran your script and my own custom script, they agree on beta parameters but the loggamma seemed much more natural. If it's not too much trouble, please consider adding this. If you are using scipy.stats, then it's the same API as others.

    Cool project.

    opened by tirthajyoti 3
  • Two questions about distfit

    Two questions about distfit

    This project looks really great, thank you. I have two questions:

    • How do you set loc = 0 if you know that is the right value for it? I am trying to fit to a symmetric distribution.
    • When I try distfit with distr='full' it gets stuck at levy_l. Is this expected?
    opened by lesshaste 3
  • Plots are not generated

    Plots are not generated

    Hi,

    Both dist.plot() and dist.plot_summary() do not generate plots for me. I am using the bare version of Python (i.e no Conda etc.)

    Am I missing somethings?

    Regards,

    Danish

    opened by danishTUE 2
  • T Distribution Weirdness

    T Distribution Weirdness

    We are using distfit to try to determine if some data we have can be modelled parametrically. For some of the data, the best fitting distribution was a t. Scale and loc are clearly documented, and that is great. There is one remaining parameter to fit a t distribution, and that is degrees of freedom. Except, the one parameter in the distfit output that isn't a scale or loc value is less than one. Obviously, degrees of freedom can't be less than one. So what is that parameter and why isn't degrees of freedom included in the output? It would be helpful for automating our process.

    opened by angelgeek 2
  • Save best parameters

    Save best parameters

    Hello, your package really useful, thanks a lot!

    I have a question: If I want to print the best parameters, what's the syntax? For example, I want to print the best n and p for binomial distribution for the following work.

    thanks a lot

    opened by hummm310 2
  • Remove plt.show() calls

    Remove plt.show() calls

    Thank you for your time spent making this package.

    When you call plt.show(), you've rendered the plot and it can no longer be modified by the user, making it pointless to return the figure and axes objects.

    For example, try:

    fig, ax = dist.plot()
    ax.axvline(x=0)
    plt.title("Blarg!")
    

    Unlike sns plots and dataframe.plot() calls that many are familiar with, the plots of distfit cannot be modified after called. This is surprising to the user (at least it was to me 😀)

    opened by isosphere 2
  • The `distr` parameter should accept a list

    The `distr` parameter should accept a list

    The distr parameter in your core distfit class should accept a custom list of distributions that the user wants to run fitting on. Is there a specific reason you have not allowed it to accept a list?

    opened by tirthajyoti 2
  • `generate` or `rvs` method?

    `generate` or `rvs` method?

    Do you plan to have a generate or rvs method added to a fitted dist class to generate a given number (chosen by a size parameter) of new points with the best-fitted distribution? Here is the imagined code (say I have a dataset called dataset)

    dist = distfit(todf=True)
    dist.fit_transform(dataset)
    
    # Newly generated 1000 points from the best-fitted distribution (based on some score criteria)
    new_data = dist.generate(size=1000)
    
    opened by tirthajyoti 2
  • Robustness of selected data models

    Robustness of selected data models

    Good day!

    Guys, I have found your package really cool) Thanks a lot)

    I have a question:

    Our incoming data can be with anomalies, noise. So, quality of our results is vulnerable to strong/weak outliers. Work with outliers is key feature of your package. Consequently, the quality of predictions based on our data model can be severely compromised. In a sense, we are training and predicting from the same data.

    What is your advice?

    I understand that, it is largely dependent on and provided by the nature of one or another theoretical distribution of data.

    But, better to know, your personal opinion as authors...

    opened by datason 1
  • Add K distribution

    Add K distribution

    What a really awesome repository !

    By the way, K distribution is widely used in the filed of Radar and sonar. It is necessary to estimate the parameters of the K distribution.

    Please consider adding this distribution if possible.

    opened by ShaofengZou 3
  • KS-test in fitdist

    KS-test in fitdist

    Hello everyone,

    I noticed in the code erdogant/distfit/distfit.py that whenever you use the KS statistical test (stats=ks), you call the scipy.stats.ks_2samp to test your data against the distribution you estimated through MLE (maximum likelihood estimation). Is that true? If so, this is wrong, because now the KS statistic depends on your data and the test is no longer valid. In such a case, I would recommend you to have a look at parametric/non-parametric bootstrapping to solve the issue. This reference could be useful https://ui.adsabs.harvard.edu/abs/2006ASPC..351..127B/abstract

    opened by marcellobullo 10
Releases(1.4.5)
pymc-learn: Practical Probabilistic Machine Learning in Python

pymc-learn: Practical Probabilistic Machine Learning in Python Contents: Github repo What is pymc-learn? Quick Install Quick Start Index What is pymc-

pymc-learn 196 Dec 07, 2022
Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining

**Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining.** S

Sebastian Raschka 4k Dec 30, 2022
A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

KXY Technologies, Inc. 35 Jan 02, 2023
Tribuo - A Java machine learning library

Tribuo - A Java prediction library (v4.1) Tribuo is a machine learning library in Java that provides multi-class classification, regression, clusterin

Oracle 1.1k Dec 28, 2022
Data Efficient Decision Making

Data Efficient Decision Making

Microsoft 197 Jan 06, 2023
Visualize classified time series data with interactive Sankey plots in Google Earth Engine

sankee Visualize changes in classified time series data with interactive Sankey plots in Google Earth Engine Contents Description Installation Using P

Aaron Zuspan 76 Dec 15, 2022
决策树分类与回归模型的实现和可视化

DecisionTree 决策树分类与回归模型,以及可视化 DecisionTree ID3 C4.5 CART 分类 回归 决策树绘制 分类树 回归树 调参 剪枝 ID3 ID3决策树是最朴素的决策树分类器: 无剪枝 只支持离散属性 采用信息增益准则 在data.py中,我们记录了一个小的西瓜数据

Welt Xing 10 Oct 22, 2022
LinearRegression2 Tvads and CarSales

LinearRegression2_Tvads_and_CarSales This project infers the insight that how the TV ads for cars and car Sales are being linked with each other. It i

Ashish Kumar Yadav 1 Dec 29, 2021
#30DaysOfStreamlit is a 30-day social challenge for you to build and deploy Streamlit apps.

30 Days Of Streamlit 🎈 This is the official repo of #30DaysOfStreamlit — a 30-day social challenge for you to learn, build and deploy Streamlit apps.

Streamlit 53 Jan 02, 2023
This repo implements a Topological SLAM: Deep Visual Odometry with Long Term Place Recognition (Loop Closure Detection)

This repo implements a topological SLAM system. Deep Visual Odometry (DF-VO) and Visual Place Recognition are combined to form the topological SLAM system.

Best of Australian Centre for Robotic Vision (ACRV) 32 Jun 23, 2022
A webpage that utilizes machine learning to extract sentiments from tweets.

Tweets_Classification_Webpage The goal of this project is to be able to predict what rating customers on social media platforms would give to products

Ayaz Nakhuda 1 Dec 30, 2021
The Emergence of Individuality

The Emergence of Individuality

16 Jul 20, 2022
Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

(intron I nterrogator and C lassifier) intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), usi

Graham Larue 4 Jul 26, 2022
Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

pyspark-anonymizer Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark envir

6 Jun 30, 2022
Simple, light-weight config handling through python data classes with to/from JSON serialization/deserialization.

Simple but maybe too simple config management through python data classes. We use it for machine learning.

Eren Gölge 67 Nov 29, 2022
Tools for mathematical optimization region

Tools for mathematical optimization region

林景 15 Nov 30, 2022
LightGBM + Optuna: no brainer

AutoLGBM LightGBM + Optuna: no brainer auto train lightgbm directly from CSV files auto tune lightgbm using optuna auto serve best lightgbm model usin

Rishiraj Acharya 22 Dec 15, 2022
MLFlow in a Dockercontainer based on Azurite and Postgres

mlflow-azurite-postgres docker This is a MLFLow image which works with a postgres DB and a local Azure Blob Storage Instance (Azurite). This image is

2 May 29, 2022
Combines Bayesian analyses from many datasets.

PosteriorStacker Combines Bayesian analyses from many datasets. Introduction Method Tutorial Output plot and files Introduction Fitting a model to a d

Johannes Buchner 19 Feb 13, 2022
PennyLane is a cross-platform Python library for differentiable programming of quantum computers

PennyLane is a cross-platform Python library for differentiable programming of quantum computers. Train a quantum computer the same way as a neural ne

PennyLaneAI 1.6k Jan 01, 2023