Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Overview

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python πŸ“Š

Last updated on January 30, 2022 by Thomas J. Nicoletti

I would like to preface this document by stating this is my second major project using Python. From my first project to now, I certainly improved upon my understanding of and proficiency with Python, though I still have a long journey ahead of me. I aim to keep learning more and more everyday, and hope this project provides some benefit to the greater applied social science community.

The purpose of this data mining script is to use random forest classification, in conjunction with factor analysis and other analytic techniques, to automatically yield feature importance metrics and related output for a driver analysis. Driver analysis quantifies the importance of independent variables (i.e., drivers) in predicting some outcome variable. Within this repository is a basic, simulated dataset created by me, containing five independent variables and one outcome variable. I am by no means an expert in simulating datasets, so I encourage everyone to use real-world data as a stress test for this statistical tool.

This tool will communicate with users using simple inputs via the Command Prompt. Once all mandatory and optional inputs are received, the analysis will run and send relevant information to the source folder; this potentially includes text files, images, and data files useful for model comprehension and validation, as well as statistically- and conceptually-informed decision-making. The most useful outputs will include the automatically generated feature importance plot and feature quadrant chart.

πŸ’» Installation and Preparation

Please note that excerpts of code provided below are examples based on the driver.py script. As a self-taught programmer, I suggest reading through my insights, mixing them with a quick Google search and your own experiences, and then delving into the script itself.

For this project, I used Python 3.9, the Microsoft Windows operating system, and Microsoft Excel. As such, these act as the prerequisites for utilizing this repository successfully without any additional troubleshooting. Going forward, please ensure everything you download or install for this project ends up in the correct location (e.g., the same source folder).

Use pip to install relevant packages to the proper source folder using the Command Prompt and correct PATH. For example:

pip install numpy
pip install pandas

Please be sure to install each of the following packages: easygui, matplotlib, numpy, pandas, seaborn, string, factor_analyzer, scipy, sklearn, and statsmodels. If required, use the first section of the script to determine lacking dependencies, and proceed accordingly.

πŸ“‘ Script Breakdown

The script begins by calling relevant libraries in Python, as well as defining Mahalanobis distance, which is used to identify multivariate outliers in a later step of this project. Additionally, the Command Prompt will read a simple set of instructions for the user, including important information regarding categorical features, the location of the outcome variable within the dataset, and a required revision for missing data. Furthermore, the script will allow the user to specify a random seed for easy replication of this driver analysis at a later date:

import easygui
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
...
def mahalanobis(x = None, data = None, cov = None):
	mu = x - np.mean(data)
    ...
	return mah.diagonal()
...
seed = int(input('Please enter your numerical random seed for replication purposes: '))
np.random.seed(seed)
text = open('random_seed.txt', 'w')

The script has an entire section dedicated to understanding your dataset, including a quick process for uploading your data file, removing missing data, adding an outlier status variable, determining the final sample size, classifying variables, and so on:

df = pd.read_csv(easygui.fileopenbox())
df.dropna(inplace = True)
df['Mahalanobis'] = mahalanobis(x = df, data = df.iloc[:, :len(df.columns)], cov = None)
df['PValue'] = 1 - chi2.cdf(df['Mahalanobis'], len(df.columns) - 1)
...
n = df.shape[0]
text = open('sample_size.txt', 'w')
...
x = df.iloc[:, :-1]
y = np.ravel(df.iloc[:, -1])
feat = df.columns[:-1]
mean = np.array(df.describe().loc['mean'][:-1])

The script then checks for relevant statistical assumptions needed before determining if your dataset is appropriate for factor analysis. This includes Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin Test. Additionally, a scree plot is produced using principal components analysis to assist in factor analysis decision-making. Once all of this is reviewed, the user will provide relevant inputs regarding their driver analysis model:

bart = calculate_bartlett_sphericity(x)
bart = (str(round(bart[1], 2)))
text = open('sphericity.txt', 'w')
...
kmo = calculate_kmo(x)
kmo = (str(round(kmo[1], 2)))
text = open('factorability.txt', 'w')
...
pca = PCA()
pca.fit(x)
comp = np.arange(pca.n_components_)
plt.figure()

When it comes to choosing whether to run random forest classification on the original variables or transformed factors, the above information is critical. The user will be able to decide both A) whether or not to use factor analysis, and B) how many factors should be used in extraction if applicable. Additionally, if the user opts for the factor analysis route, they will also be able to determine whether all the factors or just the highest loading variable per factor should be used (please see lines 139-150 in the script). The following optional factor analysis and mandatory core analyses will run based on user specifications from the previous step:

fa = FactorAnalysis(n_components = factor, max_iter = 3000, rotation = 'varimax')
...
x = fa.transform(x)
...
load = pd.DataFrame(fa.components_.T.round(2), columns = cols, index = feat)
load.to_csv('factor_loadings.csv')
...
vif = pd.Series(variance_inflation_factor(x.values, i) for i in range(x.shape[1]))
vif = pd.DataFrame(np.array(vif.round(2)), columns = ['Variable Inflation Factor'], index = feat)
vif.T.to_csv('variable_inflation_factors.csv')
clf = RandomForestClassifier(n_estimators = 100, criterion = 'gini', max_features = 'auto', bootstrap = True, oob_score = True, class_weight = 'balanced').fit(x, y)
oob = str(round(clf.oob_score_, 2)).ljust(4, '0')
pred = clf.predict_proba(x)
loss = str(round(log_loss(y, pred), 2)).ljust(4, '0')
perf = pd.DataFrame({'Out-of-Bag Score': oob, 'Log Loss': loss}, index = ['Estimate'])
perf.to_csv('model_performance.csv')

Please note, the only current rotation method available in Python for factor analysis is varimax, as far as I know. If another rotation method is preferred, I would opt out of the factor analysis route, or try implementing your own solution from scratch. From these results, the feature importance plot and its respective feature quadrant chart can be graphed and saved automatically to the source folder. This is an especially useful and efficient data visualization tool to help express which variable(s) are most important in predicting your outcome. It also saves you quite a bit of time compared to graphing it yourself!

imp = clf.feature_importances_
sort = np.argsort(imp)
plt.figure()
plt.barh(range(len(sort)), imp[sort], color = 'mediumaquamarine', align = 'center')
plt.title('Feature Importance Plot')
plt.xlabel('Derived Importance β†’')
...
imps = []
score = []
for i, feat in enumerate(imp[sort]):
  imps.append(round(feat / imp[sort].mean() * 100, 0))
for i, feat in enumerate(mean[sort]):
  score.append(round(feat / mean[sort].mean() * 100, 0))
quad = pd.DataFrame({'Rescaled Observed Score β†’': score, 'Rescaled Derived Importance β†’': imps,
  'Feature': x.columns[sort]})

To run the script, I suggest using a batch file located in the source folder as follows:

python driver.py
PAUSE

Although the entire script is not reflected in the above breakdown, this information should prove helpful in getting the user accustomed to what this script aims to achieve. If any additional information and/or explanations are desired, please do not hesitate in reaching out!

πŸ“‹ Next Steps

Although I feel this project is solid in its current state, I think one area of improvement would fall in the realm of optimizing the script and making it more pythonic. I am also quite interested in hearing feedback from users, including their field of practice, which variables they used for their analyses, and how satisfied they were with this statistical tool overall.

πŸ’‘ Community Contribution

I am always happy to receive feedback, recommendations, and/or requests from anyone, especially new learners. Please click here for information about the license for this project.

❔ Project Support

Please let me know if you plan to make changes to this project, or adapt the script to a project of your own interest. We can certainly collaborate to make this process as painless as possible!

πŸ“š Additional Resources

  • My current work in market research introduced me to the idea of driver analysis and its usefulness; this statistical tool was created with that space in mind, though it is certainly applicable to all applied areas of business and social science
  • To learn more about calculating random forest classification in Python, click here to access scikit-learn
  • To learn more about calculating factor analysis in Python, click here to access scikit-learn
  • For easy-to-use text editing software, check out Sublime Text for Python and Atom for Markdown
Owner
Thomas
With a passion for research, I am eager to build upon my knowledge of statistical programming. My current areas of focus include data mining and psychometrics.
Thomas
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
Extract data from a wide range of Internet sources into a pandas DataFrame.

pandas-datareader Up to date remote data access for pandas, works for multiple versions of pandas. Installation Install using pip pip install pandas-d

Python for Data 2.5k Jan 09, 2023
VHub - An API that permits uploading of vulnerability datasets and return of the serialized data

VHub - An API that permits uploading of vulnerability datasets and return of the serialized data

AndrΓ© Rodrigues 2 Feb 14, 2022
Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt Labs 6.3k Jan 08, 2023
Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
Python package to transfer data in a fast, reliable, and packetized form.

pySerialTransfer Python package to transfer data in a fast, reliable, and packetized form.

PB2 101 Dec 07, 2022
International Space Station data with Python research 🌎

International Space Station data with Python research 🌎 Plotting ISS trajectory, calculating the velocity over the earth and more. Plotting trajector

Facundo Pedaccio 41 Jun 16, 2022
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022
nrgpy is the Python package for processing NRG Data Files

nrgpy nrgpy is the Python package for processing NRG Data Files Website and source: https://github.com/nrgpy/nrgpy Documentation: https://nrgpy.github

NRG Tech Services 23 Dec 08, 2022
vartests is a Python library to perform some statistic tests to evaluate Value at Risk (VaR) Models

gg I wasn't satisfied with any of the other available Gemini clients, so I wrote my own. Requires Python 3.9 (maybe older, I haven't checked) and opti

RAFAEL RODRIGUES 5 Jan 03, 2023
Analyzing Covid-19 Outbreaks in Ontario

My group and I took Covid-19 outbreak statistics from ontario, and analyzed them to find different patterns and future predictions for the virus

Vishwaajeeth Kamalakkannan 0 Jan 20, 2022
Time ranges with python

timeranges Time ranges. Read the Docs Installation pip timeranges is available on pip: pip install timeranges GitHub You can also install the latest v

Micael Jarniac 2 Sep 01, 2022
Kennedy Institute of Rheumatology University of Oxford Project November 2019

TradingBot6M Kennedy Institute of Rheumatology University of Oxford Project November 2019 Run Change api.txt to binance api key: https://www.binance.c

Kannan SAR 2 Nov 16, 2021
X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

Nguyα»…n Quang Huy 5 Sep 28, 2022
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm

Jacob Schreiber 457 Dec 20, 2022
Program that predicts the NBA mvp based on data from previous years.

NBA MVP Predictor A machine learning model using RandomForest Regression that predicts NBA MVP's using player data. Explore the docs Β» View Demo Β· Rep

Muhammad Rabee 1 Jan 21, 2022
This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

Ishan Hegde 1 Nov 17, 2021
A Python package for modular causal inference analysis and model evaluations

Causal Inference 360 A Python package for inferring causal effects from observational data. Description Causal inference analysis enables estimating t

International Business Machines 506 Dec 19, 2022
Collections of pydantic models

pydantic-collections The pydantic-collections package provides BaseCollectionModel class that allows you to manipulate collections of pydantic models

Roman Snegirev 20 Dec 26, 2022
A Python package for the mathematical modeling of infectious diseases via compartmental models

A Python package for the mathematical modeling of infectious diseases via compartmental models. Originally designed for epidemiologists, epispot can be adapted for almost any type of modeling scenari

epispot 12 Dec 28, 2022