Exploratory Data Analysis for Employee Retention Dataset

Overview

  • Employee turnover is a very costly problem for companies.
  • The cost of replacing an employee is often larger than 100K USD, taking into account the time spent interviewing and finding a replacement, placement fees, sign-on bonuses, and the loss of productivity for several months.
  • It is only natural, then, that data science has started being applied to this area.
  • Understanding why and when employees are most likely to leave can lead to actions that improve employee retention and help plan new hiring in advance. This application of data science is sometimes called people analytics or people data science.
  • We got employee data from a few companies. We have data about all employees who joined from 2011/01/24 to 2015/12/13. For each employee, we also know whether they are still at the company as of 2015/12/13 or have quit.
  • Besides that, we have general info about the employee, such as average salary during her tenure, department, and years of experience.

Goal:

In this challenge, you have a data set with info about the employees and have to predict when employees are going to quit by understanding the main drivers of employee churn.

  • Assume, for each company, that the headcount starts from zero on 2011/01/23. Estimate employee headcount, for each company, on each day, from 2011/01/24 to 2015/12/13. That is, if by 2012/03/02 a total of 2000 people have joined company 1 and 1000 of them have already quit, then the headcount for company 1 on 2012/03/02 would be 1000.
  • You should create a table with 3 columns: day, employee_headcount, company_id.
  • What are the main factors that drive employee churn? Do they make sense? Explain your findings.
  • If you could add to this data set just one variable that could help explain employee churn, what would that be?
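
One possible way to build the headcount table is sketched below. It is only a sketch under stated assumptions: the function name estimate_headcount is hypothetical, join_date and quit_date are assumed to be already parsed as pandas datetimes without a time component, and an employee is assumed to stop counting on the quit date itself.

```python
import pandas as pd


def estimate_headcount(df, start="2011-01-24", end="2015-12-13"):
    """Return a table with columns day, employee_headcount, company_id.

    Assumes join_date and quit_date are datetime columns (NaT = still employed)
    and that an employee no longer counts on the day they quit.
    """
    days = pd.date_range(start, end, freq="D")
    frames = []
    for company_id, group in df.groupby("company_id"):
        # Number of joins and quits on each calendar day (0 where nothing happened).
        joins = group["join_date"].value_counts().reindex(days, fill_value=0)
        quits = group["quit_date"].dropna().value_counts().reindex(days, fill_value=0)
        # Running total of net arrivals, starting from zero before the first day.
        headcount = (joins - quits).cumsum()
        frames.append(
            pd.DataFrame(
                {
                    "day": days,
                    "employee_headcount": headcount.to_numpy(),
                    "company_id": company_id,
                }
            )
        )
    return pd.concat(frames, ignore_index=True)
```

The cumulative sum of daily joins minus daily quits avoids scanning all employees once per day, which matters when the date range spans almost five years.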

Data: (data/employee_retention_data.csv)

Columns:

  • employee_id : id of the employee. Unique by employee per company
  • company_id : company id
  • dept : employee department
  • seniority : number of years of work experience when hired
  • salary : average yearly salary of the employee during her tenure within the company
  • join_date : when the employee joined the company; it can only be between 2011/01/24 and 2015/12/13
  • quit_date : when the employee left her job (if she is still employed as of 2015/12/13, this field is NA)
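
A minimal loading sketch for the columns above, assuming the dates in the CSV are in a format pandas can parse. The function name load_employee_data and the still_employed helper column are hypothetical and not part of the data set.

```python
import pandas as pd


def load_employee_data(source="data/employee_retention_data.csv"):
    """Read the CSV into a dataframe with join_date and quit_date parsed as dates."""
    df = pd.read_csv(source, parse_dates=["join_date", "quit_date"])
    # Hypothetical helper column: NA in quit_date means the employee has not quit.
    df["still_employed"] = df["quit_date"].isna()
    return df
```

Parsing the dates at load time keeps later steps, such as date arithmetic or daily counts, from having to convert strings repeatedly.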

Question 1

Function that returns a list of the names of categorical variables

  • Define a function with name get_categorical_variables
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return list of all categorical fields available.
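
A minimal sketch of Question 1, assuming "categorical" means any column that is neither numeric nor a date. Under that assumption, id columns stored as integers are not reported as categorical, and date columns left as plain strings would be.

```python
import pandas as pd


def get_categorical_variables(df):
    """Return the names of columns that are neither numeric nor date-like."""
    # Assumption: anything that is not a number, datetime, or timedelta is categorical.
    non_categorical = ["number", "datetime", "datetimetz", "timedelta"]
    return df.select_dtypes(exclude=non_categorical).columns.tolist()
```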

Question 2

Function that returns the list of the names of numeric variables

  • Define a function with name get_numerical_variables
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return list of all numerical fields available.
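
A minimal sketch of Question 2 using pandas dtype selection. Note that integer id columns such as employee_id and company_id would be reported as numeric here, even though they are identifiers rather than measurements.

```python
import pandas as pd


def get_numerical_variables(df):
    """Return the names of all columns with a numeric dtype."""
    return df.select_dtypes(include="number").columns.tolist()
```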

Question 3

Function that returns, for numeric variables, the mean, median, and 25th, 50th, and 75th percentiles

  • Define a function with name get_numerical_variables_percentile
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dataframe with following columns:
    • variable name
    • mean
    • median
    • 25th percentile
    • 50th percentile
    • 75th percentile
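
A minimal sketch of Question 3, using the column labels listed above verbatim. It relies on the pandas default (linear interpolation) for quantiles, so the median and 50th percentile columns will hold the same values.

```python
import pandas as pd


def get_numerical_variables_percentile(df):
    """Return one row per numeric column with its mean, median, and quartiles."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "variable name": numeric.columns,
            "mean": numeric.mean().to_numpy(),
            "median": numeric.median().to_numpy(),
            "25th percentile": numeric.quantile(0.25).to_numpy(),
            "50th percentile": numeric.quantile(0.50).to_numpy(),
            "75th percentile": numeric.quantile(0.75).to_numpy(),
        }
    )
```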

Question 4

For categorical variables, get modes

  • Define a function with name get_categorical_variables_modes
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dict object with one key per categorical column in the dataframe (for this data set, for example dept), each mapped to that column's mode
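
A minimal sketch of Question 4 that keys the result by whichever categorical columns the dataframe actually contains. Two choices here are assumptions: "categorical" is taken to mean non-numeric and non-date, and each value is a list so that ties between equally frequent values are preserved.

```python
import pandas as pd


def get_categorical_variables_modes(df):
    """Return {column name: list of most frequent values} for categorical columns."""
    non_categorical = ["number", "datetime", "datetimetz", "timedelta"]
    categorical = df.select_dtypes(exclude=non_categorical)
    # Series.mode() returns every value tied for the highest count, ignoring NA.
    return {col: categorical[col].mode().tolist() for col in categorical.columns}
```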

Question 5

For each column, list the count of missing values

  • Define a function with name get_missing_values_count
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dataframe with following columns:
    • var_name
    • missing_value_count
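
A minimal sketch of Question 5. For this data set, quit_date is the column expected to contain NA values, since it is empty for employees who are still employed.

```python
import pandas as pd


def get_missing_values_count(df):
    """Return a dataframe listing each column and its number of missing values."""
    counts = df.isna().sum()
    return pd.DataFrame(
        {
            "var_name": counts.index,
            "missing_value_count": counts.to_numpy(),
        }
    )
```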

Question 6

Plot histograms using different subplots of all the numerical values in a single plot

  • Define a function with name plot_histogram_with_numerical_values
  • Pass dataframe and list of columns you want to plot as parameter
  • Plot the graph
  • Add column names as plot names (if you don't understand this, please contact the instructor)
  • Change the histogram colour to yellow
  • Fit a normal curve on those histograms (if you don't understand this, please contact the instructor)
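
A minimal sketch of Question 6 with matplotlib. The grid layout (up to three plots per row), the bin count, and the edge colour are arbitrary choices, the list of columns is assumed to be non-empty, and the "fit" is simply a normal density drawn with each column's sample mean and standard deviation.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histogram_with_numerical_values(df, columns):
    """Draw one yellow histogram per column, each with a normal curve overlaid."""
    n_cols = min(3, len(columns))
    n_rows = math.ceil(len(columns) / n_cols)
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        values = df[col].dropna().to_numpy(dtype=float)
        # density=True puts the histogram on the same scale as the normal density.
        ax.hist(values, bins=30, density=True, color="yellow", edgecolor="black")
        mu, sigma = values.mean(), values.std()
        if sigma > 0:
            x = np.linspace(values.min(), values.max(), 200)
            pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
            ax.plot(x, pdf, color="black")
        ax.set_title(col)  # column name as the plot name
    # Hide any unused cells in the grid.
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig
```

The function returns the figure instead of calling plt.show(), so the caller can display it in a notebook or save it with fig.savefig.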

Owner
kana sudheer reddy
Currently studying at Presidency University, Bangalore