To effectively detect the faulty wafers

Overview

wafer_fault_detection

Aim of the project:

In electronics, a wafer (also called a slice or substrate) is a thin slice of semiconductor, such as crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate for microelectronic devices built in and upon it. The project aims to identify the state of a given wafer by classifying it into one of two classes: +1 (good, can be used as a substrate) or -1 (bad, the substrate needs to be replaced). To this end, a training dataset is provided to build a machine learning classification model that predicts wafer quality.

Data Description:

The columns of the provided data fall into three groups: wafer name, sensor values, and label. The wafer name contains the batch number of the wafer, whereas the sensor values are obtained from measurements carried out on the wafer. The label column contains two unique values, +1 and -1, that identify whether the wafer is good or needs to be replaced. We also require a schema file, which contains the relevant information about the training files: file names, length of the date value in the file name, length of the time value in the file name, number of columns, column names, and their datatypes.
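As an illustration of the kind of information the schema file carries, here is a minimal sketch that parses a JSON schema. The JSON format, every key name, the sample file name, and the column names are assumptions made for this example; the project's actual schema file may be laid out differently.

```python
import json

# Hypothetical schema content; key names and values are illustrative only.
SCHEMA_TEXT = """
{
  "sample_file_name": "wafer_08012020_120000.csv",
  "date_length": 8,
  "time_length": 6,
  "column_count": 4,
  "columns": {
    "Wafer": "varchar",
    "Sensor-1": "float",
    "Sensor-2": "float",
    "Output": "integer"
  }
}
"""

schema = json.loads(SCHEMA_TEXT)

# Column names in the order the schema lists them.
column_names = list(schema["columns"])

# The last column is assumed to be the label (+1 or -1).
label_column = column_names[-1]
```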

Directory creation:

All the necessary folders were created to effectively separate the files so that the end-user can get easy access to them.

Data Validation:

In this step, we compare each data file against the provided schema file: the file name, the number of columns, the column names, and their datatypes. If a file matches the schema, it is considered a good file that can be used for training or prediction; otherwise it is considered bad and moved to the bad folder. We also identify columns with null values. If an entire column is missing, the file is considered bad; if only a fraction of the values in a column is missing, we fill them with NaN and treat the file as good data.
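The checks described above could be sketched roughly as follows. The `wafer_<date>_<time>.csv` naming pattern, the function names, and the default lengths are assumptions made for illustration, and the datatype check is left out to keep the example short.

```python
import re

def is_valid_file_name(name, date_length, time_length):
    """Check the assumed 'wafer_<date>_<time>.csv' naming pattern."""
    pattern = rf"wafer_\d{{{date_length}}}_\d{{{time_length}}}\.csv"
    return re.fullmatch(pattern, name) is not None

def has_empty_column(rows):
    """Return True if some column has no value in any row."""
    columns = rows[0].keys() if rows else []
    return any(all(row[col] in ("", None) for row in rows) for col in columns)

def is_good_file(name, header, rows, expected_columns, date_length=8, time_length=6):
    """Apply three checks: file name, column names, and fully empty columns."""
    return (
        is_valid_file_name(name, date_length, time_length)
        and list(header) == list(expected_columns)
        and not has_empty_column(rows)
    )
```

A file with a few missing sensor values still passes; only a column that is empty in every row causes a rejection.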

Data Insertion in Database:

First, we create a database with the given name. If the database already exists, we open a connection to it. A table named "train_good_raw_dt" or "pred_good_raw_dt" is created in the database, depending on whether we are training or predicting, to hold the good data files obtained from the data validation step. If the table already exists, no new table is created and the new files are inserted into the existing table, because we want training to use both new and old training files. Finally, the data stored in the database is exported as a CSV file to be used for model training.
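A minimal sketch of this create-if-missing, append, and export flow is shown below. It assumes SQLite through Python's built-in `sqlite3` module, which the description does not confirm; the function names and column definitions are hypothetical.

```python
import csv
import io
import sqlite3

def insert_good_rows(conn, table, columns, rows):
    """Create the table only if it is missing, then append the rows."""
    col_defs = ", ".join(f'"{name}" {dtype}' for name, dtype in columns.items())
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()

def export_as_csv_text(conn, table):
    """Return the whole table as CSV text, header row first."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor)  # NULL values are written as empty fields
    return buffer.getvalue()
```

Because the table is created with `IF NOT EXISTS`, calling `insert_good_rows` again for a later batch adds to the rows already stored instead of replacing them.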

Data Pre-processing and Model Training:

In the training section, the data is first checked for NaN values in the columns. If any are present, they are imputed using the KNN imputer. Columns with zero standard deviation are also identified and removed, as they carry no information for model training. A prediction schema is then created from the remaining dataset columns.

Afterwards, the KMeans algorithm is used to create clusters in the pre-processed data. The optimum number of clusters is selected from the elbow plot, and for the dynamic selection of the number of clusters we use the "KneeLocator" function. The idea behind clustering is to train different algorithms on different clusters of the data. The KMeans model is trained on the pre-processed data and saved for later use in prediction.

After the clusters are created, we find the best model for each cluster. We use four algorithms: "Random Forest", "K Neighbours", "Logistic Regression" and "XGBoost". For each cluster, every algorithm is run with the best parameters derived from GridSearch. We calculate the AUC score of each model and select the one with the best score. The best model is selected in the same way for every cluster, and all of these models are saved for use in prediction. Finally, the confusion matrix of the model associated with each cluster is also saved to give a quick view of model performance.
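Two of the pre-processing ideas above, dropping zero-variance columns and locating the elbow of the within-cluster sum of squares (WCSS) curve, can be sketched in plain Python. The elbow search here is a simplified stand-in that picks the point farthest below the line joining the curve's endpoints; it is not the "KneeLocator" API, and the function names and data layout are assumptions.

```python
from statistics import pstdev

def drop_constant_columns(table):
    """Remove columns whose standard deviation is zero.

    `table` maps a column name to its list of numeric values.
    """
    return {name: values for name, values in table.items() if pstdev(values) > 0}

def elbow_index(wcss):
    """Return the index of the point farthest below the straight line
    joining the first and last points of a decreasing WCSS curve."""
    n = len(wcss)
    first, last = wcss[0], wcss[-1]
    best_index, best_gap = 0, float("-inf")
    for i, value in enumerate(wcss):
        # Height of the endpoint-to-endpoint line at position i.
        line = first + (last - first) * i / (n - 1)
        gap = line - value
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index
```

If the WCSS list holds values for k = 1, 2, 3, ..., the chosen number of clusters is `elbow_index(wcss) + 1`.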

Prediction:

In data prediction, the essential directories are created first. The data validation, data insertion, and data pre-processing steps are the same as in the training section. The KMeans model created during training is loaded, and clusters for the pre-processed prediction data are predicted. Based on the cluster number, the respective model is loaded and used to predict the data for that cluster. Once predictions are made for all the clusters, the predictions along with the wafer names are saved in a CSV file at a given location.
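The routing step can be pictured with the following sketch. A list of centroids stands in for the saved KMeans model (which assigns each row to its nearest centroid), and plain callables stand in for the saved per-cluster models; all names here are hypothetical.

```python
def nearest_cluster(row, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
    return distances.index(min(distances))

def predict_by_cluster(names, rows, centroids, models):
    """Return (wafer name, prediction) pairs, using one model per cluster.

    `models` maps a cluster index to a callable that takes a row and
    returns +1 or -1.
    """
    results = []
    for name, row in zip(names, rows):
        cluster = nearest_cluster(row, centroids)
        results.append((name, models[cluster](row)))
    return results
```

The resulting pairs can then be written out with the `csv` module to produce the final prediction file.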

Deployment:

We will be deploying the model to Heroku Cloud.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiasts | Machine Learning | Deep Learning | Advanced Computer Vision.