Decision Tree Regression algorithm implemented in Python from scratch.

Last update: Dec 22, 2021

Overview

Decision_Tree_Regression

I implemented the decision tree regression algorithm in Python. Unlike ordinary linear regression, which fits a single straight line, this algorithm is meant for datasets that follow a curve. It builds a decision tree recursively: in each iteration the training dataset is split into two parts and a regression line is fit to each part. The split is made at the point that minimizes the Mean Squared Error (MSE).
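The split search described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function names (`fit_line`, `sse`, `best_split`) and the `min_samples` argument are assumptions, and the data are assumed to be one-dimensional and already sorted by x.

```python
def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # a single point or identical x values: use a flat line
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def sse(xs, ys, line):
    """Sum of squared errors of a fitted line on the given points."""
    slope, intercept = line
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def best_split(xs, ys, min_samples=2):
    """Return (index, mse) for the split with the lowest overall MSE.

    Points [0:index] go to the left part and [index:] to the right part.
    Returns None when the data are too small to split.
    """
    n = len(xs)
    best = None
    for i in range(min_samples, n - min_samples + 1):
        left = sse(xs[:i], ys[:i], fit_line(xs[:i], ys[:i]))
        right = sse(xs[i:], ys[i:], fit_line(xs[i:], ys[i:]))
        mse = (left + right) / n
        if best is None or mse < best[1]:
            best = (i, mse)
    return best


# Usage: two straight pieces with a jump between x = 3 and x = 4.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 2, 3, 10, 9, 8, 7]
best = best_split(xs, ys)
print(best)
```

Trying every candidate index keeps the sketch short; it refits both lines for each candidate, which is fine for small datasets but slow for large ones.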

The number of regression lines is key: too many lines overfit the data and too few underfit it. The algorithm has two hyperparameters, the maximum depth of the decision tree and the minimum number of samples in a single split. Both should be tested and optimized for each dataset.
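To show how the two hyperparameters limit the number of regression lines, here is a hedged sketch of a recursive tree builder. The names `build_tree`, `predict` and `count_leaves` are hypothetical, and `min_samples` is interpreted here as the minimum number of points on each side of a split (at least 1), which may differ from the project's own definition.

```python
import math


def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # a single point or identical x values: use a flat line
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def sse(xs, ys, line):
    """Sum of squared errors of a fitted line on the given points."""
    slope, intercept = line
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def build_tree(xs, ys, max_depth, min_samples, depth=0):
    """Recursively split data sorted by x; each leaf stores one fitted line."""
    line = fit_line(xs, ys)
    best_i, best_err = None, sse(xs, ys, line)
    if depth < max_depth:
        for i in range(min_samples, len(xs) - min_samples + 1):
            err = (sse(xs[:i], ys[:i], fit_line(xs[:i], ys[:i]))
                   + sse(xs[i:], ys[i:], fit_line(xs[i:], ys[i:])))
            if err < best_err:  # split only if it lowers the error
                best_i, best_err = i, err
    if best_i is None:
        return {"line": line}
    return {
        "threshold": xs[best_i],
        "left": build_tree(xs[:best_i], ys[:best_i],
                           max_depth, min_samples, depth + 1),
        "right": build_tree(xs[best_i:], ys[best_i:],
                            max_depth, min_samples, depth + 1),
    }


def predict(tree, x):
    """Walk down to the leaf that covers x and evaluate its line."""
    while "line" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    slope, intercept = tree["line"]
    return slope * x + intercept


def count_leaves(tree):
    """Number of regression lines in the tree."""
    if "line" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


# Usage: one period of a noise-free sine wave, fit at two depths.
xs = [i / 10 for i in range(63)]
ys = [math.sin(x) for x in xs]
shallow = build_tree(xs, ys, max_depth=1, min_samples=5)
deep = build_tree(xs, ys, max_depth=4, min_samples=5)
print(count_leaves(shallow), count_leaves(deep))
```

A tree of depth d holds at most 2^d lines, so the maximum depth caps the model's flexibility, while the minimum sample count stops a line from being fit to only a handful of points.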

Creating Datasets

Instead of using datasets downloaded from the internet, I decided to create my own datasets for this project. I generated 4 datasets to test my algorithm: Noisy Sinusoidal Signal, Noisy Second Degree Polynomial, Noisy Linear Line and Noisy Upside Down Triangle Signal. The program generates these datasets when it is run and saves them so that the results can be reproduced. To generate new datasets, simply delete the first dataset file, dataset0.csv. You can also use your own datasets by placing them in the same directory as the Python project.
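The generate-once, reuse-afterwards behaviour could look something like the sketch below. Only the file name dataset0.csv comes from the text above; the other file names (dataset1.csv to dataset3.csv), the function names, the noise level, the x range and the exact shape formulas are all assumptions made for illustration. In particular, the upside down triangle is taken here to be a V shape.

```python
import csv
import math
import os
import random
import tempfile

NUM_DATASETS = 4  # sinusoid, second-degree polynomial, line, triangle


def make_datasets(n=200, noise=0.1, seed=None):
    """Return four lists of (x, y) points with Gaussian noise added to y."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]  # evenly spaced in [0, 1]
    shapes = [
        lambda x: math.sin(2 * math.pi * x),  # sinusoidal signal
        lambda x: x ** 2,                     # second-degree polynomial
        lambda x: x,                          # straight line
        lambda x: abs(2 * x - 1),             # V shape (assumed triangle)
    ]
    return [[(x, f(x) + rng.gauss(0, noise)) for x in xs] for f in shapes]


def load_or_create(directory):
    """Load saved datasets, or generate and save them if dataset0.csv is missing."""
    paths = [os.path.join(directory, f"dataset{k}.csv")
             for k in range(NUM_DATASETS)]
    if os.path.exists(paths[0]):
        datasets = []
        for path in paths:
            with open(path, newline="") as fh:
                datasets.append([(float(x), float(y))
                                 for x, y in csv.reader(fh)])
        return datasets
    datasets = make_datasets()
    for path, data in zip(paths, datasets):
        with open(path, "w", newline="") as fh:
            csv.writer(fh).writerows(data)
    return datasets


# Usage: a temporary folder stands in for the project directory.
with tempfile.TemporaryDirectory() as folder:
    first = load_or_create(folder)   # generates and saves the files
    second = load_or_create(folder)  # reads the saved files back
    os.remove(os.path.join(folder, "dataset0.csv"))
    third = load_or_create(folder)   # dataset0.csv is gone: regenerate
```

Checking only for dataset0.csv mirrors the instruction to delete that one file in order to trigger regeneration.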

Plotting Results

The plots show the results for the sinusoidal signal and the upside down triangle for various hyperparameters. Colored points represent the splits in the training dataset, black lines represent the linear regression line for each split, and the larger gray points represent the test dataset.

For these datasets, the best value observed for the maximum depth is 4.
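A value like this is typically found by fitting the model at several depths and comparing the error on held-out test points. The sketch below illustrates that kind of sweep on a freshly generated noisy sinusoid. It is not the project's evaluation code: the names `fit_segments`, `predict` and `mse`, the 200/100 train/test split, the noise level and the `min_samples=5` default are assumptions, and the depth it selects depends on the random data.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def sse(xs, ys, line):
    slope, intercept = line
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def fit_segments(xs, ys, max_depth, min_samples=5):
    """Return [(upper_x, line), ...]: each line applies to x below upper_x."""
    line = fit_line(xs, ys)
    best_i, best_err = None, sse(xs, ys, line)
    if max_depth > 0:
        for i in range(min_samples, len(xs) - min_samples + 1):
            err = (sse(xs[:i], ys[:i], fit_line(xs[:i], ys[:i]))
                   + sse(xs[i:], ys[i:], fit_line(xs[i:], ys[i:])))
            if err < best_err:
                best_i, best_err = i, err
    if best_i is None:
        return [(math.inf, line)]
    left = fit_segments(xs[:best_i], ys[:best_i], max_depth - 1, min_samples)
    right = fit_segments(xs[best_i:], ys[best_i:], max_depth - 1, min_samples)
    # The last left piece ends at the split point instead of at infinity.
    left[-1] = (xs[best_i], left[-1][1])
    return left + right


def predict(segments, x):
    for upper, (slope, intercept) in segments:
        if x < upper:
            return slope * x + intercept


def mse(segments, points):
    return sum((y - predict(segments, x)) ** 2 for x, y in points) / len(points)


# Noisy sinusoid, shuffled into 200 training and 100 test points.
rng = random.Random(0)
points = [(i / 299, math.sin(2 * math.pi * i / 299) + rng.gauss(0, 0.2))
          for i in range(300)]
rng.shuffle(points)
train, test = sorted(points[:200]), points[200:]
train_x = [x for x, _ in train]
train_y = [y for _, y in train]

# Test-set MSE for each candidate maximum depth.
scores = {d: mse(fit_segments(train_x, train_y, d), test) for d in range(8)}
best_depth = min(scores, key=scores.get)
print(best_depth, scores[best_depth])
```

Scoring on points the model has not seen is what exposes overfitting: training error keeps falling as depth grows, while test error usually stops improving at some depth.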
