MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees.

Overview

Multi-objective Optimized GBT(MooGBT)

MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees. MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. The constraints are defined as upper bounds on sub-objective loss function. MooGBT uses a Augmented Lagrangian(AL) based constrained optimization framework with Gradient Boosted Trees, to optimize for multiple objectives.

With AL, we introduce dual variables in Boosting. The dual variables are iteratively optimized and fit within the Boosting iterations. The Boosting objective function is updated with the AL terms and the gradient is readily derived using the GBT gradients. With the gradient and updates of dual variables, we solve the optimization problem by jointly iterating AL and Boosting steps.

This library is motivated by work done in the paper Multi-objective Relevance Ranking, which introduces an Augmented Lagrangian based method to incorporate multiple objectives (MO) in LambdaMART, which is a GBT based search ranking algorithm.

We have modified the scikit-learn GBT implementation [3] to support multi-objective optimization.

Highlights -

  • follows the scikit-learn API conventions
  • supports all hyperparameters present in scikit-learn GBT
  • supports optimization for more than 1 sub-objectives

  • Current support -

  • MooGBTClassifier - "binomial deviance" loss function, for primary and sub-objectives represented as binary variables
  • MooGBTRegressor - "least squares" loss function, for primary and sub-objectives represented as continuous variables

  • Installation

    Moo-GBT can be installed from PyPI

    pip3 install moo-gbt

    Usage

    from multiobjective_gbt import MooGBTClassifier
    
    mu = 100
    b = 0.7 # upper bound on sub-objective cost
    
    constrained_gbt = MooGBTClassifier(
    				loss='deviance',
    				n_estimators=100,
    				constraints=[{"mu":mu, "b":b}], # One Constraint
    				random_state=2021
    )
    constrained_gbt.fit(X_train, y_train)

    Here y_train contains 2 columns, the first column should be the primary objective. The following columns are all the sub-objectives for which constraints have been specified(in the same order).


    Usage Steps

    1. Run unconstrained GBT on Primary Objective. Unconstrained GBT is just the GBTClassifer/GBTRegressor by scikit-learn
    2. Calculate the loss function value for Primary Objective and sub-objective(s)
      • For MooGBTClassifier calculate Log Loss between predicted probability and sub-objective label(s)
      • For MooGBTRegressor calculate mean squared error between predicted value and sub-objective label(s)
    3. Set the value of hyperparamter b, less than the calculated cost in the previous step and run MooGBTClassifer/MooGBTRegressor with this b. The lower the value of b, the more the sub-objective will be optimized

    Example with multiple binary objectives

    import pandas as pd
    import numpy as np
    import seaborn as sns
    
    from multiobjective_gbt import MooGBTClassifier

    We'll use a publicly available dataset - available here

    We define a multi-objective problem on the dataset, with the primary objective as the column "is_booking" and sub-objective as the column "is_package". Both these variables are binary.

    # Preprocessing Data
    train_data = pd.read_csv('examples/expedia-data/expedia-hotel-recommendations/train_data_sample.csv')
    
    po = 'is_booking' # primary objective
    so = 'is_package' # sub-objective
    
    features =  list(train_data.columns)
    features.remove(po)
    outcome_flag =  po
    
    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(
    					train_data[features],
    					train_data[outcome_flag],
    					test_size=0.2,
    					stratify=train_data[[po, so]],
    					random_state=2021
    )
    
    # Creating y_train_, y_test_ with 2 labels
    y_train_ = pd.DataFrame()
    y_train_[po] = y_train
    y_train_[so] = X_train[so]
    
    y_test_ = pd.DataFrame()
    y_test_[po] = y_test
    y_test_[so] = X_test[so]

    MooGBTClassifier without the constraint parameter, works as the standard scikit-learn GBT classifier.

    unconstrained_gbt = MooGBTClassifier(
    				loss='deviance',
    				n_estimators=100,
    				random_state=2021
    )
    
    unconstrained_gbt.fit(X_train, y_train)

    Get train and test sub-objective costs for unconstrained model.

    def get_binomial_deviance_cost(pred, y):
    	return -np.mean(y * np.log(pred) + (1-y) * np.log(1-pred))
    
    pred_train = unconstrained_gbt.predict_proba(X_train)[:,1]
    pred_test = unconstrained_gbt.predict_proba(X_test)[:,1]
    
    # get sub-objective costs
    so_train_cost = get_binomial_deviance_cost(pred_train, X_train[so])
    so_test_cost = get_binomial_deviance_cost(pred_test, X_test[so])
    
    print (f"""
    Sub-objective cost train - {so_train_cost},
    Sub-objective cost test  - {so_test_cost}
    """)
    Sub-objective cost train - 0.9114,
    Sub-objective cost test  - 0.9145
    

    Constraint is specified as an upper bound on the sub-objective cost. In the unconstrained model, we see the cost of our sub-objective to be ~0.9. So setting upper bounds below 0.9 would optimise the sub-objective.

    b = 0.65 # upper bound on cost
    mu = 100
    constrained_gbt = MooGBTClassifier(
    				loss='deviance',
    				n_estimators=100,
    				constraints=[{"mu":mu, "b":b}], # One Constraint
    				random_state=2021
    )
    
    constrained_gbt.fit(X_train, y_train_)

    From the constrained model, we achieve more than 100% gain in AuROC for the sub-objective while the loss in primary objective AuROC is kept within 6%. The entire study on this dataset can be found in the example notebook.

    Looking at MooGBT primary and sub-objective losses -

    To get raw values of loss functions wrt boosting iteration,

    # return a Pandas dataframe with loss values of objectives wrt boosting iteration
    losses = constrained_gbt.loss_.get_losses()
    losses.head()

    Similarly, you can also look at dual variable(alpha) values for sub-objective(s),

    To get raw values of alphas wrt boosting iteration,

    constrained_gbt.loss_.get_alphas()

    These losses can be used to look at the MooGBT Learning process.

    sns.lineplot(data=losses, x='n_estimators', y='primary_objective', label='primary objective')
    sns.lineplot(data=losses, x='n_estimators', y='sub_objective_1', label='subobjective')
    
    plt.xlabel("# estimators(trees)")
    plt.ylabel("Cost")
    plt.legend(loc = "upper right")

    sns.lineplot(data=losses, x='n_estimators', y='primary_objective', label='primary objective')

    Choosing the right upper bound constraint b and mu value

    The upper bound should be defined based on a acceptable % loss in the primary objective evaluation metric. For stricter upper bounds, this loss would be greater as MooGBT will optimize for the sub-objective more.

    Below table summarizes the effect of the upper bound value on the model performance for primary and sub-objective(s) for the above example.

    %gain specifies the percentage increase in AUROC for the constrained MooGBT model from an uncostrained GBT model.

    b Primary Objective - %gain Sub-Objective - %gain
    0.9 -0.7058 4.805
    0.8 -1.735 40.08
    0.7 -2.7852 62.7144
    0.65 -5.8242 113.9427
    0.6 -9.9137 159.8931

    In general, across our experiments we have found that lower values of mu optimize on the primary objective better while satisfying the sub-objective constraints given enough boosting iterations(n_estimators).

    The below table summarizes the results of varying mu values keeping the upper bound same(b=0.6).

    b mu Primary Objective - %gain Sub-objective - %gain
    0.6 1000 -20.6569 238.1388
    0.6 100 -13.3769 197.8186
    0.6 10 -9.9137 159.8931
    0.6 5 -8.643 146.4171

    MooGBT Learning Process

    MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. The constraints are defined as upper bounds on sub-objective loss function.

    MooGBT differs from a standard GBT in the loss function it optimizes the primary objective C1 and the sub-objectives using the Augmented Lagrangian(AL) constrained optimization approach.

    where α = [α1, α2, α3…..] is a vector of dual variables. The Lagrangian is solved by minimizing with respect to the primal variables "s" and maximizing with respect to the dual variables α. Augmented Lagrangian iteratively solves the constraint optimization. Since AL is an iterative approach we integerate it with the boosting iterations of GBT for updating the dual variable α.

    Alpha(α) update -

    At an iteration k, if the constraint t is not satisfied, i.e., Ct(s) > bt, we have  αtk > αtk-1. Otherwise, if the constraint is met, the dual variable α is made 0.

    Public contents

    • _gb.py: contains the MooGBTClassifier and MooGBTRegressor classes. Contains implementation of the fit and predict function. Extended implementation from _gb.py from scikit-learn.

    • _gb_losses.py: contains BinomialDeviance loss function class, LeastSquares loss function class. Extended implementation from _gb_losses.py from scikit-learn.

    More examples

    The examples directory contains several illustrations of how one can use this library:

    References - 

    [1] Multi-objective Ranking via Constrained Optimization - https://arxiv.org/pdf/2002.05753.pdf
    [2] Multi-objective Relevance Ranking - https://sigir-ecom.github.io/ecom2019/ecom19Papers/paper30.pdf
    [3] Scikit-learn GBT Implementation - GBTClassifier and GBTRegressor

    Owner
    Swiggy
    Swiggy
    healthy and lesion models for learning based on the joint estimation of stochasticity and volatility

    health-lesion-stovol healthy and lesion models for learning based on the joint estimation of stochasticity and volatility Reference please cite this p

    5 Nov 01, 2022
    Simulate & classify transient absorption spectroscopy (TAS) spectral features for bulk semiconducting materials (Post-DFT)

    PyTASER PyTASER is a Python (3.9+) library and set of command-line tools for classifying spectral features in bulk materials, post-DFT. The goal of th

    Materials Design Group 4 Dec 27, 2022
    A python library for easy manipulation and forecasting of time series.

    Time Series Made Easy in Python darts is a python library for easy manipulation and forecasting of time series. It contains a variety of models, from

    Unit8 5.2k Jan 04, 2023
    Tutorial for Decision Threshold In Machine Learning.

    Decision-Threshold-ML Tutorial for improve skills: 'Decision Threshold In Machine Learning' (from GeeksforGeeks) by Marcus Mariano For more informatio

    0 Jan 20, 2022
    A collection of Scikit-Learn compatible time series transformers and tools.

    tsfeast A collection of Scikit-Learn compatible time series transformers and tools. Installation Create a virtual environment and install: From PyPi p

    Chris Santiago 0 Mar 30, 2022
    Ml based project which uses regression technique to predict the price.

    Price-Predictor Ml based project which uses regression technique to predict the price. I have used various regression models and finds the model with

    Garvit Verma 1 Jul 09, 2022
    Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

    Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

    Amplo 10 May 15, 2022
    Probabilistic time series modeling in Python

    GluonTS - Probabilistic Time Series Modeling in Python GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet (

    Amazon Web Services - Labs 3.3k Jan 03, 2023
    Dive into Machine Learning

    Dive into Machine Learning Hi there! You might find this guide helpful if: You know Python or you're learning it 🐍 You're new to Machine Learning You

    Michael Floering 11.1k Jan 03, 2023
    PyPOTS - A Python Toolbox for Data Mining on Partially-Observed Time Series

    A python toolbox/library for data mining on partially-observed time series, supporting tasks of forecasting/imputation/classification/clustering on incomplete multivariate time series with missing va

    Wenjie Du 179 Dec 31, 2022
    Steganography is the art of hiding the fact that communication is taking place, by hiding information in other information.

    Steganography is the art of hiding the fact that communication is taking place, by hiding information in other information.

    Priyansh Sharma 7 Nov 09, 2022
    The easy way to combine mlflow, hydra and optuna into one machine learning pipeline.

    mlflow_hydra_optuna_the_easy_way The easy way to combine mlflow, hydra and optuna into one machine learning pipeline. Objective TODO Usage 1. build do

    shibuiwilliam 9 Sep 09, 2022
    onelearn: Online learning in Python

    onelearn: Online learning in Python Documentation | Reproduce experiments | onelearn stands for ONE-shot LEARNning. It is a small python package for o

    15 Nov 06, 2022
    Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

    Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

    Artsem Zhyvalkouski 64 Nov 30, 2022
    Built various Machine Learning algorithms (Logistic Regression, Random Forest, KNN, Gradient Boosting and XGBoost. etc)

    Built various Machine Learning algorithms (Logistic Regression, Random Forest, KNN, Gradient Boosting and XGBoost. etc). Structured a custom ensemble model and a neural network. Found a outperformed

    Chris Yuan 1 Feb 06, 2022
    决策树分类与回归模型的实现和可视化

    DecisionTree 决策树分类与回归模型,以及可视化 DecisionTree ID3 C4.5 CART 分类 回归 决策树绘制 分类树 回归树 调参 剪枝 ID3 ID3决策树是最朴素的决策树分类器: 无剪枝 只支持离散属性 采用信息增益准则 在data.py中,我们记录了一个小的西瓜数据

    Welt Xing 10 Oct 22, 2022
    Create large-scale ML-driven multiscale simulation ensembles to study the interactions

    MuMMI RAS v0.1 Released: Nov 16, 2021 MuMMI RAS is the application component of the MuMMI framework developed to create large-scale ML-driven multisca

    4 Feb 16, 2022
    A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

    Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

    Stox 31 Dec 16, 2022
    K-Means clusternig example with Python and Scikit-learn

    Unsupervised-Machine-Learning Flat Clustering K-Means clusternig example with Python and Scikit-learn Flat clustering Clustering algorithms group a se

    Emin 1 Dec 13, 2021
    A flexible CTF contest platform for coming PKU GeekGame events

    Project Guiding Star: the Backend A flexible CTF contest platform for coming PKU GeekGame events Still in early development Highlights Not configurabl

    PKU GeekGame 14 Dec 15, 2022