Keras implementation of AdaBound

Overview

AdaBound for Keras

Keras port of AdaBound Optimizer for PyTorch, from the paper Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Usage

Add the adabound.py script to your project, and import it. Can be a dropin replacement for Adam Optimizer.

Also supports AMSBound variant of the above, equivalent to AMSGrad from Adam.

from adabound import AdaBound

optm = AdaBound(lr=1e-03,
                final_lr=0.1,
                gamma=1e-03,
                weight_decay=0.,
                amsbound=False)

Results

With a wide ResNet 34 and horizontal flips data augmentation, and 100 epochs of training with batchsize 128, it hits 92.16% (called v1).

Weights are available inside the Releases tab

NOTE

  • The smaller ResNet 20 models have been removed as they did not perform as expected and were depending on a flaw during the initial implementation. The ResNet 32 shows the actual performance of this optimizer.

With a small ResNet 20 and width + height data + horizontal flips data augmentation, and 100 epochs of training with batchsize 1024, it hits 89.5% (called v1).

On a small ResNet 20 with only width and height data augmentations, with batchsize 1024 trained for 100 epochs, the model gets close to 86% on the test set (called v3 below).

Train Set Accuracy

Train Set Loss

Test Set Accuracy

Test Set Loss

Requirements

  • Keras 2.2.4+ & Tensorflow 1.12+ (Only supports TF backend for now).
  • Numpy
Comments
  • suggestion: allow to train x2 or x3 bigger networks on same vram with TF backend

    suggestion: allow to train x2 or x3 bigger networks on same vram with TF backend

    same as my PR https://github.com/keras-team/keras-contrib/pull/478 works only with TF backend

    class AdaBound(Optimizer):
        """AdaBound optimizer.
        Default parameters follow those provided in the original paper.
        # Arguments
            lr: float >= 0. Learning rate.
            final_lr: float >= 0. Final learning rate.
            beta_1: float, 0 < beta < 1. Generally close to 1.
            beta_2: float, 0 < beta < 1. Generally close to 1.
            gamma: float >= 0. Convergence speed of the bound function.
            epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
            decay: float >= 0. Learning rate decay over each update.
            weight_decay: Weight decay weight.
            amsbound: boolean. Whether to apply the AMSBound variant of this
                algorithm.
            tf_cpu_mode: only for tensorflow backend
                  0 - default, no changes.
                  1 - allows to train x2 bigger network on same VRAM consuming RAM
                  2 - allows to train x3 bigger network on same VRAM consuming RAM*2
                      and CPU power.
        # References
            - [Adaptive Gradient Methods with Dynamic Bound of Learning Rate]
              (https://openreview.net/forum?id=Bkg3g2R9FX)
            - [Adam - A Method for Stochastic Optimization]
              (https://arxiv.org/abs/1412.6980v8)
            - [On the Convergence of Adam and Beyond]
              (https://openreview.net/forum?id=ryQu7f-RZ)
        """
    
        def __init__(self, lr=0.001, final_lr=0.1, beta_1=0.9, beta_2=0.999, gamma=1e-3,
                     epsilon=None, decay=0., amsbound=False, weight_decay=0.0, tf_cpu_mode=0, **kwargs):
            super(AdaBound, self).__init__(**kwargs)
    
            if not 0. <= gamma <= 1.:
                raise ValueError("Invalid `gamma` parameter. Must lie in [0, 1] range.")
    
            with K.name_scope(self.__class__.__name__):
                self.iterations = K.variable(0, dtype='int64', name='iterations')
                self.lr = K.variable(lr, name='lr')
                self.beta_1 = K.variable(beta_1, name='beta_1')
                self.beta_2 = K.variable(beta_2, name='beta_2')
                self.decay = K.variable(decay, name='decay')
    
            self.final_lr = final_lr
            self.gamma = gamma
    
            if epsilon is None:
                epsilon = K.epsilon()
            self.epsilon = epsilon
            self.initial_decay = decay
            self.amsbound = amsbound
    
            self.weight_decay = float(weight_decay)
            self.base_lr = float(lr)
            self.tf_cpu_mode = tf_cpu_mode
    
        def get_updates(self, loss, params):
            grads = self.get_gradients(loss, params)
            self.updates = [K.update_add(self.iterations, 1)]
    
            lr = self.lr
            if self.initial_decay > 0:
                lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                          K.dtype(self.decay))))
    
            t = K.cast(self.iterations, K.floatx()) + 1
    
            # Applies bounds on actual learning rate
            step_size = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                              (1. - K.pow(self.beta_1, t)))
    
            final_lr = self.final_lr * lr / self.base_lr
            lower_bound = final_lr * (1. - 1. / (self.gamma * t + 1.))
            upper_bound = final_lr * (1. + 1. / (self.gamma * t))
    
            e = K.tf.device("/cpu:0") if self.tf_cpu_mode > 0 else None
            if e: e.__enter__()
            ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            if self.amsbound:
                vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            else:
                vhats = [K.zeros(1) for _ in params]
            if e: e.__exit__(None, None, None)
            
            self.weights = [self.iterations] + ms + vs + vhats
    
            for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
                # apply weight decay
                if self.weight_decay != 0.:
                    g += self.weight_decay * K.stop_gradient(p)
    
                e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
                if e: e.__enter__()                    
                m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
                v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
                if self.amsbound:
                    vhat_t = K.maximum(vhat, v_t)
                    self.updates.append(K.update(vhat, vhat_t))
                if e: e.__exit__(None, None, None)
                
                if self.amsbound:
                    denom = (K.sqrt(vhat_t) + self.epsilon)
                else:
                    denom = (K.sqrt(v_t) + self.epsilon)                        
    
                # Compute the bounds
                step_size_p = step_size * K.ones_like(denom)
                step_size_p_bound = step_size_p / denom
                bounded_lr_t = m_t * K.minimum(K.maximum(step_size_p_bound,
                                                         lower_bound), upper_bound)
    
                p_t = p - bounded_lr_t
    
                self.updates.append(K.update(m, m_t))
                self.updates.append(K.update(v, v_t))
                new_p = p_t
    
                # Apply constraints.
                if getattr(p, 'constraint', None) is not None:
                    new_p = p.constraint(new_p)
    
                self.updates.append(K.update(p, new_p))
            return self.updates
    
        def get_config(self):
            config = {'lr': float(K.get_value(self.lr)),
                      'final_lr': float(self.final_lr),
                      'beta_1': float(K.get_value(self.beta_1)),
                      'beta_2': float(K.get_value(self.beta_2)),
                      'gamma': float(self.gamma),
                      'decay': float(K.get_value(self.decay)),
                      'epsilon': self.epsilon,
                      'weight_decay': self.weight_decay,
                      'amsbound': self.amsbound}
            base_config = super(AdaBound, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))
    
    opened by iperov 13
  • AdaBound.iterations

    AdaBound.iterations

    this param is not saved.

    I looked at official pytorch implementation from original paper. https://github.com/Luolc/AdaBound/blob/master/adabound/adabound.py

    it has

    # State initialization
    if len(state) == 0:
        state['step'] = 0
    

    state is saved with the optimizer.

    also it has

    # Exponential moving average of gradient values
    state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of squared gradient values
    state['exp_avg_sq'] = torch.zeros_like(p.data)
    

    these values should also be saved

    So your keras implementation is wrong.

    opened by iperov 10
  • Using SGDM with lr=0.1 leads to not learning

    Using SGDM with lr=0.1 leads to not learning

    Thanks for sharing your keras version of adabound and I found that when changing optimizer from adabound to SGDM (lr=0.1), the resnet doesn't learn at all like the fig below. image

    I remember that in the original paper it uses SGDM (lr=0.1) for comparisons and I'm wondering how this could be.

    opened by syorami 10
  • clip by value

    clip by value

    https://github.com/CyberZHG/keras-adabound/blob/master/keras_adabound/optimizers.py

    K.minimum(K.maximum(step, lower_bound), upper_bound)

    will not work?

    opened by iperov 2
  • Unexpected keyword argument passed to optimizer: amsbound

    Unexpected keyword argument passed to optimizer: amsbound

    I installed with pip install keras-adabound imported with: from keras_adabound import AdaBound and declared the optimizer as: opt = AdaBound(lr=1e-03,final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False) Then, I'm getting the error: TypeError: Unexpected keyword argument passed to optimizer: amsbound

    changing the pip install to adabound (instead of keras-adabound) and the import to from adabound import AdaBound, the keyword amsbound is recognized, but then I get the error: TypeError: __init__() missing 1 required positional argument: 'params'

    Am I mixing something up here or missing something?

    opened by stabilus 0
  • Unclear how to import and use tf.keras version

    Unclear how to import and use tf.keras version

    I have downloaded the files and placed them in a folder in the site packages for my virtual environment but I can't get this to work. I have added the folder path to sys.path and verified it is listed. I'm running Tensorflow 2.1.0. What am I doing wrong?

    opened by mnweaver1 0
  • about lr

    about lr

    Thanks for a good optimizer According to usage optm = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False) Does the learning rate gradually increase by the number of steps?


    final lr is described as Final learning rate. but it actually is leaning rate relative to base lr and current klearning rate? https://github.com/titu1994/keras-adabound/blob/5ce819b6ca1cd95e32d62e268bd2e0c99c069fe8/adabound.py#L72

    opened by tanakataiki 1
Releases(0.1)
Owner
Somshubra Majumdar
Interested in Machine Learning, Deep Learning and Data Science in general
Somshubra Majumdar
Repository for MeshTalk supplemental material and code once the (already approved) 16 GHS captures our lab will make publicly available are released.

meshtalk This repository contains code to run MeshTalk for face animation from audio. If you use MeshTalk, please cite @inproceedings{richard2021mesht

Meta Research 221 Jan 06, 2023
HairCLIP: Design Your Hair by Text and Reference Image

Overview This repository hosts the official PyTorch implementation of the paper: "HairCLIP: Design Your Hair by Text and Reference Image". Our single

322 Jan 06, 2023
This reposityory contains the PyTorch implementation of our paper "Generative Dynamic Patch Attack".

Generative Dynamic Patch Attack This reposityory contains the PyTorch implementation of our paper "Generative Dynamic Patch Attack". Requirements PyTo

Xiang Li 8 Nov 17, 2022
Official page of Struct-MDC (RA-L'22 with IROS'22 option); Depth completion from Visual-SLAM using point & line features

Struct-MDC (click the above buttons for redirection!) Official page of "Struct-MDC: Mesh-Refined Unsupervised Depth Completion Leveraging Structural R

Urban Robotics Lab. @ KAIST 37 Dec 22, 2022
This is the code for "HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields".

HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields This is the code for "HyperNeRF: A Higher-Dimensional

Google 702 Jan 02, 2023
SwinIR: Image Restoration Using Swin Transformer

SwinIR: Image Restoration Using Swin Transformer This repository is the official PyTorch implementation of SwinIR: Image Restoration Using Shifted Win

Jingyun Liang 2.4k Jan 08, 2023
AgML is a comprehensive library for agricultural machine learning

AgML is a comprehensive library for agricultural machine learning. Currently, AgML provides access to a wealth of public agricultural datasets for common agricultural deep learning tasks.

Plant AI and Biophysics Lab 1 Jul 07, 2022
Non-Vacuous Generalisation Bounds for Shallow Neural Networks

This package requires jax, tensorflow, and numpy. Either tensorflow or scikit-learn can be used for loading data. To run in a nix-shell with required

Felix Biggs 0 Feb 04, 2022
SAS output to EXCEL converter for Cornell/MIT Language and acquisition lab

CORNELLSASLAB SAS output to EXCEL converter for Cornell/MIT Language and acquisition lab Instructions: This python code can be used to convert SAS out

2 Jan 26, 2022
MoveNetを用いたPythonでの姿勢推定のデモ

MoveNet-Python-Example MoveNetのPythonでの動作サンプルです。 ONNXに変換したモデルも同梱しています。変換自体を試したい方はMoveNet_tf2onnx.ipynbを使用ください。 2021/08/24時点でTensorFlow Hubで提供されている以下モデ

KazuhitoTakahashi 38 Dec 17, 2022
Official PyTorch Implementation of paper "Deep 3D Mask Volume for View Synthesis of Dynamic Scenes", ICCV 2021.

Deep 3D Mask Volume for View Synthesis of Dynamic Scenes Official PyTorch Implementation of paper "Deep 3D Mask Volume for View Synthesis of Dynamic S

Ken Lin 17 Oct 12, 2022
Supplemental Code for "ImpressionNet :A Multi view Approach to Predict Socio Facial Impressions"

Supplemental Code for "ImpressionNet :A Multi view Approach to Predict Socio Facial Impressions" Environment requirement This code is based on Python

Rohan Kumar Gupta 1 Dec 19, 2021
The LaTeX and Python code for generating the paper, experiments' results and visualizations reported in each paper is available (whenever possible) in the paper's directory

This repository contains the software implementation of most algorithms used or developed in my research. The LaTeX and Python code for generating the

João Fonseca 3 Jan 03, 2023
Temporal-Relational CrossTransformers

Temporal-Relational Cross-Transformers (TRX) This repo contains code for the method introduced in the paper: Temporal-Relational CrossTransformers for

83 Dec 12, 2022
Learning High-Speed Flight in the Wild

Learning High-Speed Flight in the Wild This repo contains the code associated to the paper Learning Agile Flight in the Wild. For more information, pl

Robotics and Perception Group 391 Dec 29, 2022
10th place solution for Google Smartphone Decimeter Challenge at kaggle.

Under refactoring 10th place solution for Google Smartphone Decimeter Challenge at kaggle. Google Smartphone Decimeter Challenge Global Navigation Sat

12 Oct 25, 2022
Official implementation of NeurIPS 2021 paper "Contextual Similarity Aggregation with Self-attention for Visual Re-ranking"

CSA: Contextual Similarity Aggregation with Self-attention for Visual Re-ranking PyTorch training code for CSA (Contextual Similarity Aggregation). We

Hui Wu 19 Oct 21, 2022
Plover-tapey-tape: an alternative to Plover’s built-in paper tape

plover-tapey-tape plover-tapey-tape is an alternative to Plover’s built-in paper

7 May 29, 2022
MLOps will help you to understand how to build a Continuous Integration and Continuous Delivery pipeline for an ML/AI project.

page_type languages products description sample python azure azure-machine-learning-service azure-devops Code which demonstrates how to set up and ope

1 Nov 01, 2021
Audio Source Separation is the process of separating a mixture into isolated sounds from individual sources

Audio Source Separation is the process of separating a mixture into isolated sounds from individual sources (e.g. just the lead vocals).

Victor Basu 14 Nov 07, 2022