PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)

Overview

Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. Conformer combines convolutional neural networks and transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way, significantly outperforming previous Transformer- and CNN-based models and achieving state-of-the-art accuracy.

This repository contains only the model code, but you can train a Conformer model using it.

Installation

This project recommends Python 3.7 or higher. We recommend creating a new virtual environment for this project (with virtualenv or conda).

Prerequisites

  • NumPy: pip install numpy (refer here if you have problems installing NumPy).
  • PyTorch: refer to the PyTorch website to install the version appropriate for your environment.

Install from source

Currently we only support installation from source using setuptools. Check out the source code and run the following command:

pip install -e .

Usage

import torch
import torch.nn as nn
from conformer import Conformer

batch_size, sequence_length, dim = 3, 12345, 80

cuda = torch.cuda.is_available()  
device = torch.device('cuda' if cuda else 'cpu')

inputs = torch.rand(batch_size, sequence_length, dim).to(device)
input_lengths = torch.IntTensor([12345, 12300, 12000])
targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                            [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                            [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)
target_lengths = torch.LongTensor([9, 8, 7])

model = nn.DataParallel(Conformer(num_classes=10, input_dim=dim, 
                                  encoder_dim=32, num_encoder_layers=3, 
                                  decoder_dim=32, device=device)).to(device)

# Forward propagate
outputs = model(inputs, input_lengths, targets, target_lengths)

# Recognize input speech
outputs = model.module.recognize(inputs, input_lengths)
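
If you are running on a single device, or if you hit the nn.DataParallel gather errors reported in the comments below, here is a minimal sketch of the same example without DataParallel (same Conformer arguments as above):

model = Conformer(num_classes=10, input_dim=dim,
                  encoder_dim=32, num_encoder_layers=3,
                  decoder_dim=32, device=device).to(device)

# Forward propagate
outputs = model(inputs, input_lengths, targets, target_lengths)

# Recognize input speech (no .module indirection needed without DataParallel)
outputs = model.recognize(inputs, input_lengths)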

Troubleshooting and Contributing

If you have any questions, bug reports, or feature requests, please open an issue on GitHub or contact [email protected].

I appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes and documentation improvements. For major contributions and new features, please discuss them with the collaborators in the corresponding issues.

Code Style

I follow PEP 8 for code style. The style of docstrings is especially important, as they are used to generate documentation.

Reference

  • Anmol Gulati et al., "Conformer: Convolution-augmented Transformer for Speech Recognition", INTERSPEECH 2020 (arXiv:2005.08100)

Author

  • Soohwan Kim (@sooftware), AI Research Engineer at Kakao Brain

Comments
  • Outputs differ from Targets

    @sooftware Can you kindly explain why the output lengths and the target lengths are so different? :/ (Also, the outputs contain negative floats.) An example is shown below.

    The outputs are of shape [32, 490, 16121] (where 16121 is the length of my vocab). What is the 490 dimension? Also, the outputs are probabilities, right?

    (outputs)
    tensor([[[-9.7001, -9.6490, -9.6463,  ..., -9.6936, -9.6430, -9.7431],
             [-9.6997, -9.6487, -9.6470,  ..., -9.6903, -9.6450, -9.7416],
             [-9.6999, -9.6477, -9.6479,  ..., -9.6898, -9.6453, -9.7417],
             ...,
             [-9.7006, -9.6449, -9.6513,  ..., -9.6889, -9.6477, -9.7405],
             [-9.7003, -9.6448, -9.6512,  ..., -9.6893, -9.6477, -9.7410],
             [-9.7007, -9.6453, -9.6513,  ..., -9.6892, -9.6466, -9.7403]],
    
            [[-9.6844, -9.6316, -9.6387,  ..., -9.6880, -9.6269, -9.7657],
             [-9.6834, -9.6299, -9.6404,  ..., -9.6872, -9.6283, -9.7642],
             [-9.6834, -9.6334, -9.6387,  ..., -9.6864, -9.6290, -9.7616],
             ...,
             [-9.6840, -9.6299, -9.6431,  ..., -9.6830, -9.6304, -9.7608],
             [-9.6838, -9.6297, -9.6428,  ..., -9.6834, -9.6303, -9.7609],
             [-9.6842, -9.6300, -9.6428,  ..., -9.6837, -9.6292, -9.7599]],
    
            [[-9.6966, -9.6386, -9.6458,  ..., -9.6896, -9.6375, -9.7521],
             [-9.6974, -9.6374, -9.6462,  ..., -9.6890, -9.6369, -9.7516],
             [-9.6974, -9.6405, -9.6456,  ..., -9.6876, -9.6378, -9.7491],
             ...,
             [-9.6978, -9.6336, -9.6493,  ..., -9.6851, -9.6419, -9.7490],
             [-9.6971, -9.6334, -9.6487,  ..., -9.6863, -9.6411, -9.7501],
             [-9.6972, -9.6338, -9.6489,  ..., -9.6867, -9.6396, -9.7497]],
    
            ...,
    
            [[-9.7005, -9.6249, -9.6588,  ..., -9.6762, -9.6557, -9.7555],
             [-9.7028, -9.6266, -9.6597,  ..., -9.6765, -9.6574, -9.7542],
             [-9.7016, -9.6240, -9.6605,  ..., -9.6761, -9.6576, -9.7553],
             ...,
             [-9.7036, -9.6237, -9.6624,  ..., -9.6728, -9.6590, -9.7524],
             [-9.7034, -9.6235, -9.6620,  ..., -9.6735, -9.6589, -9.7530],
             [-9.7038, -9.6240, -9.6622,  ..., -9.6738, -9.6582, -9.7524]],
    
            [[-9.7058, -9.6305, -9.6566,  ..., -9.6739, -9.6557, -9.7466],
             [-9.7061, -9.6273, -9.6569,  ..., -9.6774, -9.6564, -9.7499],
             [-9.7046, -9.6280, -9.6576,  ..., -9.6772, -9.6575, -9.7498],
             ...,
             [-9.7060, -9.6263, -9.6609,  ..., -9.6714, -9.6561, -9.7461],
             [-9.7055, -9.6262, -9.6605,  ..., -9.6723, -9.6558, -9.7469],
             [-9.7058, -9.6270, -9.6606,  ..., -9.6725, -9.6552, -9.7460]],
    
            [[-9.7101, -9.6312, -9.6570,  ..., -9.6736, -9.6551, -9.7420],
             [-9.7102, -9.6307, -9.6579,  ..., -9.6733, -9.6576, -9.7418],
             [-9.7078, -9.6281, -9.6598,  ..., -9.6704, -9.6596, -9.7418],
             ...,
             [-9.7084, -9.6288, -9.6605,  ..., -9.6706, -9.6588, -9.7399],
             [-9.7081, -9.6286, -9.6600,  ..., -9.6714, -9.6584, -9.7406],
             [-9.7085, -9.6291, -9.6601,  ..., -9.6717, -9.6577, -9.7398]]],
           device='cuda:0', grad_fn=<LogSoftmaxBackward0>)
    
    (output_lengths)
    tensor([312, 260, 315, 320, 317, 275, 308, 291, 272, 300, 262, 227, 303, 252,
            298, 256, 303, 251, 284, 259, 263, 286, 209, 262, 166, 194, 149, 212,
            121, 114, 110,  57], device='cuda:0', dtype=torch.int32)
    
    (target_lengths)
    tensor([57, 55, 54, 50, 49, 49, 49, 48, 48, 47, 43, 42, 41, 40, 40, 39, 37, 37,
            36, 36, 36, 35, 34, 33, 29, 27, 26, 24, 20, 19, 17,  9])
    
    

    I am using the following code for training and evaluation

    import torch
    import time
    import sys
    from google.colab import output
    import torch.nn as nn
    from conformer import Conformer
    import torchmetrics
    import random
    
    cuda = torch.cuda.is_available()  
    device = torch.device('cuda' if cuda else 'cpu')
    print('Device:', device)
    
    ################################################################################
    
    def train_model(model, optimizer, criterion, loader, metric):
      running_loss = 0.0
      for i, (audio,audio_len, translations, translation_len) in enumerate(loader):
        # with output.use_tags('some_outputs'):
        #   sys.stdout.write('Batch: '+ str(i+1)+'/290')
        #   sys.stdout.flush();
    
        #sorting inputs and targets to have targets in descending order based on len
        sorted_list,sorted_indices=torch.sort(translation_len,descending=True)
    
        sorted_audio=torch.zeros((32,201,1963),dtype=torch.float)
        sorted_audio_len=torch.zeros(32,dtype=torch.int)
        sorted_translations=torch.zeros((32,78),dtype=torch.int)
        sorted_translation_len=sorted_list
    
        for index, contentof in enumerate(translation_len):
          sorted_audio[index]=audio[sorted_indices[index]]
          sorted_audio_len[index]=audio_len[sorted_indices[index]]
          sorted_translations[index]=translations[sorted_indices[index]]
    
        #transpose inputs from (batch, dim, seq_len) to (batch, seq_len, dim)
        inputs=sorted_audio.to(device)
        inputs=torch.transpose(inputs, 1, 2)
        input_lengths=sorted_audio_len
        targets=sorted_translations.to(device)
        target_lengths=sorted_translation_len
    
        optimizer.zero_grad()
      
        # Forward propagate
        outputs, output_lengths = model(inputs, input_lengths)
        # print(outputs)
    
        # Calculate CTC Loss
        loss = criterion(outputs.transpose(0, 1), targets, output_lengths, target_lengths)
    
        loss.backward()
        optimizer.step()
    
        # print statistics
        running_loss += loss.item()
    
        output.clear(output_tags='some_outputs')
    
      loss_per_epoch=running_loss/(i+1)
      # print(f'Loss: {loss_per_epoch:.3f}')
    
      return loss_per_epoch
    
    ################################################################################
    
    def eval_model(model, optimizer, criterion, loader, metric):
      running_loss = 0.0
      wer_calc=0.0
      random_index_per_epoch= random.randint(0, 178)
    
      for i, (audio,audio_len, translations, translation_len) in enumerate(loader):
        # with output.use_tags('some_outputs'):
        #   sys.stdout.write('Batch: '+ str(i+1)+'/72')
        #   sys.stdout.flush();
    
        #sorting inputs and targets to have targets in descending order based on len
        sorted_list,sorted_indices=torch.sort(translation_len,descending=True)
    
        sorted_audio=torch.zeros((32,201,1963),dtype=torch.float)
        sorted_audio_len=torch.zeros(32,dtype=torch.int)
        sorted_translations=torch.zeros((32,78),dtype=torch.int)
        sorted_translation_len=sorted_list
    
        for index, contentof in enumerate(translation_len):
          sorted_audio[index]=audio[sorted_indices[index]]
          sorted_audio_len[index]=audio_len[sorted_indices[index]]
          sorted_translations[index]=translations[sorted_indices[index]]
    
        #transpose inputs from (batch, dim, seq_len) to (batch, seq_len, dim)
        inputs=sorted_audio.to(device)
        inputs=torch.transpose(inputs, 1, 2)
        input_lengths=sorted_audio_len
        targets=sorted_translations.to(device)
        target_lengths=sorted_translation_len
    
        # Forward propagate
        outputs, output_lengths = model(inputs, input_lengths)
        # print(outputs)
    
        # Calculate CTC Loss
        loss = criterion(outputs.transpose(0, 1), targets, output_lengths, target_lengths)
    
        print(output_lengths)
        print(target_lengths)
        # outputs_in_words=words_vocab.convert_pred_to_words(outputs.transpose(0, 1))
        # targets_in_words=words_vocab.convert_pred_to_words(targets)
        # wer=metrics_calculation(metric, outputs_in_words,targets_in_words)
        
        break
    
        if (i==random_index_per_epoch):
            print(outputs_in_words,targets_in_words)
    
        running_loss += loss.item()
        # wer_calc += wer
    
        output.clear(output_tags='some_outputs')
    
      loss_per_epoch=running_loss/(i+1)
      wer_per_epoch=wer_calc/(i+1)
    
      return loss_per_epoch, wer_per_epoch
    
    ################################################################################
    
    def train_eval_model(epochs):
      #conformer model init
      model = nn.DataParallel(Conformer(num_classes=16121, input_dim=201, encoder_dim=32, num_encoder_layers=1)).to(device)
    
      # Optimizers specified in the torch.optim package
      optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
    
      #loss function
      criterion = nn.CTCLoss().to(device)
    
      #metrics init
      metric=torchmetrics.WordErrorRate()
    
      for epoch in range(epochs):
        print("Epoch", epoch+1)
    
        ############################################################################
        #TRAINING      
        model.train()
        print("Training")
    
        # epoch_loss=train_model(model=model,optimizer=optimizer, criterion=criterion, loader=train_loader, metric=metric)
    
        # print(f'Loss: {epoch_loss:.3f}')
        # print(f'WER: {epoch_wer:.3f}')
    
        ############################################################################
        #EVALUATION
        model.train(False)
        print("Validation")
    
        epoch_val_loss, epoch_val_wer=eval_model(model=model,optimizer=optimizer, criterion=criterion, loader=test_loader, metric=metric)
        
        print(f'Loss: {epoch_val_loss:.3f}')     
        print(f'WER: {epoch_val_wer:.3f}')   
    
    ################################################################################
    
    def metrics_calculation(metric, predictions, targets):
        print(predictions)
        print(targets)
        wer=metric(predictions, targets)
    
        return wer
    
    
    
    train_eval_model(1)
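
    Editorial note (not part of the original thread): the grad_fn=<LogSoftmaxBackward0> in the dump above indicates that outputs are per-frame log-probabilities (what nn.CTCLoss expects) rather than raw probabilities, and the 490 time steps come from the encoder's convolutional subsampling of the input frames, so they naturally exceed the target lengths. A minimal greedy CTC decoding sketch under those assumptions (blank id assumed to be 0):

    import torch

    def greedy_ctc_decode(log_probs: torch.Tensor, lengths: torch.Tensor, blank: int = 0):
        # log_probs: (batch, time, vocab); lengths: (batch,) valid frame counts
        best = log_probs.argmax(dim=-1)  # most likely token per frame
        decoded = []
        for seq, length in zip(best, lengths):
            seq = torch.unique_consecutive(seq[: int(length)])  # collapse repeated tokens
            decoded.append([t for t in seq.tolist() if t != blank])  # drop CTC blanks
        return decoded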
    
    opened by jcgeo9 8
  • question about the relative shift function

    Hi @sooftware, thank you for coding this repo. I have a question about the relative shift function: https://github.com/sooftware/conformer/blob/c76ff16d01b149ae518f3fe66a3dd89c9ecff2fc/conformer/attention.py#L105 I don't quite understand how this function works. Could you elaborate on this?

    An example input and output of size 4 is shown below, which does not really make sense to me.

    Input:

    tensor([[[[-0.9623, -0.3168, -1.1478, -1.3076],
              [ 0.5907, -0.0391, -0.1849, -0.6368],
              [-0.3956,  0.2142, -0.6415,  0.2196],
              [-0.8194, -0.2601,  1.1337, -0.3478]]]])
    

    output:

    tensor([[[[-1.3076,  0.0000,  0.5907, -0.0391],
              [-0.1849, -0.6368,  0.0000, -0.3956],
              [ 0.2142, -0.6415,  0.2196,  0.0000],
              [-0.8194, -0.2601,  1.1337, -0.3478]]]])
    

    Thank you!
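
    Editorial note (not part of the original thread): the printed example matches the Transformer-XL style shift trick: pad a zero column on the left, fold the tensor so the columns slide by one position per row, and drop the first folded row. A reimplementation that reproduces the output above (an illustrative sketch, not necessarily the repo's exact code):

    import torch

    def relative_shift(pos_score: torch.Tensor) -> torch.Tensor:
        # pos_score: (batch, heads, seq_len, seq_len)
        batch, heads, q_len, k_len = pos_score.size()
        zeros = pos_score.new_zeros(batch, heads, q_len, 1)
        padded = torch.cat([zeros, pos_score], dim=-1)        # (b, h, T, T+1)
        padded = padded.view(batch, heads, k_len + 1, q_len)  # fold: columns slide per row
        return padded[:, :, 1:].view_as(pos_score)            # drop first row, restore shape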

    opened by ChanganVR 6
  • Decoding predictions to strings

    Hi, thanks for the great repo.

    The README Usage example gives outputs as a torch tensor of ints. How would you suggest decoding these to strings (the actual speech)?

    Thanks!
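
    Editorial sketch (not part of the original thread; names are hypothetical): keep the id-to-token mapping from training and join the predicted ids back into text, skipping special tokens:

    # built from your training vocabulary; entries here are only illustrative
    id_to_token = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "hello", 4: "world"}
    special_ids = {0, 1, 2}

    def ids_to_string(ids):
        return " ".join(id_to_token[i] for i in ids if i not in special_ids)

    print(ids_to_string([1, 3, 4, 2]))  # -> "hello world"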

    opened by Andrew-Brown1 3
  • mat1 and mat2 shapes cannot be multiplied (1323x9248 and 1568x32)

    These are the shapes of my input, input_len, target, target_len, where batch size = 27: [screenshot]

    This is the setup I am running (only using the first batch to check that it works before training with all the batches): [screenshot]

    This is the error I am getting: [screenshot]

    I need some assistance here please:)
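
    Editorial note (not part of the original thread): a mat1/mat2 mismatch like this usually means the last dimension of inputs does not equal the input_dim passed to Conformer, for example when the (batch, dim, time) -> (batch, time, dim) transpose is missing. A quick sanity check (input_dim being whatever you passed to the model):

    # inputs must be (batch_size, sequence_length, input_dim)
    assert inputs.dim() == 3 and inputs.size(-1) == input_dim, (
        f"expected (batch, time, {input_dim}), got {tuple(inputs.shape)}")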

    opened by jcgeo9 2
  • error when reproducing the example of use (RuntimeError: Input tensor at index 1 has invalid shape [1, 3085, 8, 10], but expected [1, 3085, 9, 10])

    Running the code results in an error:

    import torch
    print(torch.__version__)
    import torch.nn as nn
    from conformer import Conformer
    
    batch_size, sequence_length, dim = 3, 12345, 80
    
    cuda = torch.cuda.is_available()  
    device = torch.device('cuda' if cuda else 'cpu')
    
    inputs = torch.rand(batch_size, sequence_length, dim).to(device)
    input_lengths = torch.IntTensor([12345, 12300, 12000])
    targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                                [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                                [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)
    target_lengths = torch.LongTensor([9, 8, 7])
    
    model = nn.DataParallel(Conformer(num_classes=10, input_dim=dim, 
                                      encoder_dim=32, num_encoder_layers=3, 
                                      decoder_dim=32, device=device)).to(device)
    
    # Forward propagate
    outputs = model(inputs, input_lengths, targets, target_lengths)
    
    # Recognize input speech
    outputs = model.module.recognize(inputs, input_lengths)
    
    
    
    1.9.0+cu111
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-12-eea3aeffaf58> in <module>
         21 
         22 # Forward propagate
    ---> 23 outputs = model(inputs, input_lengths, targets, target_lengths)
         24 
         25 # Recognize input speech
    
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
        167             replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
        168             outputs = self.parallel_apply(replicas, inputs, kwargs)
    --> 169             return self.gather(outputs, self.output_device)
        170 
        171     def replicate(self, module, device_ids):
    
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py in gather(self, outputs, output_device)
        179 
        180     def gather(self, outputs, output_device):
    --> 181         return gather(outputs, output_device, dim=self.dim)
        182 
        183 
    
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py in gather(outputs, target_device, dim)
         76     # Setting the function to None clears the refcycle.
         77     try:
    ---> 78         res = gather_map(outputs)
         79     finally:
         80         gather_map = None
    
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py in gather_map(outputs)
         61         out = outputs[0]
         62         if isinstance(out, torch.Tensor):
    ---> 63             return Gather.apply(target_device, dim, *outputs)
         64         if out is None:
         65             return None
    
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/_functions.py in forward(ctx, target_device, dim, *inputs)
         73             ctx.unsqueezed_scalar = False
         74         ctx.input_sizes = tuple(i.size(ctx.dim) for i in inputs)
    ---> 75         return comm.gather(inputs, ctx.dim, ctx.target_device)
         76 
         77     @staticmethod
    
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/comm.py in gather(tensors, dim, destination, out)
        233                 'device object or string instead, e.g., "cpu".')
        234         destination = _get_device_index(destination, allow_cpu=True, optional=True)
    --> 235         return torch._C._gather(tensors, dim, destination)
        236     else:
        237         if destination is not None:
    
    RuntimeError: Input tensor at index 1 has invalid shape [1, 3085, 8, 10], but expected [1, 3085, 9, 10]
    

    I am using Python 3.8.8. Which version should it work with?
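
    Editorial note (not part of the original thread): the two shapes in the gather error differ in the target-length dimension, which suggests each nn.DataParallel replica produces an output sized by its shard's target lengths, so the replicas cannot be gathered. A workaround sketch is to pin the run to a single GPU, or to drop the nn.DataParallel wrapper as in the single-device sketch in the Usage section:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before the first CUDA call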

    opened by sovse 2
  • The .to(self.device) in return

    The inputs.to(self.device) in ConformerConvModule and FeedForwardModule causes the network graph in TensorBoard to fork and look somewhat messy. Is there any special reason to write it like that? Since in most cases we will have sent both the model and the tensor to the device before feeding the tensor to the model, no extra transfer should be needed, right?

    opened by panjiashu 2
  • Invalid size error when running usage in README

    Hello sooftware, thank you very much for your wonderful work!

    When I run the sample code from the Usage section of the README:

    import torch
    import torch.nn as nn
    from conformer import Conformer
    
    batch_size, sequence_length, dim = 3, 12345, 80
    
    cuda = torch.cuda.is_available()  
    device = torch.device('cuda' if cuda else 'cpu')
    
    inputs = torch.rand(batch_size, sequence_length, dim).to(device)
    input_lengths = torch.IntTensor([12345, 12300, 12000])
    targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                                [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                                [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)
    target_lengths = torch.LongTensor([9, 8, 7])
    
    model = nn.DataParallel(Conformer(num_classes=10, input_dim=dim, 
                                      encoder_dim=32, num_encoder_layers=3, 
                                      decoder_dim=32, device=device)).to(device)
    
    # Forward propagate
    outputs = model(inputs, input_lengths, targets, target_lengths)
    
    # Recognize input speech
    outputs = model.module.recognize(inputs, input_lengths)
    

    I got this error:

    Traceback (most recent call last):
      File "/home/xuchutian/ASR/sooftware-conformer/try.py", line 36, in <module>
        outputs = model(inputs, input_lengths, targets, target_lengths)
      File "/home/yangyi/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/yangyi/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
        return self.gather(outputs, self.output_device)
      File "/home/yangyi/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
        return gather(outputs, output_device, dim=self.dim)
      File "/home/yangyi/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
        res = gather_map(outputs)
      File "/home/yangyi/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
        return Gather.apply(target_device, dim, *outputs)
      File "/home/yangyi/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
        return comm.gather(inputs, ctx.dim, ctx.target_device)
      File "/home/yangyi/anaconda3/lib/python3.8/site-packages/torch/cuda/comm.py", line 165, in gather
        return torch._C._gather(tensors, dim, destination)
    RuntimeError: Gather got an input of invalid size: got [1, 3085, 8, 10], but expected [1, 3085, 9, 10]
    

    May I ask how to solve this error?

    Thank you very much.

    opened by chutianxu 2
  • use relative import

    The import path is currently absolute, which requires users to install the package or configure the Python path before using it. This can be improved with relative imports, so users can use the package without installing it first.
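
    For illustration (the module and class names here are hypothetical), inside a package module such as conformer/model.py:

    from .encoder import ConformerEncoder   # relative import: works without installation
    # instead of:
    # from conformer.encoder import ConformerEncoder  # absolute: needs install or sys.path setup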

    opened by bridgream 1
  • Remove device from the argument list

    This PR solves #33 by removing device from the argument list, which requires the user to manually move input tensors to the device, as done in the example code in the README.

    The property solution mentioned in #33 is not adopted, as it does not work with nn.DataParallel.

    When the devices of the input tensors and the module parameters match, the following .to(device) calls on the input tensors are not required, so they are removed in this PR:

    https://github.com/sooftware/conformer/blob/348e8af6c156dae19e311697cbb22b9581880a12/conformer/encoder.py#L117

    Besides, since the positional encoding is created from a buffer whose device changes together with the module, we don't have to call .to(device) here either; this is also removed in this PR:

    https://github.com/sooftware/conformer/blob/610a77667aafe533a85001298c522e7079503da4/conformer/attention.py#L147

    opened by enhuiz 1
  • Switching device

    Hi. I notice the model requires passing the device as an argument, which may not have been decided yet at the point of module initialization. Once the device is decided, it seems we cannot easily change it. Would you consider making the device switchable? One solution may be, instead of passing the device, to add a property:

    @property
    def device(self):
        return next(self.parameters()).device
    
    opened by enhuiz 1
  • cannot import name 'Conformer'

    Hi, when I tried to import conformer, I got this issue:

    >>> from conformer import Conformer
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/workspace/bert/conformer/conformer.py", line 3, in <module>
        from conformer import Conformer
    ImportError: cannot import name 'Conformer' from partially initialized module 'conformer' (most likely due to a circular import) (/workspace/bert/conformer/conformer.py)

    I followed the installation instructions. Would you please see where I might be wrong? Thanks.
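
    Editorial note (not part of the original thread): the traceback shows the import resolving to /workspace/bert/conformer/conformer.py, a local file that shadows the installed package and ends up importing itself. Renaming the local file or directory usually fixes it; you can check what actually gets imported:

    import conformer
    print(conformer.__file__)  # should point at the installed package, not a local conformer.py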

    opened by cyy857 1
  • Fix relative positional multi-head attention layer

    I referred to fairseq's conformer layer multi-head attention. [code] I also confirmed that it is training.

    1. math.sqrt(dim) -> math.sqrt(d_head)
    2. Add relative positional encoding module
    3. Fix _relative_shift method
       - input: B x n_head x T x 2T-1
       - output: B x n_head x T x T
    opened by upskyy 0
  • Feature Extraction using Pre-trained Conformer Model

    Is there any possibility of using a pre-trained Conformer model for feature extraction on another speech dataset? Have you uploaded your pre-trained model, and is there a tutorial on how to extract embeddings? Thank you.
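
    Editorial sketch (not part of the original thread), assuming the model exposes its encoder as model.encoder and that the encoder returns (encoder_outputs, encoder_output_lengths); verify against the source before relying on this:

    with torch.no_grad():
        encoder_outputs, encoder_output_lengths = model.encoder(inputs, input_lengths)
    embeddings = encoder_outputs  # (batch, subsampled_time, encoder_dim) frame-level features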

    opened by shakeel608 0
  • export onnx

    Hi, I am a little confused: if I want to export to ONNX, should I use the forward or the recognize function? The difference seems to be that in the recognize function, the number of decoder loop iterations adapts to the encoder outputs.
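
    Editorial note (not part of the original thread): recognize contains a data-dependent decoding loop that plain tracing cannot capture, so exporting the loop-free forward pass (or just the encoder) is the more common route. A sketch, with the example tensors from the README serving as tracing inputs:

    torch.onnx.export(
        model.module if hasattr(model, "module") else model,  # unwrap DataParallel if used
        (inputs, input_lengths, targets, target_lengths),     # example inputs for tracing
        "conformer.onnx",
        opset_version=13,
    )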

    opened by pengaoao 1
Releases: v1.0