A deep learning model that describes the content of an image as speech, using caption generation with an attention mechanism on the Flickr8K dataset.

Last update: Feb 08, 2022

Related tags

Text Data & NLP Eye_for_the_blind

Overview

Eye for the blind

The goal is to create a deep learning model that describes the content of an image as speech, using caption generation with an attention mechanism on the Flickr8K dataset. Such a model is intended for blind people, who can then understand any image by listening to its description. The caption generated by a CNN-RNN model is converted to speech using a text-to-speech library.

This problem combines deep learning and natural language processing. A CNN-based encoder extracts the features of an image, and an RNN-based decoder turns those features into a caption.

The project is an extended application of the paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/abs/1502.03044).
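To make the attention step more concrete, here is a minimal sketch of additive (soft) attention for a single decoding step, written in plain Python. It is an illustration under stated assumptions, not the project's code: the function names, the tiny 2-dimensional vectors, and the hand-picked parameters w_feat, w_hid, and v are all hypothetical. In a real model the region features would come from the CNN encoder and the parameters would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def additive_attention(features, hidden, w_feat, w_hid, v):
    """Return (context, weights) for one decoding step.

    features: list of region vectors (stand-in for CNN encoder output)
    hidden:   current decoder hidden state
    w_feat, w_hid, v: projection parameters (hypothetical, hand-picked here)
    """
    projected_hidden = matvec(w_hid, hidden)
    scores = []
    for region in features:
        projected_region = matvec(w_feat, region)
        activated = [
            math.tanh(a + b) for a, b in zip(projected_region, projected_hidden)
        ]
        scores.append(dot(v, activated))
    # Attention weights are positive and sum to 1 across image regions.
    weights = softmax(scores)
    # The context vector is the weighted sum of the region features.
    dim = len(features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights


# Toy input: three image regions with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = additive_attention(features, hidden, identity, identity, [1.0, 1.0])
print(weights)
print(context)
```

The decoder would receive the context vector together with the previous word at each step, so the regions with the largest weights have the most influence on the next word.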

The dataset is taken from Kaggle. It consists of 8,000 images, each paired with five different captions that clearly describe the salient entities and events in the image.
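As a rough illustration of this image-to-captions pairing, the sketch below groups captions by image file name. It assumes the captions are stored as a CSV file with image and caption columns. That layout is an assumption and should be checked against the downloaded files. The sample rows and file names are made up, and load_captions and images_without_n_captions are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict


def load_captions(handle):
    """Group captions by image name from a CSV with 'image' and 'caption' columns.

    The column names are an assumption about the file layout.
    """
    captions = defaultdict(list)
    for row in csv.DictReader(handle):
        captions[row["image"]].append(row["caption"].strip())
    return dict(captions)


def images_without_n_captions(captions, expected=5):
    """Return the images whose caption count differs from the expected number."""
    return sorted(name for name, caps in captions.items() if len(caps) != expected)


# Made-up sample standing in for the real captions file.
sample = io.StringIO(
    "image,caption\n"
    "img_001.jpg,A dog runs across the grass .\n"
    "img_001.jpg,A brown dog is running outside .\n"
    "img_002.jpg,Two children play on a beach .\n"
)
captions = load_captions(sample)
print(captions["img_001.jpg"])
print(images_without_n_captions(captions, expected=2))
```

With the real file, passing an open file handle instead of the in-memory sample and checking for exactly five captions per image is a quick sanity test before preprocessing.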

Project Pipeline

The project pipeline can be briefly summarized in the following five steps:

Data Understanding: Load the data and understand how it is represented.
Data Preprocessing: Process both images and captions into the desired format.
Train/Test Split: Combine images and captions to create the train and test datasets.
Model Building: Create the image captioning model by building the Encoder, Attention, and Decoder components.
Model Evaluation: Evaluate the model using greedy search and the BLEU score.
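
The evaluation step can be sketched as follows. This is a simplified illustration, not the project's implementation. Greedy search picks the most probable next word at every step, and a small sentence-level BLEU compares the result with reference captions. The lookup table TOY_TABLE stands in for the trained decoder and is entirely hypothetical, as are the function names. The BLEU here uses clipped n-gram precision and a brevity penalty without smoothing, so in practice a library implementation would normally be preferable.

```python
import math
from collections import Counter


def greedy_decode(next_token_probs, start="<start>", end="<end>", max_len=20):
    """Build a caption by always taking the most probable next token."""
    tokens = [start]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        if best == end:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=2):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip each candidate n-gram count by its maximum count in any reference.
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min(
        (len(r) for r in references),
        key=lambda length: (abs(length - len(candidate)), length),
    )
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical stand-in for the decoder: probabilities depend on the last token only.
TOY_TABLE = {
    "<start>": {"a": 0.7, "the": 0.3},
    "a": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.8, "<end>": 0.2},
    "runs": {"<end>": 0.9, "a": 0.1},
}


def toy_next_token_probs(tokens):
    return TOY_TABLE[tokens[-1]]


caption = greedy_decode(toy_next_token_probs)
references = [["a", "dog", "runs"], ["the", "dog", "runs", "fast"]]
score = bleu(caption, references)
print(caption, score)
```

Greedy search is fast but can miss captions that are more probable overall. Beam search is a common alternative that keeps several candidate captions at each step.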

Owner

Ragesh Hajela
