Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Overview

Table of Contents

  1. Overview
  2. Requirements
  3. Demo
  4. Modules

Overview

This python package contains modules to help with finding and extracting tabular data from a PDF or image into a CSV format.

Given an image that contains a table…

img

Extract the the text into a CSV format…

PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$5,16.66,"154,097"
$7,40.01,"64,169"
$10,26.67,"96,283"
$20,100.00,"25,677"
$30,290.83,"8,829"
$50,239.66,"10,714"
$100,919.66,"2,792"
$500,"6,652.07",386
"$40,000","855,899.99",3
1,i223,
Toa,,
,,
,,"* Based upon 2,567,700"

Requirements

Along with the python requirements that are listed in setup.py and that are automatically installed when installing this package through pip, there are a few external requirements for some of the modules.

I haven’t looked into the minimum required versions of these dependencies, but I’ll list the versions that I’m using.

Demo

There is a demo module that will download an image given a URL and try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.

  1. pip3 install table_ocr
  2. python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

That will run against the following image:

img

The following should be printed to your terminal after running the above commands.

Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
Extracted the following tables from the image:
[('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
Processing tables for /tmp/demo_p9on6m8o/simple.png.
Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
Cells:
/tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
/tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
/tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
...

Here is the entire CSV output:

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4

Modules

The package is split into modules with narrow focuses.

  • pdf_to_images uses Poppler and ImageMagick to extract images from a PDF.
  • extract_tables finds and extracts table-looking things from an image.
  • extract_cells extracts and orders cells from a table.
  • ocr_image uses Tesseract to OCR the text from an image of a cell.
  • ocr_to_csv converts into a CSV the directory structure that ocr_image outputs.

The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow, as demonstrated by the following shell script.

#!/bin/sh

PDF=$1

python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {}  | grep table > /tmp/extracted-tables.txt
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}

for image in $(cat /tmp/extracted-tables.txt); do
    dir=$(dirname $image)
    python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
done

The package was written in a literate programming style. The source code at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html is meant to act as the documentation and reference material.

Owner
Eric Ihli
Eric Ihli
Open Source Computer Vision Library

OpenCV: Open Source Computer Vision Library Resources Homepage: https://opencv.org Courses: https://opencv.org/courses Docs: https://docs.opencv.org/m

OpenCV 65.7k Jan 03, 2023
Rest API Written In Python To Classify NSFW Images.

✨ NSFW Classifier API ✨ Rest API Written In Python To Classify NSFW Images. Fastest Solution If you don't want to selfhost it, there's already an inst

Akshay Rajput 23 Dec 30, 2022
Scene text recognition

AttentionOCR for Arbitrary-Shaped Scene Text Recognition Introduction This is the ranked No.1 tensorflow based scene text spotting algorithm on ICDAR2

777 Jan 09, 2023
Indonesian ID Card OCR using tesseract OCR

KTP OCR Indonesian ID Card OCR using tesseract OCR KTP OCR is python-flask with tesseract web application to convert Indonesian ID Card to text / JSON

Revan Muhammad Dafa 5 Dec 06, 2021
Fatigue Driving Detection Based on Dlib

Fatigue Driving Detection Based on Dlib

5 Dec 14, 2022
Image augmentation library in Python for machine learning.

Augmentor is an image augmentation library in Python for machine learning. It aims to be a standalone library that is platform and framework independe

Marcus D. Bloice 4.8k Jan 04, 2023
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

EasyOCR Ready-to-use OCR with 80+ languages supported including Chinese, Japanese, Korean and Thai. What's new 1 February 2021 - Version 1.2.3 Add set

Jaided AI 16.7k Jan 03, 2023
Balabobapy - Using artificial intelligence algorithms to continue the text

Balabobapy - Using artificial intelligence algorithms to continue the text

qxtony 1 Feb 04, 2022
Brief idea about our project is mentioned in project presentation file.

Brief idea about our project is mentioned in project presentation file. You just have to run attendance.py file in your suitable IDE but we prefer jupyter lab.

Dhruv ;-) 3 Mar 20, 2022
Image Smoothing and Blurring Using OpenCV

Image-Smoothing-and-Blurring-Using-OpenCV This repository contains codes for performing image smoothing and blurring using OpenCV. There are different

Happy N. Monday 3 Feb 15, 2022
FastOCR is a desktop application for OCR API.

FastOCR FastOCR is a desktop application for OCR API. Installation Arch Linux fastocr-git @ AUR Build from AUR or install with your favorite AUR helpe

Bruce Zhang 58 Jan 07, 2023

Installations for running keras-theano on GPU Upgrade pip and install opencv2 cd ~ pip install --upgrade pip pip install opencv-python Upgrade keras

Berat Kurar Barakat 14 Sep 30, 2022
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

CUTIE TensorFlow implementation of the paper "CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor." Xiaohu

Zhao,Xiaohui 147 Dec 20, 2022
color detection using python

colordetection color detection using python In this color detection Python project, we are going to build an application through which you can automat

Ruchith Kumar 1 Nov 04, 2021
Code for paper "Role-based network embedding via structural features reconstruction with degree-regularized constraint"

Role-based network embedding via structural features reconstruction with degree-regularized constraint Train python main.py --dataset brazil-flights

wang zhang 1 Jun 28, 2022
Color Picker and Color Detection tool for METR4202

METR4202 Color Detection Help This is sample code that can be used for the METR4202 project demo. There are two files provided, both running on Python

Miguel Valencia 1 Oct 23, 2021
Table Extraction Tool

Tree Structure - Table Extraction Fonduer has been successfully extended to perform information extraction from richly formatted data such as tables.

HazyResearch 88 Jun 02, 2022
Text language identification using Wikipedia data

Text language identification using Wikipedia data The aim of this project is to provide high-quality language detection over all the web's languages.

Vsevolod Dyomkin 28 Jul 09, 2022
Super Mario Game With Python

Super_Mario Hello all this is a simple python program which tries to use our body as a controller for the super mario game Here I have used media pipe

Adarsh Badagala 219 Nov 25, 2022
Handwritten Number Recognition using CNN and Character Segmentation

Handwritten-Number-Recognition-With-Image-Segmentation Info About this repository This Repository is aimed at reading handwritten images of numbers an

Sparsha Saha 17 Aug 25, 2022