Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Related tags

Overview

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that would do a first pass on OCR on these docs. That code is in the pdf_to_image.py script. I'd welcome improvement to the code, especially in image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the image via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with either of these approaches.

These Facebook Papers are especially challenging from an OCR perspective because many of them are pictures taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow on analysis.

And, other/better options exist. For a comprehensive, contained analysis, these other options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Related tags

Overview

Quick and Dirty OCR of Facebook Papers

Other (Better) Options

Want to help?

Owner

Bill Fitzgerald

A Tensorflow model for text recognition (CNN + seq2seq with visual attention) available as a Python package and compatible with Google Cloud ML Engine.

This is the open source implementation of the ICLR2022 paper "StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis"

WACV 2022 Paper - Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

PAGE XML format collection for document image page content and more

Give a solution to recognize MaoYan font.

Image Detector and Convertor App created using python's Pillow, OpenCV, cvlib, numpy and streamlit packages.

Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

An Agnostic Computer Vision Framework - Pluggable to any Training Library: Fastai, Pytorch-Lightning with more to come

【Auto】原神⭐钓鱼辅助工具 | 自动收竿、校准游标 | ✨您只需要抛出鱼竿，我们会帮你完成一切✨

Pixie - A full-featured 2D graphics library for Python

A buffered and threaded wrapper for the OpenCV VideoCapture object. Can speed up video decoding significantly. Supports

This is a GUI for scrapping PDFs with the help of optical character recognition making easier than ever to scrape PDFs.

pyntcloud is a Python library for working with 3D point clouds.

Document Layout Analysis

Using Opencv ,based on Augmental Reality(AR) and will show the feature matching of image and then by finding its matching

Random maze generator and solver

SemTorch

ScanTailor Advanced is the version that merges the features of the ScanTailor Featured and ScanTailor Enhanced versions, brings new ones and fixes.

The virtual calculator will be above the live streaming from your camera

2 telegram-bots: for image recognition and for text generation