Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

Last update: Sep 08, 2022

OCR Ground Truth for Historical Commentaries

The dataset OCR Ground Truth for Historical Commentaries (GT4HistComment) was created from the public domain subset of scholarly commentaries on Sophocles' Ajax. Its main goal is to enable the evaluation of OCR quality on printed materials that contain a mix of Latin and polytonic Greek scripts. It consists of five 19th-century commentaries written in German, English, and Latin, for a total of 3,356 GT lines.
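As an illustration of the kind of evaluation the dataset is meant to support, the sketch below computes a character error rate (CER) between a ground truth line and an OCR output line. It is a minimal example and not the evaluation code used by the authors: the function names are hypothetical, and applying NFC normalisation before comparison is an assumption made here so that precomposed and combining forms of polytonic Greek characters compare as equal.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the length of the ground truth line."""
    # Assumption: both strings are NFC-normalised before comparison.
    gt = unicodedata.normalize("NFC", ground_truth)
    ocr = unicodedata.normalize("NFC", ocr_output)
    if not gt:
        raise ValueError("ground truth line is empty")
    return levenshtein(gt, ocr) / len(gt)
```

Dividing by the ground truth length is one common convention; other tools normalise differently, so scores are only comparable when the same definition is used throughout.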

Data

The GT4HistComment data are contained in data/, where each sub-folder corresponds to a different publication (i.e. commentary). For each commentary we provide the following data:

<commentary_id>/GT-pairs: pairs of image/text files for each GT line
<commentary_id>/imgs: original images on which the OCR was performed
<commentary_id>/<commentary_id>_olr.tsv: OLR annotations with image region coordinates and layout type ground truth label
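The layout above can be read with a few lines of Python. The sketch below collects image/text pairs from a GT-pairs folder by matching file base names. The file naming is not specified here, so the sketch assumes that each pair shares the part of the file name before the first dot, that transcriptions end in .txt, and that images use one of a few common extensions; the function name is hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: line images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def collect_gt_pairs(gt_dir):
    """Return (image_path, text) tuples for files in gt_dir sharing a base name."""
    by_stem = defaultdict(dict)
    for path in sorted(Path(gt_dir).iterdir()):
        if not path.is_file():
            continue
        # Group on the name before the first dot, so that e.g.
        # "line1.png" and "line1.gt.txt" (assumed naming) end up together.
        stem = path.name.split(".", 1)[0]
        suffix = path.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            by_stem[stem]["image"] = path
        elif suffix == ".txt":
            by_stem[stem]["text"] = path
    pairs = []
    for stem in sorted(by_stem):
        entry = by_stem[stem]
        # Keep only complete pairs; orphaned images or texts are skipped.
        if "image" in entry and "text" in entry:
            text = entry["text"].read_text(encoding="utf-8").strip()
            pairs.append((entry["image"], text))
    return pairs
```

For example, collect_gt_pairs("data/bsb10234118/GT-pairs") would return the pairs for the Lobeck commentary, provided the files follow the naming assumed above.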

The OCR output produced by the Kraken + Ciaconna pipeline was manually corrected by a pool of annotators using the Lace platform. In order to ensure the quality of the ground truth datasets, an additional verification of all transcriptions made in Lace was carried out by an annotator on line-by-line pairs of image and corresponding text.

Commentary overview

ID	Commentator	Year	Languages	Image source
bsb10234118	Lobeck [1]	1835	Greek, Latin	BSB
sophokle1v3soph	Schneidewin [2]	1853	Greek, German	Internet Archive
cu31924087948174	Campbell [3]	1881	Greek, English	Internet Archive
sophoclesplaysa05campgoog	Jebb [5]	1896	Greek, English	Internet Archive
Wecklein1894	Wecklein [4]	1894	Greek, German	internal

Stats

Line, word, and character counts for each commentary are given in the following table. Detailed counts for each region can be found here.

ID	Commentator	Type	lines	words	all chars	greek chars
bsb10234118	Lobeck	training	574	2943	16081	5344
bsb10234118	Lobeck	groundtruth	202	1491	7917	2786
sophokle1v3soph	Schneidewin	training	583	2970	16112	3269
sophokle1v3soph	Schneidewin	groundtruth	382	1599	8436	2191
cu31924087948174	Campbell	groundtruth	464	2987	14291	3566
sophoclesplaysa05campgoog	Jebb	training	561	4102	19141	5314
sophoclesplaysa05campgoog	Jebb	groundtruth	324	2418	10986	2805
Wecklein1894	Wecklein	groundtruth	211	1912	9556	3268
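Counts of this kind can be approximated from the transcription lines themselves. The sketch below is one possible way to do so and is not the script that produced the table: the function names are hypothetical, whitespace is assumed not to count as a character, and a character is treated as Greek when it falls in the Unicode blocks Greek and Coptic (U+0370 to U+03FF) or Greek Extended (U+1F00 to U+1FFF).

```python
import unicodedata


def is_greek(char):
    """True for code points in the Greek and Coptic or Greek Extended blocks."""
    code = ord(char)
    return 0x0370 <= code <= 0x03FF or 0x1F00 <= code <= 0x1FFF


def count_stats(lines):
    """Count lines, whitespace-separated words, characters and Greek characters."""
    stats = {"lines": 0, "words": 0, "all_chars": 0, "greek_chars": 0}
    for line in lines:
        # NFC keeps each accented Greek letter as a single code point.
        line = unicodedata.normalize("NFC", line.strip())
        stats["lines"] += 1
        stats["words"] += len(line.split())
        # Assumption: whitespace is excluded from the character counts.
        non_space = [c for c in line if not c.isspace()]
        stats["all_chars"] += len(non_space)
        stats["greek_chars"] += sum(is_greek(c) for c in non_space)
    return stats
```

Because the counting conventions are assumptions, results from this sketch may differ slightly from the figures in the table.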

Commentary editions used:

[1] Lobeck, Christian August. 1835. Sophoclis Aiax. Leipzig: Weidmann.
[2] Sophokles. 1853. Sophokles Erklaert von F. W. Schneidewin. Erstes Baendchen: Aias. Philoktetes. Edited by Friedrich Wilhelm Schneidewin. Leipzig: Weidmann.
[3] Campbell, Lewis. 1881. Sophocles. Oxford: Clarendon Press.
[4] Wecklein, Nikolaus. 1894. Sophokleus Aias. München: Lindauer.
[5] Jebb, Richard Claverhouse. 1896. Sophocles: The Plays and Fragments. London: Cambridge University Press.

Citation

If you use this dataset in your research, please cite the following publication:

@inproceedings{romanello_optical_2021,
  title = {Optical {{Character Recognition}} of 19th {{Century Classical Commentaries}}: The {{Current State}} of {{Affairs}}},
  booktitle = {The 6th {{International Workshop}} on {{Historical Document Imaging}} and {{Processing}} ({{HIP}} '21)},
  author = {Romanello, Matteo and Najem-Meyer, Sven and Robertson, Bruce},
  year = {2021},
  publisher = {{Association for Computing Machinery}},
  address = {{Lausanne}},
  doi = {10.1145/3476887.3476911}
}

Acknowledgements

Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under an Ambizione grant PZ00P1_186033.

Contributors: Carla Amaya (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL), Bruce Robertson (Mount Allison University).


Comments

adds line-, word- and char-counts to README.md

Adds a table to README.md as suggested by reviewer 1. The table also links to a more complete table, itself a public version of spreadsheet OCR evaluation and stats!detailed_counts. Note that the publishable version is an external reference to our private version, meaning that updating the latter will also update the former.

opened by sven-nm

Pages to exclude - OCR

The page contains the metrical schemes of the passages. As a result, the OCR does not recognize them; moreover, the correction of the OCR output has not been completed.

Page to exclude: sophoclesplaysa05campgoog_0072.png (Jebb, p. 72)

opened by camaya28


Releases (v1.0)

v1.0 (Sep 24, 2021)

Owner

Ajax Multi-Commentary
