A dataset for online Arabic calligraphy

Last update: Dec 28, 2022

Overview

Calliar

Calliar is a dataset for Arabic calligraphy. It consists of 2,500 JSON files containing manually annotated strokes. This repository contains the dataset for the following paper:

Calliar: An Online Handwritten Dataset for Arabic Calligraphy
Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Yousif Ahmed Al-Wajih
https://arxiv.org/abs/2106.10745

Abstract: Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize such type of art by either taking a photo of decorated buildings or drawing them using digital devices. The latter is considered an online form where the drawing is tracked by recording the apparatus movement, an electronic pen for instance, on a screen. In the literature, there are many offline datasets collected with a diversity of Arabic styles for calligraphy. However, there is no available online dataset for Arabic calligraphy. In this paper, we illustrate our approach for the collection and annotation of an online dataset for Arabic calligraphy called Calliar that consists of 2,500 sentences. Calliar is annotated for stroke, character, word and sentence level prediction.

Stats

Dataset	# of Samples	# of Words	# of Chars	# of Strokes
Train	2,000	6,065	24,722	36,561
Valid	250	738	2,946	4,410
Test	250	753	3,052	4,601

Dataset Formats

The dataset is available in two formats.

.json

Each .json file contains a list of strokes. Each stroke is a dictionary that maps the stroke character to its list of points. Composite characters such as ت are decomposed into a list of primitive strokes, i.e., ..ٮ . Refer to the paper and to chars.py for more details on the mapping.
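As a rough illustration of the layout described above, the sketch below walks a parsed drawing and reports each stroke's label and point count. It assumes every stroke dictionary has a single key (the stroke character) and that points are [x, y] pairs. Neither detail is confirmed here, and summarize_drawing and the example data are hypothetical.

```python
import json
from typing import Dict, List, Tuple

Point = List[float]
Stroke = Dict[str, List[Point]]


def summarize_drawing(drawing: List[Stroke]) -> List[Tuple[str, int]]:
    """Return (stroke label, number of points) for each stroke in a drawing."""
    summary = []
    for stroke in drawing:
        # Assumed layout: one key per dictionary, the stroke character.
        for label, points in stroke.items():
            summary.append((label, len(points)))
    return summary


# Hypothetical two-stroke drawing standing in for the contents of a real file.
example = json.loads('[{"ٮ": [[0, 0], [5, 2], [10, 0]]}, {".": [[5, -3]]}]')
print(summarize_drawing(example))
```

For a real file, the parsed result of json.load on a dataset .json file would take the place of example.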

.npz

The compressed version of the dataset, dataset.npz, is only 8.6 MB. It uses the Ramer-Douglas-Peucker algorithm, applied with the Python library rdp, to reduce the number of points per stroke. The .npz format follows the same approach as QuickDraw.
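To make the simplification step concrete, here is a minimal pure-Python sketch of the Ramer-Douglas-Peucker algorithm. It is not the rdp library's code, and the epsilon used to build dataset.npz is not stated here, so the tolerance below is purely illustrative. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_line_distance(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        # a and b coincide, so fall back to the point-to-point distance.
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / norm


def rdp_simplify(points: Sequence[Point], epsilon: float) -> List[Point]:
    """Drop points that lie within epsilon of the simplified polyline."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist > epsilon:
        # Keep that point and simplify both halves recursively.
        left = rdp_simplify(points[: index + 1], epsilon)
        right = rdp_simplify(points[index:], epsilon)
        return left[:-1] + right
    return [first, last]


corner = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(rdp_simplify(corner, epsilon=0.1))  # -> [(0, 0), (2, 0), (2, 2)]
```

A larger epsilon removes more points at the cost of fidelity. The corner point survives here because it lies far from the chord joining the endpoints.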

Visualization

The vis.py file provides Python functions for visualizing the dataset. The two examples below draw a sample .json file and create an animation.

import json

from IPython.display import Video
from vis import *  # get_annotation, convert_3d, draw_strokes, create_animation

# Placeholder: replace with the path to one of the dataset's .json files.
json_path = "path/to/sample.json"

# Show an image of the strokes.
with open(json_path, encoding="utf-8") as f:
    drawing = json.load(f)
print(get_annotation(json_path))
data = convert_3d(drawing)
draw_strokes(data, stroke_width=2, crop=True)

# Create an animation.
create_animation(json_path)
Video("tmp/video.mp4")
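The example above passes the drawing through convert_3d before plotting, but the output layout of that function is not described here. If it mirrors the QuickDraw stroke-3 layout mentioned in the .npz section (rows of dx, dy, and a pen-lifted flag), the conversion might look like the sketch below. to_stroke3 is a hypothetical name, not a function from vis.py, and the input is assumed to be a list of strokes in absolute coordinates.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_stroke3(strokes: Sequence[Sequence[Point]]) -> List[Tuple[float, float, int]]:
    """Convert absolute-coordinate strokes to (dx, dy, pen_lifted) rows."""
    rows = []
    prev_x, prev_y = 0.0, 0.0  # offsets start from the origin
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            # The pen is lifted after the last point of each stroke.
            pen_lifted = 1 if i == len(stroke) - 1 else 0
            rows.append((x - prev_x, y - prev_y, pen_lifted))
            prev_x, prev_y = x, y
    return rows


print(to_stroke3([[(0, 0), (1, 2)], [(3, 3)]]))
```

Storing offsets instead of absolute positions keeps the values small, and the pen flag marks where one stroke ends and the next begins.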

Citation

@misc{alyafeai2021calliar,
      title={Calliar: An Online Handwritten Dataset for Arabic Calligraphy}, 
      author={Zaid Alyafeai and Maged S. Al-shaibani and Mustafa Ghaleb and Yousif Ahmed Al-Wajih},
      year={2021},
      eprint={2106.10745},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Owner

ARBML
