Large dataset storage format for PyTorch

Last update: Oct 22, 2022

Overview

H5Record

A storage format for large datasets (> 100 GB, <= 1 TB) in PyTorch (work in progress).

Supports Python 3.

pip install h5record

Why?

Storing large datasets is still a wild west in PyTorch. Approaches seen in the wild include:
- A large directory with lots of small files: IO is slow when complex files are fetched and deserialized frequently.
- A database: behavior depends on the database engine used, and multi-process reads are usually not supported.

Both approaches scale non-linearly in storage size as the data grows.

TFRecord solves these problems well: it supports multi-process fetching, (de)compression, and fast serialization (protobuf). However, the TFRecord port does not support dataset size evaluation (used frequently by DataLoader) and offers no index-level access (important for data evaluation or verification).

H5Record aims to tackle these TFRecord problems by compressing the dataset into an HDF5 file with an easy-to-use interface built on predefined field types (String, Image, Sequences, Integer).
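
To make "size evaluation" and "index-level access" concrete: PyTorch's map-style datasets are objects that implement `__len__` and `__getitem__`, and the default samplers call `len(dataset)` to plan an epoch. The sketch below shows only that protocol in plain Python. `InMemoryRecords` is a hypothetical stand-in invented for illustration; it is not part of h5record and says nothing about how H5Record stores data internally.

```python
from typing import Any, Dict, List, Sequence


class InMemoryRecords:
    """Hypothetical stand-in for a record store; not part of h5record."""

    def __init__(self, rows: Sequence[Dict[str, Any]]) -> None:
        self._rows: List[Dict[str, Any]] = list(rows)

    def __len__(self) -> int:
        # Size evaluation: a sampler needs this to know how many indices exist.
        return len(self._rows)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        # Index-level access: fetch one record without iterating over the rest.
        return self._rows[idx]


records = InMemoryRecords([
    {'sentence1': 'Sent 1.', 'sentence2': 'Sent 2', 'label': 0.1},
    {'sentence1': 'Sent 3', 'sentence2': 'Sent 4', 'label': 0.2},
])
print(len(records))
print(records[1]['label'])
```

A purely sequential record stream offers neither operation cheaply, which is the gap described above.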

Some advantages of using H5Record:

- Supports multi-process reads
- Relatively simple to use, with low technical debt
- Supports compression and decompression on the fly
- Can be loaded quickly into memory if required

Simple usage

pip install h5record

Sentence Similarity

from h5record import H5Dataset, Float, String

schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)
data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

def pair_iter():
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
for idx in range(len(dataset)):
    print(dataset[idx])
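
In the example above, every dictionary yielded by the iterator uses the same keys as the `name` values in the schema. A minimal sketch of a pre-write check for that, assuming the keys are expected to match the schema names exactly (an assumption drawn from the example, not from h5record's documentation). `checked_rows` is a hypothetical helper, not an h5record function.

```python
from typing import Any, Dict, Iterable, Iterator, Sequence


def checked_rows(field_names: Sequence[str],
                 rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield rows unchanged; raise KeyError if a row's keys differ from field_names."""
    expected = set(field_names)
    for position, row in enumerate(rows):
        if set(row) != expected:
            raise KeyError(
                f'row {position}: expected keys {sorted(expected)}, got {sorted(row)}'
            )
        yield row


names = ['sentence1', 'sentence2', 'label']
good = [{'sentence1': 'Sent 1.', 'sentence2': 'Sent 2', 'label': 0.1}]
print(list(checked_rows(names, good)))
```

Wrapping an iterator such as `pair_iter()` with a check like this surfaces a misspelled key at the offending row instead of somewhere inside the write.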

Note

Because development is still in progress, use this package with care on storage formatted as FAT or FAT32.

Comparison between different compression algorithms

No chunking is used.

Compression type	File size	Read speed (rows/second)
no compression	2.0G	2084.55 it/s
lzf	1.7G	1496.14 it/s
gzip	1.1G	843.78 it/s

Benchmarked on an i7-9700 with a 1 TB NVMe SSD.
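
The table shows the usual trade-off: stronger compression gives a smaller file but slower reads. HDF5's gzip filter is DEFLATE-based, the same algorithm Python's `zlib` module exposes, so the effect can be explored with the standard library alone. This is an illustrative sketch on a made-up, highly repetitive payload; it does not reproduce the benchmark above, and `measure` is a hypothetical helper.

```python
import time
import zlib
from typing import Tuple


def measure(payload: bytes, level: int) -> Tuple[int, float]:
    """Return (compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    packed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Round-trip to confirm that compression is lossless.
    assert zlib.decompress(packed) == payload
    return len(packed), elapsed


# Made-up sample data: one tab-separated row repeated many times.
payload = b'Sent 1.\tSent 2\t0.1\n' * 50_000

for level in (0, 1, 9):  # 0 = stored without compression, 9 = strongest
    size, seconds = measure(payload, level)
    print(f'level={level} size={size} bytes time={seconds:.4f}s')
```

Real datasets compress far less than this repetitive payload, so the ratios printed here should not be compared with the file sizes in the table.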

If you are interested in learning more, feel free to check out the note as well!



Releases (1.0.4)

1.0.4 (Jun 8, 2021)
Minor bug fix

1.0.3 (Jun 6, 2021)
Support for image sequence, float16 sequence, float sequence and float16 datatype
Fix bugs

1.0.1 (Jun 5, 2021)


Owner

theblackcat102
