Data pipelines for both TensorFlow and PyTorch!

Last update: Dec 08, 2021

Overview

rapidnlp-datasets

Data pipelines for both TensorFlow and PyTorch!

If you want to load public datasets, try:

If you want to load local, personal datasets with minimal boilerplate, use rapidnlp-datasets!

Installation

pip install -U rapidnlp-datasets

If you work with PyTorch, you should install PyTorch first.

If you work with TensorFlow, you should install TensorFlow first.
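
Because the framework has to be installed separately, it can help to check which one is importable before picking between the `rapidnlp_datasets.pt` and `rapidnlp_datasets.tf` modules. The following is a minimal sketch using only the standard library; the helper name `available_backends` is hypothetical and not part of rapidnlp-datasets.

```python
import importlib.util


def available_backends():
    """Return the display names of the frameworks that can be imported.

    Hypothetical helper for illustration; not part of rapidnlp-datasets.
    """
    candidates = {"PyTorch": "torch", "TensorFlow": "tensorflow"}
    return [
        name
        for name, module in candidates.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is not None
    ]


found = available_backends()
print("Available backends:", ", ".join(found) if found else "none")
```

If the list comes back empty, install PyTorch or TensorFlow before running the examples in the Usage section.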

Usage

Here are a few examples that show how to use this library.

QuickStart: Sequence Classification Task
QuickStart: Question Answering Task
QuickStart: Token Classification Task
QuickStart: Masked Language Model Task
QuickStart: SimCSE (Sentence Embedding)

sequence-classification-quickstart

In PyTorch,

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForSequenceClassification
>>> dataset = DatasetForSequenceClassification.from_jsonl_files(
...     input_files=["testdata/sequence_classification.jsonl"],
...     vocab_file="testdata/vocab.txt",
... )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
...
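
The example above reads its input from a `.jsonl` file, that is, a text file with one JSON object per line. As a rough sketch of how such a file could be produced and read back with the standard library: the field names `sequence` and `label` below are assumptions made for illustration only, so check the project's testdata for the schema the library actually expects. The helpers `write_jsonl` and `read_jsonl` are hypothetical as well.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Assumed field names, for illustration only.
records = [
    {"sequence": "I love this movie", "label": 1},
    {"sequence": "The plot was dull", "label": 0},
]

path = os.path.join(tempfile.mkdtemp(), "sequence_classification.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
print(loaded)
```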

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForSequenceClassifiation
>>> dataset, d = TFDatasetForSequenceClassifiation.from_jsonl_files(
...     input_files=["testdata/sequence_classification.jsonl"],
...     vocab_file="testdata/vocab.txt",
...     return_self=True,
... )
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
...

Notably, when working with TensorFlow you can save the dataset in TFRecord format and then build a dataset directly from the TFRecord files!

>>> d.save_tfrecord("testdata/sequence_classification.tfrecord")
2021-12-08 14:52:41,295    INFO             utils.py  128] Finished to write 2 examples to tfrecords.
>>> dataset = TFDatasetForSequenceClassifiation.from_tfrecord_files("testdata/sequence_classification.tfrecord")
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
...

question-answering-quickstart

In PyTorch:

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForQuestionAnswering
>>>
>>> dataset = DatasetForQuestionAnswering.from_jsonl_files(
...     input_files="testdata/qa.jsonl",
...     vocab_file="testdata/vocab.txt",
... )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
...

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForQuestionAnswering
>>> dataset, d = TFDatasetForQuestionAnswering.from_jsonl_files(
...     input_files="testdata/qa.jsonl",
...     vocab_file="testdata/vocab.txt",
...     return_self=True,
... )
2021-12-08 15:09:06,747    INFO question_answering_dataset.py  101] Read 3 examples in total.
>>> for idx, batch in enumerate(iter(dataset)):
...     print()
...     print("NO.{} batch: \n{}".format(idx, batch))
...

Notably, when working with TensorFlow you can save the dataset in TFRecord format and then build a dataset directly from the TFRecord files!

>>> d.save_tfrecord("testdata/qa.tfrecord")
2021-12-08 15:09:31,329    INFO             utils.py  128] Finished to write 3 examples to tfrecords.
>>> dataset = TFDatasetForQuestionAnswering.from_tfrecord_files(
...     "testdata/qa.tfrecord",
...     batch_size=32,
...     padding="batch",
... )
>>> for idx, batch in enumerate(iter(dataset)):
...     print()
...     print("NO.{} batch: \n{}".format(idx, batch))
...
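
Both `collate_fn=dataset.batch_padding_collate` in the PyTorch examples and `padding="batch"` in the TensorFlow example point to padding that is applied per batch. The sketch below illustrates the general idea in plain Python, assuming that each batch is padded to the length of its longest sequence. It is not the library's implementation: the function `pad_batch` and the choice of `0` as the padding id are hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length found in this batch.

    Returns the padded token ids and a mask with 1 for real tokens
    and 0 for padding positions.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch = [[101, 7, 8, 102], [101, 9, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch, as opposed to padding to one fixed maximum length, keeps short batches short, which generally saves memory and compute.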

token-classification-quickstart

masked-language-models-quickstart

simcse-quickstart


Releases(v0.2.0)

v0.2.0(Feb 1, 2022)
Updates:

Refactor datasets for different tasks, supporting both PyTorch and TensorFlow!

v0.1.0(Dec 8, 2021)
Updates:

Refactor dataset pipelines

Add support for PyTorch

Rename package from naivenlp-datasets to rapidnlp-datasets

v0.0.6(Nov 22, 2021)
Updates:

Fixed minor bugs

v0.0.5(Nov 22, 2021)
Updates:

Fixed typo

v0.0.4(Nov 21, 2021)
Updates:

Add datapipe for masked language model

v0.0.3(Nov 21, 2021)
Updates:

Add datapipe for sequence classification

Add datapipe for token classification

Add datapipe for SimCSE

v0.0.1(Nov 19, 2021)
Updates:

Add support for loading datasets for the question answering task
