A torch.Tensor-like DataFrame library supporting multiple execution runtimes and Arrow as a common memory format

Overview

TorchArrow (Warning: Unstable Prototype)

This is a prototype library currently under heavy development. It does not yet have stable releases, and it will likely change significantly in backward-incompatible ways until the beta release (targeting early 2022). If you have suggestions on the API or use cases you would like covered, please open a GitHub issue. We would love to hear your thoughts and feedback.

TorchArrow is a torch.Tensor-like Python DataFrame library for data preprocessing in deep learning. It supports multiple execution runtimes and Arrow as a common format.

It plans to provide:

  • A Python DataFrame library implementing a streaming-friendly subset of Pandas
  • Seamless handoff with PyTorch or other model authoring, such as Tensor collation and easily plugging into PyTorch DataLoader and DataPipes
  • Zero copy for external readers via Arrow in-memory columnar format
  • High-performance CPU backend via Velox
  • GPU backend via libcudf
  • High-performance C++ UDF support with vectorization
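As a rough illustration of what "zero copy" means for a columnar buffer, here is a minimal sketch that uses only the Python standard library. It does not use Arrow or TorchArrow: `array` and `memoryview` merely stand in for a column buffer and a reader that shares it, and the column name `prices` is hypothetical.

```python
from array import array

# One column stored as a single contiguous buffer of float64 values.
prices = array("d", [1.5, 2.5, 3.5, 4.5])

# A memoryview exposes the same buffer to a reader without copying it.
view = memoryview(prices)

# Slicing a memoryview is also zero-copy: `window` points into `prices`.
window = view[1:3]

# A write through the owning column is visible through the shared view,
# which shows that no copy was made.
prices[1] = 20.0
print(window.tolist())
```

A reader that copied the data would still see the old value after the write; a shared view does not.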

Installation

Binaries

Coming soon!

From Source

If you are installing from source, you will need Python 3.8 or later and a C++17 compiler. We also highly recommend installing a Miniconda environment.
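Before building, it can help to check these prerequisites. The following is a minimal sketch, not part of TorchArrow: the helper names and the list of compiler executables are assumptions, and the script only checks that a compiler is on the PATH, not that it supports C++17.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)
# Assumed compiler executable names; adjust for your toolchain.
CANDIDATE_COMPILERS = ("c++", "g++", "clang++")


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version is Python 3.8 or later."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def find_cxx_compiler():
    """Return the path of the first candidate compiler on PATH, or None."""
    for name in CANDIDATE_COMPILERS:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("C++ compiler:", find_cxx_compiler() or "not found")
```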

Get the TorchArrow Source

git clone --recursive https://github.com/facebookresearch/torcharrow
cd torcharrow
# if you are updating an existing checkout
git submodule sync --recursive
git submodule update --init --recursive

Install Dependencies

On macOS

Homebrew is required to install development tools on macOS.

# Install dependencies from Brew
brew install --formula ninja cmake ccache protobuf icu4c boost gflags glog libevent lz4 lzo snappy xz zstd

# Build and install other dependencies
scripts/build_mac_dep.sh ranges_v3 googletest fmt double_conversion folly re2

On Ubuntu (20.04 or later)

# Install dependencies from APT
apt install -y g++ cmake ccache ninja-build checkinstall \
    libssl-dev libboost-all-dev libdouble-conversion-dev libgoogle-glog-dev \
    libbz2-dev libgflags-dev libgtest-dev libgmock-dev libevent-dev libfmt-dev \
    libprotobuf-dev liblz4-dev libzstd-dev libre2-dev libsnappy-dev liblzo2-dev \
    protobuf-compiler
# Build and install Folly
scripts/install_ubuntu_folly.sh

Install TorchArrow

For local development, you can build in debug mode:

DEBUG=1 python setup.py develop

Then run the unit tests with:

python -m unittest -v

To install TorchArrow in release mode (WARNING: the build may take a very long time):

python setup.py install

Documentation

This 10-minute tutorial provides a short introduction to TorchArrow. More documents on advanced topics are coming soon!

Future Plans

We hope to sufficiently expand the library, harden APIs, and gather feedback to enable a beta release at the time of the PyTorch 1.11 release (early 2022).

License

TorchArrow is BSD licensed, as found in the LICENSE file.

Comments
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/95b09c7bad6baa93d8f6add4562dfe0cdc8c26cd

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 80
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/0cb50b9fdfccbb277e62e1c2541ae084b29d6080

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 79
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/e432b1df0be62f65e0ba00b4fb966605bcb1443e

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 58
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/7673b382d909add4738240ae0157f2d5cafcf546

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 56
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/5e37e22c974fcd9caceb3dd97a0e84386d188474

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 55
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/08be6833961213b6679a7a7707ca53d486ff84df

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 46
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/41971b30c1cdd9f984018d6a496bc3b99afc7b45

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 36
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/4a36551237993f519dbf5bbae70a4ac9a660bf0d

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 35
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/0e4d3a5efece59b7d6a6a8f23ed7e4668d078e2e

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 33
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/fb7b62bede0beb66cf87fe888d71a9c366fe5ed6

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 33
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/b32878fb54eefb01c0c577439d0d6d61644dcff9

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 33
  • Stable Release Roadmap

    Hello, I see that the development of the library has slowed down a bit, hence I would like to ask if there exists a roadmap for the first stable release or if there's any other plan for TorchArrow. Thank you very much for your work!

    opened by mbignotti 0
  • `from_arrow` with `List` columns

    Summary: Adds some basic functionality to allow Arrow tables/arrays with List[primitive_type] columns to be converted to a ta.Dataframe.

    Implemented by converting the list column to a pylist and wrapping _from_pysequence. Not super efficient, but provides some functionality to unblock these columns.

    Tests: Modified the previous test case that checked for the unsupported type. Ran python -m unittest -v:

    ----------------------------------------------------------------------
    Ran 196 tests in 1.108s
    
    OK
    
    CLA Signed 
    opened by myzha0 0
  • Generalize Dispatcher class

    Summary: Generalizes the Dispatcher class so that it can be reused in different contexts. Also switches to a global-instance-as-singleton pattern, so that we can create instances for different use cases without worrying about accidentally sharing the class variable calls.

    Differential Revision: D40188963

    CLA Signed fb-exported 
    opened by OswinC 1
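The global-instance-as-singleton pattern mentioned in this pull request can be sketched as follows. This is a generic illustration under stated assumptions, not TorchArrow's actual Dispatcher: the method names, the `_calls` attribute, and the two module-level instances are hypothetical. The point is that the registry lives on each instance rather than on the class, so separate use cases cannot share entries by accident.

```python
from typing import Any, Callable, Dict, Hashable


class Dispatcher:
    """Maps a key to a callable; the registry is per instance, not per class."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Instance attribute: each Dispatcher gets its own registry.
        self._calls: Dict[Hashable, Callable[..., Any]] = {}

    def register(self, key: Hashable, fn: Callable[..., Any]) -> None:
        if key in self._calls:
            raise ValueError(f"{self.name}: {key!r} is already registered")
        self._calls[key] = fn

    def lookup(self, key: Hashable) -> Callable[..., Any]:
        try:
            return self._calls[key]
        except KeyError:
            raise KeyError(f"{self.name}: no entry for {key!r}") from None


# Module-level instances act as singletons, one per use case (hypothetical names).
column_factory = Dispatcher("column_factory")
udf_registry = Dispatcher("udf_registry")

column_factory.register(("int64", "cpu"), lambda values: list(values))
udf_registry.register("double", lambda x: 2 * x)
```

With a class-level dictionary, both registrations above would land in the same mapping; with per-instance state, `udf_registry` knows nothing about `("int64", "cpu")`.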
  • Support for arrays in torcharrow.from_arrow

    Hi guys! When using ParquetDataFrameLoader, I ran into a problem loading a Parquet file that has an array field. It looks like it comes down to torcharrow.from_arrow not supporting array columns, even though torcharrow itself already seems to support array columns. Are there any plans to implement this when loading from Parquet files, or are there any problems that stop this from being implemented?

    The error basically looks like this:

    NotImplementedError                       Traceback (most recent call last)
    Input In [25], in <cell line: 1>()
    ----> 1 next(iter(datapipe))
    
    File /opt/conda/lib/python3.8/site-packages/torch/utils/data/datapipes/_typing.py:514, in hook_iterator.<locals>.wrap_generator(*args, **kwargs)
        512         response = gen.send(None)
        513 else:
    --> 514     response = gen.send(None)
        516 while True:
        517     request = yield response
    
    File /opt/conda/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/combinatorics.py:127, in ShufflerIterDataPipe.__iter__(self)
        125 self._rng.seed(self._seed)
        126 self._seed = None
    --> 127 for x in self.datapipe:
        128     if len(self._buffer) == self.buffer_size:
        129         idx = self._rng.randint(0, len(self._buffer) - 1)
    
    File /opt/conda/lib/python3.8/site-packages/torch/utils/data/datapipes/_typing.py:514, in hook_iterator.<locals>.wrap_generator(*args, **kwargs)
        512         response = gen.send(None)
        513 else:
    --> 514     response = gen.send(None)
        516 while True:
        517     request = yield response
    
    File /opt/conda/lib/python3.8/site-packages/torchdata/datapipes/iter/util/dataframemaker.py:138, in ParquetDFLoaderIterDataPipe.__iter__(self)
        135 for i in range(num_row_groups):
        136     # TODO: More fine-grain control over the number of rows or row group per DataFrame
        137     row_group = parquet_file.read_row_group(i, columns=self.columns, use_threads=self.use_threads)
    --> 138     yield torcharrow.from_arrow(row_group, dtype=self.dtype)
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/interop.py:32, in from_arrow(data, dtype, device)
         30     return _from_arrow_array(data, dtype, device=device)
         31 elif isinstance(data, pa.Table):
    ---> 32     return _from_arrow_table(data, dtype, device=device)
         33 else:
         34     raise ValueError
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/interop_arrow.py:86, in _from_arrow_table(table, dtype, device)
         83     field = table.schema.field(i)
         85     assert len(table[i].chunks) == 1
    ---> 86     df_data[field.name] = _from_arrow_array(
         87         table[i].chunk(0),
         88         dtype=(
         89             # pyre-fixme[16]: `DType` has no attribute `get`.
         90             dtype.get(field.name)
         91             if dtype is not None
         92             else _arrowtype_to_dtype(field.type, field.nullable)
         93         ),
         94         device=device,
         95     )
         97 return dataframe(df_data, device=device)
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/interop_arrow.py:37, in _from_arrow_array(array, dtype, device)
         28 assert isinstance(array, pa.Array)
         30 # Using the most narrow type we can, we (i) don't restrict in any
         31 # way where it can be used (since we can pass a narrower typed
         32 # non-null column to a function expecting a nullable type, but not
       (...)
         35 # increase the amount of places we can use the from_arrow result
         36 # pyre-fixme[16]: `Array` has no attribute `type`.
    ---> 37 dtype_from_arrowtype = _arrowtype_to_dtype(array.type, array.null_count > 0)
         38 if dtype and (
         39     dt.get_underlying_dtype(dtype) != dt.get_underlying_dtype(dtype_from_arrowtype)
         40 ):
         41     raise NotImplementedError("Type casting is not supported")
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/_interop.py:205, in _arrowtype_to_dtype(t, nullable)
        199 if pa.types.is_struct(t):
        200     return dt.Struct(
        201         # pyre-fixme[16]: `DataType` has no attribute `__iter__`.
        202         [dt.Field(f.name, _arrowtype_to_dtype(f.type, f.nullable)) for f in t],
        203         nullable,
        204     )
    --> 205 raise NotImplementedError(f"Unsupported Arrow type: {str(t)}")
    
    NotImplementedError: Unsupported Arrow type: list<element: float>
    This exception is thrown by __iter__ of ParquetDFLoaderIterDataPipe()
    
    opened by grapefroot 2
Releases (v0.1.0)
  • v0.1.0 (Jul 13, 2022)

    We are excited to release the very first Beta version of TorchArrow! TorchArrow is a machine learning preprocessing library over batch data, providing a performant, Pandas-style, easy-to-use API for model development.

    Highlights

    TorchArrow provides a Python DataFrame that allows extensible UDFs with Velox, with the following features:

    • Seamless handoff with PyTorch or other model authoring, such as Tensor collation and easily plugging into PyTorch DataLoader and DataPipes
    • Zero copy for external readers via Arrow in-memory columnar format
    • Multiple execution runtimes support:
      • High-performance CPU backend via Velox
      • (Future Work) GPU backend via libcudf
    • High-performance C++ UDF support with vectorization

    Installation

    This release supports installation via PyPI: pip install torcharrow.

    Documentation

    You can find the API documentation here.

    This 10-minute tutorial provides a short introduction to TorchArrow, and you can also try it in this Colab.

    Examples

    You can find an example of integrating a TorchRec-based training loop with TorchArrow's on-the-fly preprocessing here. More examples are coming soon!

    Future Plans

    We hope to continue to expand the library, harden the API, and gather feedback to enable future releases. Stay tuned!

    Beta Usage Note

    TorchArrow is currently in the Beta stage and does not have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.

Owner
Facebook Research