Lightweight, fast, and robust columnar dataframe for data analytics with online updates

Last update: May 19, 2022

streamdf

Streamdf is a lightweight dataframe library built on top of a dictionary of NumPy arrays, developed for Kaggle's time-series code competitions.

Key Features

Fast and robust insertion
- Inserting a row takes amortized constant time (much faster than np.append, which copies the whole array on every call)
- Automatically falls back to the default value when an abnormal value is inserted
Time travel
- Get a past state of the data as a slice of the original dataframe, without copying
Null/empty-safe aggregations
- Provides a set of aggregation methods that can be called safely when a column contains NaN values or is empty
Columnar layout
- Internal data is stored in a simple columnar format, which is easier to use for analysis than NumPy's structured arrays
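
To make the first three features concrete, here is a minimal sketch of how amortized constant-time insertion, default-value fallback, and NaN/empty-safe aggregation can be built on a dictionary of NumPy arrays. It is an illustration only, not streamdf's actual implementation: the class GrowableColumns, its capacity and defaults parameters, and the helper safe_mean are all hypothetical names introduced for this example.

```python
import numpy as np


class GrowableColumns:
    """Hypothetical illustration; NOT streamdf's real implementation."""

    def __init__(self, dtypes, defaults=None, capacity=4):
        self._defaults = defaults or {}
        self._capacity = capacity
        self._length = 0
        # One preallocated buffer per column (columnar layout).
        self._data = {
            name: np.empty(capacity, dtype=np.dtype(dt))
            for name, dt in dtypes.items()
        }

    def __len__(self):
        return self._length

    def __getitem__(self, name):
        # View of the filled prefix of the buffer; no copy is made.
        return self._data[name][:self._length]

    def _grow(self):
        # Doubling the capacity keeps the average cost per append constant.
        self._capacity *= 2
        for name, old in list(self._data.items()):
            new = np.empty(self._capacity, dtype=old.dtype)
            new[:self._length] = old[:self._length]
            self._data[name] = new

    def append(self, row):
        if self._length == self._capacity:
            self._grow()
        for name, column in self._data.items():
            default = self._defaults.get(name, 0)
            try:
                column[self._length] = row.get(name, default)
            except (TypeError, ValueError, OverflowError):
                # Abnormal value: fall back to the column default.
                column[self._length] = default
        self._length += 1


def safe_mean(values, default=np.nan):
    """Mean that ignores NaN and returns `default` for empty input."""
    values = np.asarray(values, dtype=float)
    valid = values[~np.isnan(values)]
    return float(valid.mean()) if valid.size else default


cols = GrowableColumns({'x': np.int32}, defaults={'x': -1}, capacity=2)
for value in [1, 2, 'oops', 4, 5]:
    cols.append({'x': value})
print(len(cols), cols['x'])
```

By contrast, np.append allocates a new array and copies every existing element on each call, so building a column row by row with it costs time quadratic in the number of rows.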

Example

import pandas as pd
from streamdf import StreamDf

df = pd.read_csv('test.csv')
sdf = StreamDf.from_pandas(df)

# extend: append a single row, given as a dict of column name -> value
sdf.extend({
    'x': 1,
    'y': 2
})

assert len(sdf) == len(df) + 1

# access a column by name
print(sdf['x'])

# aggregate: last value of column 'x'
print(sdf.last_value('x'))

# Second example: start from an empty frame and slice it by time
import numpy as np
from streamdf import StreamDf

sdf = StreamDf.empty({'x': np.int32, 'time': 'datetime64[D]'}, 'time')

sdf.extend({'x': 1, 'time': np.datetime64('2018-01-01')})
sdf.extend({'x': 5, 'time': np.datetime64('2018-02-01')})
sdf.extend({'x': 3, 'time': np.datetime64('2018-02-03')})

assert len(sdf) == 3

# Time travel (zero copy)
sliced = sdf.slice_until(np.datetime64('2018-02-02'))

assert len(sliced) == 2
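
The zero-copy behaviour of time travel can be sketched with plain NumPy. This is an assumption about how such a slice could work, not a description of streamdf's internals: the function slice_until_sketch is hypothetical, it assumes the time column is sorted in ascending order, and it treats the cutoff as exclusive (whether slice_until itself includes the cutoff is not stated above).

```python
import numpy as np

# Assumed layout: a dict of equally long arrays, sorted by the 'time' column.
columns = {
    'x': np.array([1, 5, 3], dtype=np.int32),
    'time': np.array(['2018-01-01', '2018-02-01', '2018-02-03'],
                     dtype='datetime64[D]'),
}


def slice_until_sketch(columns, time_key, until):
    """Hypothetical helper: keep only the rows strictly before `until`."""
    # Binary search for the first row whose time is >= `until`.
    end = np.searchsorted(columns[time_key], until, side='left')
    # Basic slicing returns views, so no column data is copied.
    return {name: values[:end] for name, values in columns.items()}


past = slice_until_sketch(columns, 'time', np.datetime64('2018-02-02'))
print(len(past['x']))
print(np.shares_memory(past['x'], columns['x']))
```

Because the result is a view, writing into the sliced arrays would also modify the original columns.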
