PyEmits, a python package for easy manipulation in time-series data.

Related tags

Data AnalysisPyEmits
Overview

Project Icon

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life.

  • Engineering
  • FSI industry (Financial Services Industry)
  • FMCG (Fast Moving Consumer Good)

Data scientist's work consists of:

  • forecasting
  • prediction/simulation
  • data prepration
  • cleansing
  • anomaly detection
  • descriptive data analysis/exploratory data analysis

each new business unit shall build the following wheels again and again

  1. data pipeline
    1. extraction
    2. transformation
      1. cleansing
      2. feature engineering
      3. remove outliers
      4. AI landing for prediction, forecasting
    3. write it back to database
  2. ml framework
    1. multiple model training
    2. multiple model prediction
    3. kfold validation
    4. anomaly detection
    5. forecasting
    6. deep learning model in easy way
    7. ensemble modelling
  3. exploratory data analysis
    1. descriptive data analysis
    2. ...

That's why I create this project, also for fun. haha

This project is under active development, free to use (Apache 2.0) I am happy to see anyone can contribute for more advancement on features

Install

pip install pyemits

Features highlight

  1. Easy training
import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel

X = np.random.randint(1, 100, size=(1000, 10))
y = np.random.randint(1, 100, size=(1000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
  1. Accept neural network as model
import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

keras_lstm_model = KerasWrapper.from_simple_lstm_model((10, 10), 4)
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

also keep flexibility on customized model

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

model = Sequential()
model.add(LSTM(128,
               activation='softmax',
               input_shape=(10, 10),
               ))
model.add(Dropout(0.1))
model.add(Dense(4))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

keras_lstm_model = KerasWrapper(model, nickname='LSTM')
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

or attach it in algo config

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
from pyemits.common.config_model import KerasSequentialConfig

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

keras_lstm_model = KerasWrapper(nickname='LSTM')
config = KerasSequentialConfig(layer=[LSTM(128,
                                           activation='softmax',
                                           input_shape=(10, 10),
                                           ),
                                      Dropout(0.1),
                                      Dense(4)],
                               compile=dict(loss='mse', optimizer='adam', metrics=['mse']))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model],
                     [config],
                     raw_data_model, 
                     {'fit_config' : [dict(epochs=10, batch_size=32)]})
trainer.fit()

PyTorch, MXNet under development you can leave me a message if you want to contribute

  1. MultiOutput training
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, MultiOutputRegTrainer
from pyemits.core.preprocessing.splitting import SlidingWindowSplitter

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

# when use auto-regressive like MultiOutput, pls set ravel = True
# ravel = False, when you are using LSTM which support multiple dimension
splitter = SlidingWindowSplitter(24,24,ravel=True)
X, y = splitter.split(X, y)

raw_data_model = RegressionDataModel(X,y)
trainer = MultiOutputRegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
  1. Parallel training
    • provide fast training using parallel job
    • use RegTrainer as base, but add Parallel running
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, ParallelRegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = ParallelRegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

or you can use RegTrainer for multiple model, but it is not in Parallel job

import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel,  RegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()
  1. KFold training
    • KFoldConfig is global config, will apply to all
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel,  KFoldCVTrainer
from pyemits.common.config_model import KFoldConfig

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = KFoldCVTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model, {'kfold_config':KFoldConfig(n_splits=10)})
trainer.fit()
  1. Easy prediction
import numpy as np 
from pyemits.core.ml.regression.trainer import RegressionDataModel,  RegTrainer
from pyemits.core.ml.regression.predictor import RegPredictor

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

predictor = RegPredictor(trainer.clf_models, 'RegTrainer')
predictor.predict(RegressionDataModel(X))
  1. Forecast at scale
  2. Data Model
from pyemits.common.data_model import RegressionDataModel
import numpy as np
X = np.random.randint(1, 100, size=(1000,10,10))
y = np.random.randint(1, 100, size=(1000, 1))

data_model = RegressionDataModel(X, y)

data_model._update_variable('X_shape', (1000,10,10))
data_model.X_shape

data_model.add_meta_data('X_shape', (1000,10,10))
data_model.meta_data
  1. Anomaly detection (under development)
  2. Evaluation (under development)
    • see module: evaluation
    • backtesting
    • model evaluation
  3. Ensemble (under development)
    • blending
    • stacking
    • voting
    • by combo package
      • moa
      • aom
      • average
      • median
      • maximization
  4. IO
    • db connection
    • local
  5. dashboard ???
  6. other miscellaneous feature
    • continuous evaluation
    • aggregation
    • dimensional reduction
    • data profile (intensive data overview)
  7. to be confirmed

References

the following libraries gave me some idea/insight

  1. greykit
    1. changepoint detection
    2. model summary
    3. seaonality
  2. pytorch-forecasting
  3. darts
  4. pyaf
  5. orbit
  6. kats/prophets by facebook
  7. sktime
  8. gluon ts
  9. tslearn
  10. pyts
  11. luminaries
  12. tods
  13. autots
  14. pyodds
  15. scikit-hts
You might also like...
Python package to transfer data in a fast, reliable, and packetized form.

pySerialTransfer Python package to transfer data in a fast, reliable, and packetized form.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

A powerful data analysis package based on mathematical step functions.  Strongly aligned with pandas.
A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

The leading use-case for the staircase package is for the creation and analysis of step functions. Pretty exciting huh. But don't hit the close button

small package with utility functions for analyzing (fly) calcium imaging data
small package with utility functions for analyzing (fly) calcium imaging data

fly2p Tools for analyzing two-photon (2p) imaging data collected with Vidrio Scanimage software and micromanger. Loading scanimage data relies on scan

 Integrate bus data from a variety of sources (batch processing and real time processing).
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.
A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

Fast, flexible and easy to use probabilistic modelling in Python.
Fast, flexible and easy to use probabilistic modelling in Python.

Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

Releases(v0.1.2)
Owner
Thompson
Data Analyst, Scientist, Engineer, Research and Development
Thompson
An Integrated Experimental Platform for time series data anomaly detection.

Curve Sorry to tell contributors and users. We decided to archive the project temporarily due to the employee work plan of collaborators. There are no

Baidu 486 Dec 21, 2022
MapReader: A computer vision pipeline for the semantic exploration of maps at scale

MapReader A computer vision pipeline for the semantic exploration of maps at scale MapReader is an end-to-end computer vision (CV) pipeline designed b

Living with Machines 25 Dec 26, 2022
Gathering data of likes on Tinder within the past 7 days

tinder_likes_data Gathering data of Likes Sent on Tinder within the past 7 days. Versions November 25th, 2021 - Functionality to get the name and age

Alex Carter 12 Jan 05, 2023
Fitting thermodynamic models with pycalphad

ESPEI ESPEI, or Extensible Self-optimizing Phase Equilibria Infrastructure, is a tool for thermodynamic database development within the CALPHAD method

Phases Research Lab 42 Sep 12, 2022
Tkinter Izhikevich Neuron Model With Python

TKINTER IZHIKEVICH NEURON MODEL WITH PYTHON Hodgkin-Huxley Model It is a mathematical model for the generation and transmission of action potentials i

Rabia KOÇ 8 Jul 16, 2022
Analyse the limit order book in seconds. Zoom to tick level or get yourself an overview of the trading day.

Analyse the limit order book in seconds. Zoom to tick level or get yourself an overview of the trading day. Correlate the market activity with the Apple Keynote presentations.

2 Jan 04, 2022
Stock Analysis dashboard Using Streamlit and Python

StDashApp Stock Analysis Dashboard Using Streamlit and Python If you found the content useful and want to support my work, you can buy me a coffee! Th

StreamAlpha 27 Dec 09, 2022
Yet Another Workflow Parser for SecurityHub

YAWPS Yet Another Workflow Parser for SecurityHub "Screaming pepper" by Rum Bucolic Ape is licensed with CC BY-ND 2.0. To view a copy of this license,

myoung34 8 Dec 22, 2022
Datashredder is a simple data corruption engine written in python. You can corrupt anything text, images and video.

Datashredder is a simple data corruption engine written in python. You can corrupt anything text, images and video. You can chose the cha

2 Jul 22, 2022
HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets that can be described as multidimensional arrays o

HyperSpy 411 Dec 27, 2022
InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

CRISPRanalysis InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family. In this work, we present a workflow

2 Jan 31, 2022
Python utility to extract differences between two pandas dataframes.

Python utility to extract differences between two pandas dataframes.

Jaime Valero 8 Jan 07, 2023
a tool that compiles a csv of all h1 program stats

h1stats - h1 Program Stats Scraper This python3 script will call out to HackerOne's graphql API and scrape all currently active programs for informati

Evan 40 Oct 27, 2022
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 03, 2023
University Challenge 2021 With Python

University Challenge 2021 This repository contains: The TeX file of the technical write-up describing the University / HYPER Challenge 2021 under late

2 Nov 27, 2021
CPSPEC is an astrophysical data reduction software for timing

CPSPEC manual Introduction CPSPEC is an astrophysical data reduction software for timing. Various timing properties, such as power spectra and cross s

Tenyo Kawamura 1 Oct 20, 2021
Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

1 Feb 11, 2022
A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

6 Sep 07, 2022
This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

Ishan Hegde 1 Nov 17, 2021