A rule learning algorithm for the deduction of syndrome definitions from time series data.

Overview

README

This project provides a rule learning algorithm for the deduction of syndrome definitions from time series data. Large parts of the algorithm are based on "BOOMER".

Features

The algorithm that is provided by this project currently supports the following functionalities for learning descriptive rules:

  • The quality of rules is assessed by comparing the predictions of the current model to the ground truth in terms of the Pearson correlation coefficient.
  • When learning a new rule, random samples of the features may be used.
  • Hyper-parameters that provide control over the specificity/generality of rules are available.
  • The algorithm can natively handle numerical, ordinal and nominal features (without the need for pre-processing techniques such as one-hot encoding).
  • The algorithm is able to deal with missing feature values, i.e., occurrences of NaN in the feature matrix.

In addition, the following features that may speed up training or reduce the memory footprint are currently implemented:

  • Dense or sparse feature matrices can be used for training. The use of sparse matrices may speed-up training significantly on some data sets.
  • Multi-threading can be used to parallelize the evaluation of a rule's potential refinements across multiple CPU cores.

Project structure

|-- cpp                     Contains the implementation of core algorithms in C++
    |-- subprojects
        |-- common          Contains implementations that all algorithms have in common
        |-- tsa             Contains implementations for time series analysis
    |-- ...
|-- python                  Contains Python code for running experiments
    |-- rl
        |-- common          Contains Python code that is needed to run any kind of algorithms
            |-- cython      Contains commonly used Cython wrappers
            |-- ...
        |-- tsa             Contains Python code for time series analysis
            |-- cython      Contains time series-specific Cython wrappers
            |-- ...
        |-- testbed         Contains useful functionality for running experiments
            |-- ...
    |-- main.py             Can be used to start an experiment
    |-- ...
|-- Makefile                Makefile for compilation
|-- ...

Project setup

The algorithm provided by this project is implemented in C++. In addition, a Python wrapper that implements the scikit-learn API is available. To be able to integrate the underlying C++ implementation with Python, Cython is used.

The C++ implementation, as well as the Cython wrappers, must be compiled in order to be able to run the provided algorithm. To facilitate compilation, this project comes with a Makefile that automatically executes the necessary steps.

At first, a virtual Python environment can be created via the following command:

make venv

As a prerequisite, Python 3.7 (or a more recent version) must be available on the host system. All compile-time dependencies (numpy, scipy, Cython, meson and ninja) that are required for building the project will automatically be installed into the virtual environment. As a result of executing the above command, a subdirectory venv should have been created within the project's root directory.

Afterwards, the compilation can be started by executing the following command:

make compile

Finally, the library must be installed into the virtual environment, together with all of its runtime dependencies (e.g. scikit-learn, a full list can be found in setup.py). For this purpose, the project's Makefile provides the following command:

make install

Whenever any C++ or Cython source files have been modified, they must be recompiled by running the command make compile again! If compilation files do already exist, only the modified files will be recompiled.

Cleanup: To get rid of any compilation files, as well as of the virtual environment, the following command can be used:

make clean

For more fine-grained control, the command make clean_venv (for deleting the virtual environment) or make clean_compile (for deleting the compiled files) can be used. If only the compiled Cython files should be removed, the command make clean_cython can be used. Accordingly, the command make clean_cpp removes the compiled C++ files.

Parameters

The file python/main.py allows to run experiments on a specific data set using different configurations of the learning algorithm. The implementation takes care of writing the experimental results into .csv files and the learned model can (optionally) be stored on disk to reuse it later.

In order to run an experiment, the following command line arguments must be provided (most of them are optional):

Parameter Optional? Default Description
--data-dir No None The path of the directory where the data sets are located.
--temp-dir No None The path of the directory where temporary files should be saved.
--dataset No None The name of the .csv files that store the raw data (without suffix).
--feature-definition No None The name of the .txt file that specifies the names of the features to be used (without suffix).
--from-year No None The first year (inclusive) that should be taken into account.
--to-year No None The last year (inclusive) that should be taken into account.
--from-week Yes -1 The first week (inclusive) of the first year that should be taken into account or -1, if all weeks of that year should be used.
--to-week Yes -1 The last week (inclusive) of the last year that should be taken into account or -1, if all weeks of that year should be used.
--count-file-name Yes None The name of the file that stores the number of cases that correspond to individual weeks (without suffix). If not specified, the results from appending "_counts" to the dataset name.
--one-hot-encoding Yes False True, if one-hot-encoding should be used for nominal attributes, False otherwise.
--output-dir Yes None The path of the directory into which the experimental results (.csv files) should be written.
--print-rules Yes True True, if the induced rules should be printed on the console, False otherwise.
--store-rules Yes True True, if the induced rules should be stored as a .txt file, False otherwise. Does only have an effect if the parameter --output-dir is specified.
--print-options Yes {} A dictionary that specifies additional options to be used for printing or storing rules, if the parameter --print-rules and/or --store-rules is set to True, e.g. {'print_feature_names':True,'print_label_names':True,'print_nominal_values':True}.
--store-predictions Yes True True, if the predictions for the training data should be stored as a .csv file, False otherwise. Does only have an effect if the parameter --output-dir is specified.
--model-dir Yes None The path of the directory where models (.model files) are located.
--max-rules Yes 50 The maximum number of rules to be induced or -1, if the number of rules should not be restricted.
--time-limit Yes -1 The duration in seconds after which the induction of rules should be canceled or -1, if no time limit should be used.
--feature-sub-sampling Yes None The name of the strategy to be used for feature sub-sampling. Must be random-feature-selection or None. Additional arguments may be provided as a dictionary, e.g. random_feature-selection{'sample_size':0.5}.
--min-support Yes 0.0001 The percentage of training examples that must be covered by a rule. Must be greater than 0 and smaller than 1.
--max-conditions Yes -1 The maximum number of conditions to be included in a rule's body. Must be at least 1 or -1, if the number of conditions should not be restricted.
--random-state Yes 1 The seed to the be used by random number generators.
--feature-format Yes auto The format to be used for the feature matrix. Must be sparse, if a sparse matrix should be used, dense, if a dense matrix should be used, or auto, if the format should be chosen automatically.
--num-threads-refinement Yes 1 The number of threads to be used to search for potential refinements of rules. Must be at least 1 or -1, if the number of cores that are available on the machine should be used.
--log-level Yes info The log level to be used. Must be debug, info, warn, warning, error, critical, fatal or notset.

Example and data format

In the following, we give a more detailed description of the data that must be provided to the algorithm. All input files must use UTF-8 encoding and they must be available in a single directory. The path of the directory must be specified via the parameter --data-dir. The following files must be included in the directory:

  • A .csv file that stores the raw training data (see data/example.csv for an example). Each row (separated by line breaks) must correspond to an individual instance and the columns (separated by commas) must correspond to the available features. The names of the columns/features must be given as the first row. The names of columns can be arbitrary, but there must be a column named "week" that associates each instance with a corresponding year and week (using the format year-month, e.g. 2019-2).
  • A .csv file that specifies the number of cases that correspond to individual weeks (see data/example_counts.csv for an example). The file must consist of three columns, year,week,cases, separated by commas. The names of columns must be given as the first row. Each of the other rows (separated by line breaks) assigns a specific number of cases to a certain week of a year (all values must be positive integers). For each combination of year and week that occurs in the column "week" of the first .csv file, the number of cases must be specified in this second .csv file.
  • A .txt file that specifies the names of the features that should be taken into account (see data/features.txt for an example). Each feature name must be given as a new line. For each feature that is specified in the text file, a column with the same name must exist in the first .csv file.

The parameter --dataset is used to identify the .csv files that should be used by the algorithm. Its value must correspond to the name of the first .csv file mentioned above, omitting the file's suffix (e.g. example if the file's name is example.csv). The second .csv file must be named accordingly by appending the suffix _counts to the name of the first file (e.g. example_counts.csv). The parameter --feature-definition is used to specify the name of the text file that stores the names of relevant features. The given value must correspond to the name of the text file, again omitting the file's suffix (e.g. features, if the file's name is features.txt).

In the following, the command for running an experiment, including all mandatory parameters, can be seen:

venv/bin/python3 python/main.py --data-dir /path/to/data/ --temp-dir /path/to/temp/ --dataset example --feature-definition features --from-year 2018 --to-year 2019

When running the program for the first time, the .csv files that are located in the specified data directory will be loaded. The data will be filtered according to the parameters --from-year and --to-year, such that only instances that belong to the specified timespan are retained. Furthermore, all columns that are missing from the supplied text file will be removed. Finally, the data is converted into the format that is required for learning a rule model. This results in two files (an .arff file and a .xml file) that are written to the directory that is specified via the parameter --temp-dir. The resulting files are named according to the following scheme: <dataset>_<feature-definition>_<from-year>-<to-year> (e.g., example_features_2018-2019.) When running the program multiple times, it will check if the files do already exist. If this is the case, the preprocessing step will be skipped and the available files will be used as they are.

You might also like...
Rule-based Customer Segmentation
Rule-based Customer Segmentation

Rule-based Customer Segmentation Business Problem A game company wants to create level-based new customer definitions (personas) by using some feature

Rule based classification A hotel s customers dataset

Rule-based-classification-A-hotel-s-customers-dataset- Aim: Categorize new customers by segment and predict how much revenue they can generate This re

PyExplainer: A Local Rule-Based Model-Agnostic Technique (Explainable AI)
PyExplainer: A Local Rule-Based Model-Agnostic Technique (Explainable AI)

PyExplainer PyExplainer is a local rule-based model-agnostic technique for generating explanations (i.e., why a commit is predicted as defective) of J

Continuous Security Group Rule Change Detection & Response at scale
Continuous Security Group Rule Change Detection & Response at scale

Introduction Get notified of Security Group Changes across all AWS Accounts & Regions in an AWS Organization, with the ability to respond/revert those

A rule-based log analyzer & filter

Flog 一个根据规则集来处理文本日志的工具。 前言 在日常开发过程中,由于缺乏必要的日志规范,导致很多人乱打一通,一个日志文件夹解压缩后往往有几十万行。 日志泛滥会导致信息密度骤减,给排查问题带来了不小的麻烦。 以前都是用grep之类的工具先挑选出有用的,再逐条进行排查,费时费力。在忍无可忍之后决

The source code and data of the paper "Instance-wise Graph-based Framework for Multivariate Time Series Forecasting".

IGMTF The source code and data of the paper "Instance-wise Graph-based Framework for Multivariate Time Series Forecasting". Requirements The framework

TAug :: Time Series Data Augmentation using Deep Generative Models

TAug :: Time Series Data Augmentation using Deep Generative Models Note!!! The package is under development so be careful for using in production! Fea

A real world application of a Recurrent Neural Network on a binary classification of time series data
A real world application of a Recurrent Neural Network on a binary classification of time series data

What is this This is a real world application of a Recurrent Neural Network on a binary classification of time series data. This project includes data

A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

Releases(0.1.0)
  • 0.1.0(Sep 24, 2021)

    The first release of the algorithm. It supports the following functionalities for learning descriptive rules:

    • The quality of rules is assessed by comparing the predictions of the current model to the ground truth in terms of the Pearson correlation coefficient.
    • When learning a new rule, random samples of the features may be used.
    • Hyper-parameters that provide control over the specificity/generality of rules are available.
    • The algorithm can natively handle numerical, ordinal and nominal features (without the need for pre-processing techniques such as one-hot encoding).
    • The algorithm is able to deal with missing feature values, i.e., occurrences of NaN in the feature matrix.
    Source code(tar.gz)
    Source code(zip)
This is the official implementation of "One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval".

CORA This is the official implementation of the following paper: Akari Asai, Xinyan Yu, Jungo Kasai and Hannaneh Hajishirzi. One Question Answering Mo

Akari Asai 59 Dec 28, 2022
Keras implementation of the GNM model in paper ’Graph-Based Semi-Supervised Learning with Nonignorable Nonresponses‘

Graph-based joint model with Nonignorable Missingness (GNM) This is a Keras implementation of the GNM model in paper ’Graph-Based Semi-Supervised Lear

Fan Zhou 2 Apr 17, 2022
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. Website • Key Features • How To Use • Docs •

Pytorch Lightning 21.1k Jan 01, 2023
Patch Rotation: A Self-Supervised Auxiliary Task for Robustness and Accuracy of Supervised Models

Patch-Rotation(PatchRot) Patch Rotation: A Self-Supervised Auxiliary Task for Robustness and Accuracy of Supervised Models Submitted to Neurips2021 To

4 Jul 12, 2021
Efficient electromagnetic solver based on rigorous coupled-wave analysis for 3D and 2D multi-layered structures with in-plane periodicity

Efficient electromagnetic solver based on rigorous coupled-wave analysis for 3D and 2D multi-layered structures with in-plane periodicity, such as gratings, photonic-crystal slabs, metasurfaces, surf

Alex Song 17 Dec 19, 2022
:boar: :bear: Deep Learning based Python Library for Stock Market Prediction and Modelling

bulbea "Deep Learning based Python Library for Stock Market Prediction and Modelling." Table of Contents Installation Usage Documentation Dependencies

Achilles Rasquinha 1.8k Jan 05, 2023
Universal Adversarial Examples in Remote Sensing: Methodology and Benchmark

Universal Adversarial Examples in Remote Sensing: Methodology and Benchmark Yong

19 Dec 17, 2022
QuALITY: Question Answering with Long Input Texts, Yes!

QuALITY: Question Answering with Long Input Texts, Yes! Authors: Richard Yuanzhe Pang,* Alicia Parrish,* Nitish Joshi,* Nikita Nangia, Jason Phang, An

ML² AT CILVR 61 Jan 02, 2023
Official repository for "Orthogonal Projection Loss" (ICCV'21)

Orthogonal Projection Loss (ICCV'21) Kanchana Ranasinghe, Muzammal Naseer, Munawar Hayat, Salman Khan, & Fahad Shahbaz Khan Paper Link | Project Page

Kanchana Ranasinghe 83 Dec 26, 2022
Official PyTorch implementation of MAAD: A Model and Dataset for Attended Awareness

MAAD: A Model for Attended Awareness in Driving Install // Datasets // Training // Experiments // Analysis // License Official PyTorch implementation

7 Oct 16, 2022
This is the code for HOI Transformer

HOI Transformer Code for CVPR 2021 accepted paper End-to-End Human Object Interaction Detection with HOI Transformer. Reproduction We recomend you to

BigBangEpoch 124 Dec 29, 2022
Offcial implementation of "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction, ICCV-2021".

HF2-VAD Offcial implementation of "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Predictio

76 Dec 21, 2022
Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch.

SE3 Transformer - Pytorch Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch. May be needed for replicating Alphafold2 resu

Phil Wang 207 Dec 23, 2022
Intelligent Video Analytics toolkit based on different inference backends.

English | 中文 OpenIVA OpenIVA is an end-to-end intelligent video analytics development toolkit based on different inference backends, designed to help

Quantum Liu 15 Oct 27, 2022
Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

NÜWA - Pytorch (wip) Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch. This repository will be popul

Phil Wang 463 Dec 28, 2022
Pytorch implementation of the paper "Optimization as a Model for Few-Shot Learning"

Optimization as a Model for Few-Shot Learning This repo provides a Pytorch implementation for the Optimization as a Model for Few-Shot Learning paper.

Albert Berenguel Centeno 238 Jan 04, 2023
[ECCVW2020] Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DiMP)

Feel free to visit my homepage Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DIMP) [ECCVW2020 paper] Presentation

Seokeon Choi 35 Oct 26, 2022
PyTorch implementation of CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

PyTorch implementation of CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition The unofficial code of CDistNet. Now, we ha

25 Jul 20, 2022
code for paper "Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning" by Zhongzheng Ren*, Raymond A. Yeh*, Alexander G. Schwing.

Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning Overview This code is for paper: Not All Unlabeled Data are Equa

Jason Ren 22 Nov 23, 2022
Reviatalizing Optimization for 3D Human Pose and Shape Estimation: A Sparse Constrained Formulation

Reviatalizing Optimization for 3D Human Pose and Shape Estimation: A Sparse Constrained Formulation This is the implementation of the approach describ

Taosha Fan 47 Nov 15, 2022