A practical ML pipeline for data labeling with experiment tracking using DVC.

Overview

Auto Label Pipeline

A practical ML pipeline for data labeling with experiment tracking using DVC

Goals:

  • Demonstrate reproducible ML
  • Use DVC to build a pipeline and track experiments
  • Automatically clean noisy data labels using Cleanlab cross validation
  • Determine which FastText subword embedding performs better for semi-supervised cluster classification
  • Determine optimal hyperparameters through experiment tracking
  • Prepare casually labeled data for human evaluation

Demo: View Experiments recorded in git branches:

asciicast

The Data

For our working demo, we will purify some of the slightly noisy/dirty labels found in Wikidata people entries for attributes for Employers and Occupations. Our initial data labels have been harvested from a json dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

Data Input Format

Tab separated CSV files, with the fields:

  • text_data - the item that is to be labeled (single word or short group of words)
  • class_type - the class label
  • context - any text that surrounds the text_data field in situ, or defines the text_data item in other words.
  • count - the number of occurrences of this label; how common it appears in the existing data.

Data Output format

  • (same parameters as the data input plus)
  • date_updated - when the label was updated
  • previous_class_type - the previous class_type label
  • mislabeled_rank - records how low the confidence was prior to a re-label

The Pipeline

  • Fetch
  • Prepare
  • Train
  • Relabel

For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file, and parameterized via params.yaml.

Using/Extending the pipeline

  1. Drop your own CSV files into the data/raw directory
  2. Run the pipeline
  3. Tune settings, embeddings, etc, until no longer amused
  4. Verify your results manually and by submitting data/final/data.csv for human evaluation, using random sampling and drawing heavily from the mislabeled_rank entries.

Project Structure

├── LICENSE
├── README.md
├── data                    # <-- Directory with all types of data
│ ├── final                 # <-- Directory with final data
│ │ ├── class.metrics.csv   # <-- Directory with raw and intermediate data
│ │ └── data.csv            # <-- Pipeline output (not stored in git)
│ ├── interim               # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared              # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw                   # <-- Directory with raw data; populated by pipeline's fetch stage
│     ├── README.md
│     ├── cc.en.300.bin               # <-- Fasttext binary model file, creative commons 
│     ├── crawl-300d-2M-subword.bin   # <-- Fasttext binary model file, common crawl
│     ├── crawl-300d-2M-subword.vec
│     ├── employers.wikidata.csv      # <-- Our initial data, 1 set of class labels 
│     ├── lid.176.ftz
│     └── occupations.wikidata.csv    # <-- Our initial data, 1 set of class labels
├── dvc.lock                # <-- DVC internal state tracking file
├── dvc.yaml                # <-- DVC project configuration file
├── dvc_plots               # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json    # <-- Metrics from the pipeline's train stage  
├── mypy.ini
├── params.yaml             # <-- Parameter configuration file for the pipeline
├── reports                 # <-- Directory with metrics output
│ ├── prepare.metrics.json  
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                     # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py

Setup

Create environment

conda create --name auto-label-pipeline python=3.9

conda activate auto-label-pipeline

Install requirements

pip install -r requirements.txt

If you're going to modify the source, also install the requirements-dev.txt file


Reproduce the pipeline results locally

dvc repro

View Metrics

dvc metrics show

See also: DVC metrics

Working with Experiments

To see your local experiments:

dvc exp show

Experiments that have been turned into a branches can be referenced directly in commands:

dvc exp diff svc_linear_ex svc_rbf_ex

e.g. to compare experiments:

dvc exp diff [experiment branch name] [experiment branch 2 name]

e.g.:

dvc exp diff svc_linear_ex svc_rbf_ex

dvc exp diff svc_poly_ex svc_rbf_ex

To create an experiment by changing a parameter:

dvc exp run --set-param train.split=0.9 --name my_split_ex

(When promoting an experiment to a branch, DVC does not switch into the branch.)

To save and share your experiment in a branch:

dvc exp branch my_split_ex my_split_ex_branch

See also: DVC Experiments

View plots

Initial Confusion matrix:

dvc plots show model/class.metrics.csv -x actual -y predicted --template confusion

Confusion matrix after relabeling:

dvc plots show data/final/class.metrics.csv -x actual -y predicted --template confusion

See also: DVC plots


Conclusions

  • For relabeling and cleaning, it's important to have more than two labels, and to specifying an UNK label for: unknown; labels spanning multiple groups; or low confidence support.
  • Standardizing the input data formats allow users to flexibly use many different data sources.
  • Language detection is an important part of data cleaning, however problematic because:
    • Modern languages sometimes "borrow" words from other languages (but not just any words!)
    • Language detection models perform inference poorly with limited data, especially just a single word.
    • Normalization utilities, such as unidecode aren't helpful; (the wrong word in more readable letters is still the wrong word).
  • Experimentation parameters often have co-dependencies that would make a simple combinatorial grid search inefficient.

Recommended readings:

  • Confident Learning: Estimating Uncertainty in Dataset Labels by Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang, 31 Oct 2019, arxiv
  • A Simple but tough-to-beat baseline for sentence embeddings by Sanjeev Arora, Yingyu Liang, Tengyu Ma, ICLR 2017, paper
  • Support Vector Clustering by Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik, November 2001 Journal of Machine Learning Research 2 (12):125-137, DOI:10.1162/15324430260185565, paper
  • SVM clustering by Winters-Hilt, S., Merat, S. BMC Bioinformatics 8, S18 (2007). link, paper

Note: this repo layout borrows heavily from the Cookie Cutter Data Science Layout If you're not familiar with it, please check it out.

Owner
Todd Cook
Software craftsman
Todd Cook
Implementation of STAM (Space Time Attention Model), a pure and simple attention model that reaches SOTA for video classification

STAM - Pytorch Implementation of STAM (Space Time Attention Model), yet another pure and simple SOTA attention model that bests all previous models in

Phil Wang 109 Dec 28, 2022
PyTorch implementation of InstaGAN: Instance-aware Image-to-Image Translation

InstaGAN: Instance-aware Image-to-Image Translation Warning: This repo contains a model which has potential ethical concerns. Remark that the task of

Sangwoo Mo 827 Dec 29, 2022
Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators

Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators This is our Pytorch implementation for t

RUCAIBox 12 Jul 22, 2022
My coursework for Machine Learning (2021 Spring) at National Taiwan University (NTU)

Machine Learning 2021 Machine Learning (NTU EE 5184, Spring 2021) Instructor: Hung-yi Lee Course Website : (https://speech.ee.ntu.edu.tw/~hylee/ml/202

100 Dec 26, 2022
[제 13회 투빅스 컨퍼런스] OK Mugle! - 장르부터 멜로디까지, Content-based Music Recommendation

Ok Mugle! 🎵 장르부터 멜로디까지, Content-based Music Recommendation 'Ok Mugle!'은 제13회 투빅스 컨퍼런스(2022.01.15)에서 진행한 음악 추천 프로젝트입니다. Description 📖 본 프로젝트에서는 Kakao

SeongBeomLEE 5 Oct 09, 2022
High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

TL;DR Ignite is a high-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently. Click on the image to

4.2k Jan 01, 2023
Code for Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games How to run our algorithm? Create the new environment using: conda

MARL @ SJTU 8 Dec 27, 2022
I-BERT: Integer-only BERT Quantization

I-BERT: Integer-only BERT Quantization HuggingFace Implementation I-BERT is also available in the master branch of HuggingFace! Visit the following li

Sehoon Kim 139 Dec 27, 2022
Search Youtube Video and Get Video info

PyYouTube Get Video Data from YouTube link Installation pip install PyYouTube How to use it ? Get Videos Data from pyyoutube import Data yt = Data("ht

lokaman chendekar 35 Nov 25, 2022
A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.

collie_recs Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Coll

ShopRunner 97 Jan 03, 2023
A state-of-the-art semi-supervised method for image recognition

Mean teachers are better role models Paper ---- NIPS 2017 poster ---- NIPS 2017 spotlight slides ---- Blog post By Antti Tarvainen, Harri Valpola (The

Curious AI 1.4k Jan 06, 2023
Spectral Tensor Train Parameterization of Deep Learning Layers

Spectral Tensor Train Parameterization of Deep Learning Layers This repository is the official implementation of our AISTATS 2021 paper titled "Spectr

Anton Obukhov 12 Oct 23, 2022
code for our ECCV-2020 paper: Self-supervised Video Representation Learning by Pace Prediction

Video_Pace This repository contains the code for the following paper: Jiangliu Wang, Jianbo Jiao and Yunhui Liu, "Self-Supervised Video Representation

Jiangliu Wang 95 Dec 14, 2022
Keras-1D-NN-Classifier

Keras-1D-NN-Classifier This code is based on the reference codes linked below. reference 1, reference 2 This code is for 1-D array data classification

Jae-Hoon Shim 6 May 18, 2021
[ACM MM2021] MGH: Metadata Guided Hypergraph Modeling for Unsupervised Person Re-identification

Introduction This project is developed based on FastReID, which is an ongoing ReID project. Projects BUC In projects/BUC, we implement AAAI 2019 paper

WuYiming 7 Apr 13, 2022
Computer Vision is an elective course of MSAI, SCSE, NTU, Singapore

[AI6122] Computer Vision is an elective course of MSAI, SCSE, NTU, Singapore. The repository corresponds to the AI6122 of Semester 1, AY2021-2022, starting from 08/2021. The instructor of this course

HT. Li 5 Sep 12, 2022
Official code of "R2RNet: Low-light Image Enhancement via Real-low to Real-normal Network."

R2RNet Official code of "R2RNet: Low-light Image Enhancement via Real-low to Real-normal Network." Jiang Hai, Zhu Xuan, Ren Yang, Yutong Hao, Fengzhu

77 Dec 24, 2022
Show-attend-and-tell - TensorFlow Implementation of "Show, Attend and Tell"

Show, Attend and Tell Update (December 2, 2016) TensorFlow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attent

Yunjey Choi 902 Nov 29, 2022
Keywords : Streamlit, BertTokenizer, BertForMaskedLM, Pytorch

Next Word Prediction Keywords : Streamlit, BertTokenizer, BertForMaskedLM, Pytorch 🎬 Project Demo ✔ Application is hosted on Streamlit. You can see t

Vivek7 3 Aug 26, 2022
GPU-accelerated PyTorch implementation of Zero-shot User Intent Detection via Capsule Neural Networks

GPU-accelerated PyTorch implementation of Zero-shot User Intent Detection via Capsule Neural Networks This repository implements a capsule model Inten

Joel Huang 15 Dec 24, 2022