Data from "Datamodels: Predicting Predictions with Training Data"

Overview

Data from "Datamodels: Predicting Predictions with Training Data"

Here we provide the data used in the paper "Datamodels: Predicting Predictions with Training Data" (arXiv, Blog).

Note that all of the data below is stored on Amazon S3 using the “requester pays” option to avoid a blowup in our data transfer costs (we put estimated AWS costs below)---if you are on a budget and do not mind waiting a bit longer, please contact us at [email protected] and we can try to arrange a free (but slower) transfer.

Citation

To cite this data, please use the following BibTeX entry:

@inproceedings{ilyas2022datamodels,
  title = {Datamodels: Predicting Predictions from Training Data},
  author = {Andrew Ilyas and Sung Min Park and Logan Engstrom and Guillaume Leclerc and Aleksander Madry},
  booktitle = {ArXiv preprint arXiv:2202.00622},
  year = {2022}
}

Overview

We provide the data used in our paper to analyze two image classification datasets: CIFAR-10 and (a modified version of) FMoW.

For each dataset, the data consists of two parts:

  1. Training data for datamodeling, which consists of:
    • Training subsets or "training masks", which are the independent variables of the regression tasks; and
    • Model outputs (correct-class margins and logits), which are the dependent variables of the regression tasks.
  2. Datamodels estimated from this data using LASSO.

For each dataset, there are multiple versions of the data depending on the choice of the hyperparameter α, the subsampling fraction (this is the random fraction of training examples on which each model is trained; see Section 2 of our paper for more information).

Following table shows the number of models we trained and used for estimating datamodels (also see Table 1 in paper):

Subsampling α (%) CIFAR-10 FMoW
10 1,500,000 N/A
20 750,000 375,000
50 300,000 150,000
75 600,000 300,000

Training data

For each dataset and $\alpha$, we provide the following data:

# M is the number of models trained
/{DATASET}/data/train_masks_{PCT}pct.npy  # [M x N_train] boolean
/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_test] np.float16
/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_train] np.float16

(The files live in the Amazon S3 bucket madrylab-datamodels; we provide instructions for acces in the next section.)

Each row of the above matrices corresponds to one instance of model trained; each column corresponds to a training or test example. CIFAR-10 examples are organized in the default order; for FMoW, see here. For example, a train mask for CIFAR-10 has the shape [M x 50,000].

For CIFAR-10, we also provide the full logits for all ten classes:

/cifar/data/train_logits_{PCT}pct.npy  # [M x N_test x 10] np.float16
/cifar/data/test_logits_{PCT}pct.npy   # [M x N_test x 10] np.float16

Note that you can also compute the margins from these logits.

We include an addtional 10,000 models for each setting that we used for evaluation; the total number of models in each matrix is M as indicated in the above table plus 10,000.

Datamodels

All estimated datamodels for each split (train or test) are provided as a dictionary in a .pt file (load with torch.load):

/{DATASET}/datamodels/train_{PCT}pct.pt
/{DATASET}/datamodels/test_{PCT}pct.pt

Each dictionary contains:

  • weight: matrix of shape N_train x N, where N is either N_train or N_test depending on the group of target examples
  • bias: vector of length N, corresponding to biases for each datamodel
  • lam: vector of length N, regularization λ chosen by CV for each datamodel

Downloading

We make all of our data available via Amazon S3. Total sizes of the training data files are as follows:

Dataset, α (%) masks, margins (GB) logits (GB)
CIFAR-10, 10 245 1688
CIFAR-10, 20 123 849
CIFAR-10, 50 49 346
CIFAR-10, 75 98 682
FMoW, 20 25.4 -
FMoW, 50 10.6 -
FMoW, 75 21.2 -

Total sizes of datamodels data (the model weights) are 16.9 GB for CIFAR-10 and 0.75 GB for FMoW.

API

You can download them using the Amazon S3 CLI interface with the requester pays option as follows (replacing the fields {...} as appropriate):

aws s3api get-object --bucket madrylab-datamodels \
                     --key {DATASET}/data/{SPLIT}_{DATA_TYPE}_{PCT}.npy \
                     --request-payer requester \
                     [OUT_FILE]

For example, to retrieve the test set margins for CIFAR-10 models trained on 50% subsets, use:

aws s3api get-object --bucket madrylab-datamodels \
                     --key cifar/data/test_margins_50pct.npy \
                     --request-payer requester \
                     test_margins_50pct.npy

Pricing

The total data transfer fee (from AWS to internet) for all of the data is around $374 (= 4155 GB x 0.09 USD per GB).

If you only download everything except for the logits (which is sufficient to reproduce all of our analysis), the fee is around $53.

Loading data

The data matrices are in numpy array format (.npy). As some of these are quite large, you can read small segments without reading the entire file into memory by additionally specifying the mmap_mode argument in np.load:

X = np.load('train_masks_10pct.npy', mmap_mode='r')
Y = np.load('test_margins_10pct.npy', mmap_mode='r')
...
# Use segments, e.g, X[:100], as appropriate
# Run regress(X, Y[:]) using choice of estimation algorithm.

FMoW data

We use a customized version of the FMoW dataset from WILDS (derived from this original dataset) that restricts the year of the training set to 2012. Our code is adapted from here.

To use the dataset, first download WILDS using:

pip install wilds

(see here for more detailed instructions).

In our paper, we only use the in-distribution training and test splits in our analysis (the original version from WILDS also has out-of-distribution as well as validation splits). Our dataset splits can be constructed as follows and used like a PyTorch dataset:

from fmow import FMoWDataset

ds = FMoWDataset(root_dir='/mnt/nfs/datasets/wilds/',
                     split_scheme='time_after_2016')

transform_steps = [
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]
transform = transforms.Compose(transform_steps)

ds_train = ds.get_subset('train', transform=transform)
ds_test = ds.get_subset('id_test', transform=transform)

The columns of matrix data described above is ordered according to the default ordering of examples given by the above constructors.

Owner
Madry Lab
Towards a Principled Science of Deep Learning
Madry Lab
This is an auto-ML tool specialized in detecting of outliers

Auto-ML tool specialized in detecting of outliers Description This tool will allows you, with a Dash visualization, to compare 10 models of machine le

1 Nov 03, 2021
Management of exclusive GPU access for distributed machine learning workloads

TensorHive is an open source tool for managing computing resources used by multiple users across distributed hosts. It focuses on granting

Paweł Rościszewski 131 Dec 12, 2022
A repository of PyBullet utility functions for robotic motion planning, manipulation planning, and task and motion planning

pybullet-planning (previously ss-pybullet) A repository of PyBullet utility functions for robotic motion planning, manipulation planning, and task and

Caelan Garrett 260 Dec 27, 2022
Basic Docker Compose for Machine Learning Purposes

Docker-compose for Machine Learning How to use: cd docker-ml-jupyterlab

Chris Chen 1 Oct 29, 2021
Code Repository for Machine Learning with PyTorch and Scikit-Learn

Code Repository for Machine Learning with PyTorch and Scikit-Learn

Sebastian Raschka 1.4k Jan 03, 2023
虚拟货币(BTC、ETH)炒币量化系统项目。在一版本的基础上加入了趋势判断

🎉 第二版本 🎉 (现货趋势网格) 介绍 在第一版本的基础上 趋势判断,不在固定点位开单,选择更优的开仓点位 优势: 🎉 简单易上手 安全(不用将api_secret告诉他人) 如何启动 修改app目录下的authorization文件

幸福村的码农 250 Jan 07, 2023
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 09, 2023
A simple machine learning package to cluster keywords in higher-level groups.

Simple Keyword Clusterer A simple machine learning package to cluster keywords in higher-level groups. Example: "Senior Frontend Engineer" -- "Fronte

Andrea D'Agostino 10 Dec 18, 2022
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and t

164 Jan 04, 2023
NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

SUN Group @ UMN 28 Aug 03, 2022
A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

802 Jan 01, 2023
UpliftML: A Python Package for Scalable Uplift Modeling

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base l

Booking.com 254 Dec 31, 2022
Hypernets: A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.

A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.

DataCanvas 216 Dec 23, 2022
Python module for performing linear regression for data with measurement errors and intrinsic scatter

Linear regression for data with measurement errors and intrinsic scatter (BCES) Python module for performing robust linear regression on (X,Y) data po

Rodrigo Nemmen 56 Sep 27, 2022
LightGBM + Optuna: no brainer

AutoLGBM LightGBM + Optuna: no brainer auto train lightgbm directly from CSV files auto tune lightgbm using optuna auto serve best lightgbm model usin

Rishiraj Acharya 22 Dec 15, 2022
This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev

MLProject_01 This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev Context Dataset English question data set file F

Hadi Nakhi 1 Dec 18, 2021
Python 3.6+ toolbox for submitting jobs to Slurm

Submit it! What is submitit? Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps

Facebook Incubator 768 Jan 03, 2023
Toolss - Automatic installer of hacking tools (ONLY FOR TERMUKS!)

Tools Автоматический установщик хакерских утилит (ТОЛЬКО ДЛЯ ТЕРМУКС!) Оригиналь

14 Jan 05, 2023
Python implementation of Weng-Lin Bayesian ranking, a better, license-free alternative to TrueSkill

Python implementation of Weng-Lin Bayesian ranking, a better, license-free alternative to TrueSkill This is a port of the amazing openskill.js package

Open Debates Project 156 Dec 14, 2022
Class-imbalanced / Long-tailed ensemble learning in Python. Modular, flexible, and extensible

IMBENS: Class-imbalanced Ensemble Learning in Python Language: English | Chinese/中文 Links: Documentation | Gallery | PyPI | Changelog | Source | Downl

Zhining Liu 176 Jan 04, 2023