Data from "Datamodels: Predicting Predictions from Training Data"

Here we provide the data used in the paper "Datamodels: Predicting Predictions from Training Data" (arXiv, Blog).

Note that all of the data below is stored on Amazon S3 using the "requester pays" option to avoid a blowup in our data transfer costs (we list estimated AWS costs below). If you are on a budget and do not mind waiting a bit longer, please contact us at [email protected] and we can try to arrange a free (but slower) transfer.

Citation

To cite this data, please use the following BibTeX entry:

@inproceedings{ilyas2022datamodels,
  title = {Datamodels: Predicting Predictions from Training Data},
  author = {Andrew Ilyas and Sung Min Park and Logan Engstrom and Guillaume Leclerc and Aleksander Madry},
  booktitle = {ArXiv preprint arXiv:2202.00622},
  year = {2022}
}

Overview

We provide the data used in our paper to analyze two image classification datasets: CIFAR-10 and (a modified version of) FMoW.

For each dataset, the data consists of two parts:

  1. Training data for datamodeling, which consists of:
    • Training subsets or "training masks", which are the independent variables of the regression tasks; and
    • Model outputs (correct-class margins and logits), which are the dependent variables of the regression tasks.
  2. Datamodels estimated from this data using LASSO.

For each dataset, there are multiple versions of the data depending on the choice of the hyperparameter α, the subsampling fraction (this is the random fraction of training examples on which each model is trained; see Section 2 of our paper for more information).
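
To make the role of α concrete, here is a minimal sketch of how boolean training masks of this kind could be generated. It assumes each model is trained on exactly round(α · N_train) examples chosen uniformly without replacement; the function name `sample_train_masks` and the toy sizes are illustrative and are not part of the released code.

```python
import numpy as np

def sample_train_masks(num_models, n_train, alpha, seed=0):
    """Return a [num_models x n_train] boolean matrix with one random subset per row.

    Assumption: every subset has exactly round(alpha * n_train) members.
    """
    rng = np.random.default_rng(seed)
    subset_size = int(round(alpha * n_train))
    masks = np.zeros((num_models, n_train), dtype=bool)
    for row in masks:  # each row is a view, so writes land in `masks`
        chosen = rng.choice(n_train, size=subset_size, replace=False)
        row[chosen] = True
    return masks

# Toy example: 4 models, 20 training examples, alpha = 50%.
masks = sample_train_masks(num_models=4, n_train=20, alpha=0.5)
print(masks.sum(axis=1))  # number of training examples seen by each model
```

Each row plays the role of one row of the train_masks files described below, at a much smaller scale.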

The following table shows the number of models we trained and used for estimating datamodels (see also Table 1 in the paper):

Subsampling α (%)   CIFAR-10    FMoW
10                  1,500,000   N/A
20                  750,000     375,000
50                  300,000     150,000
75                  600,000     300,000

Training data

For each dataset and α, we provide the following data:

# M is the number of models trained
/{DATASET}/data/train_masks_{PCT}pct.npy  # [M x N_train] boolean
/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_test] np.float16
/{DATASET}/data/train_margins_{PCT}pct.npy # [M x N_train] np.float16

(The files live in the Amazon S3 bucket madrylab-datamodels; we provide instructions for access in the Downloading section below.)

Each row of the above matrices corresponds to one trained model; each column corresponds to a training or test example. CIFAR-10 examples are organized in the default order; for FMoW, see the FMoW data section below. For example, a train mask for CIFAR-10 has the shape [M x 50,000].

For CIFAR-10, we also provide the full logits for all ten classes:

/cifar/data/train_logits_{PCT}pct.npy  # [M x N_train x 10] np.float16
/cifar/data/test_logits_{PCT}pct.npy   # [M x N_test x 10] np.float16

Note that you can also compute the margins from these logits.
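
As a rough sketch of that computation, the snippet below derives correct-class margins from a logits array. It assumes the margin is the correct-class logit minus the largest incorrect-class logit (the definition used in the paper) and that you supply the integer class labels yourself, since they are not stored in the logits files. The function name `correct_class_margins` is illustrative.

```python
import numpy as np

def correct_class_margins(logits, labels):
    """logits: [M x N x C] array; labels: [N] integer classes. Returns [M x N] margins."""
    logits = np.asarray(logits, dtype=np.float32)  # upcast from float16 before subtracting
    labels = np.asarray(labels)
    cols = np.arange(logits.shape[1])
    correct = logits[:, cols, labels]              # correct-class logit, shape [M x N]
    others = logits.copy()
    others[:, cols, labels] = -np.inf              # exclude the correct class from the max
    return correct - others.max(axis=2)

# Toy example: one model, two examples, three classes.
toy_logits = np.array([[[2.0, 1.0, 0.0],
                        [0.0, 3.0, 1.0]]], dtype=np.float16)
toy_labels = np.array([0, 2])
margins = correct_class_margins(toy_logits, toy_labels)
print(margins)
```

Because the logits files are large, this is best applied to slices of a memory-mapped array rather than to a whole file. Values recomputed this way may differ slightly from the stored margins because of float16 rounding.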

We include an additional 10,000 models for each setting that we used for evaluation; the total number of models in each matrix is therefore M, as indicated in the table above, plus 10,000.

Datamodels

All estimated datamodels for each split (train or test) are provided as a dictionary in a .pt file (load with torch.load):

/{DATASET}/datamodels/train_{PCT}pct.pt
/{DATASET}/datamodels/test_{PCT}pct.pt

Each dictionary contains:

  • weight: matrix of shape N_train x N, where N is either N_train or N_test depending on the group of target examples
  • bias: vector of length N, corresponding to biases for each datamodel
  • lam: vector of length N, regularization λ chosen by CV for each datamodel
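
A datamodel is a linear function of the training mask, so the dictionary above can be used to predict model outputs for any training subset. The sketch below illustrates this on a tiny synthetic dictionary with the same keys; it assumes the loaded tensors have been converted to NumPy arrays (for example with `.numpy()`), and the function name `predict_margins` is illustrative.

```python
import numpy as np

def predict_margins(datamodels, train_mask):
    """Predict outputs for all N target examples from one or more training masks.

    train_mask: [N_train] or [M x N_train] boolean array.
    """
    weight = np.asarray(datamodels["weight"], dtype=np.float32)  # [N_train x N]
    bias = np.asarray(datamodels["bias"], dtype=np.float32)      # [N]
    mask = np.asarray(train_mask, dtype=np.float32)
    return mask @ weight + bias

# Synthetic stand-in for a loaded .pt dictionary: N_train = 3, N = 2.
toy_datamodels = {
    "weight": np.array([[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]),
    "bias": np.array([0.1, -0.2]),
    "lam": np.array([0.01, 0.01]),
}
pred = predict_margins(toy_datamodels, np.array([True, False, True]))
print(pred)
```

Column j of weight holds the datamodel for target example j, so `weight[:, j]` can also be inspected directly to see which training examples most affect that target.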

Downloading

We make all of our data available via Amazon S3. Total sizes of the training data files are as follows:

Dataset    α (%)   masks, margins (GB)   logits (GB)
CIFAR-10   10      245                   1688
CIFAR-10   20      123                   849
CIFAR-10   50      49                    346
CIFAR-10   75      98                    682
FMoW       20      25.4                  -
FMoW       50      10.6                  -
FMoW       75      21.2                  -

Total sizes of datamodels data (the model weights) are 16.9 GB for CIFAR-10 and 0.75 GB for FMoW.

API

You can download the files using the AWS CLI with the requester pays option as follows (replacing the fields {...} as appropriate):

aws s3api get-object --bucket madrylab-datamodels \
                     --key {DATASET}/data/{SPLIT}_{DATA_TYPE}_{PCT}pct.npy \
                     --request-payer requester \
                     [OUT_FILE]

For example, to retrieve the test set margins for CIFAR-10 models trained on 50% subsets, use:

aws s3api get-object --bucket madrylab-datamodels \
                     --key cifar/data/test_margins_50pct.npy \
                     --request-payer requester \
                     test_margins_50pct.npy
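
If you need to fetch several files, it may help to generate these commands programmatically. The following is a small sketch that only builds the argument list for the command shown above; the helper names `s3_key` and `download_command` are illustrative, and nothing is downloaded unless you pass the result to `subprocess.run` yourself (which incurs AWS charges).

```python
BUCKET = "madrylab-datamodels"

def s3_key(dataset, split, data_type, pct):
    """Build the object key, e.g. cifar/data/test_margins_50pct.npy."""
    return f"{dataset}/data/{split}_{data_type}_{pct}pct.npy"

def download_command(dataset, split, data_type, pct, out_file=None):
    """Return the aws s3api get-object command as a list of arguments."""
    key = s3_key(dataset, split, data_type, pct)
    if out_file is None:
        out_file = key.rsplit("/", 1)[-1]  # default to the bare file name
    return ["aws", "s3api", "get-object",
            "--bucket", BUCKET,
            "--key", key,
            "--request-payer", "requester",
            out_file]

cmd = download_command("cifar", "test", "margins", 50)
print(" ".join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```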

Pricing

The total data transfer fee (from AWS to internet) for all of the data is around $374 (= 4155 GB x 0.09 USD per GB).

If you download everything except the logits (which is sufficient to reproduce all of our analysis), the fee is around $53.
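
These estimates follow directly from the file sizes listed in the Downloading section. The short sketch below reproduces the arithmetic, assuming the flat rate of 0.09 USD per GB quoted above (actual AWS pricing may vary by region and volume); the variable and function names are illustrative.

```python
PRICE_PER_GB = 0.09  # USD per GB, the rate assumed in the text above

# Sizes in GB, keyed by (dataset, alpha in %), copied from the size table.
MASKS_MARGINS_GB = {("CIFAR-10", 10): 245, ("CIFAR-10", 20): 123,
                    ("CIFAR-10", 50): 49, ("CIFAR-10", 75): 98,
                    ("FMoW", 20): 25.4, ("FMoW", 50): 10.6, ("FMoW", 75): 21.2}
LOGITS_GB = {("CIFAR-10", 10): 1688, ("CIFAR-10", 20): 849,
             ("CIFAR-10", 50): 346, ("CIFAR-10", 75): 682}
DATAMODELS_GB = {"CIFAR-10": 16.9, "FMoW": 0.75}

def transfer_cost(include_logits=True):
    """Return (total size in GB, estimated transfer fee in USD)."""
    total_gb = sum(MASKS_MARGINS_GB.values()) + sum(DATAMODELS_GB.values())
    if include_logits:
        total_gb += sum(LOGITS_GB.values())
    return total_gb, total_gb * PRICE_PER_GB

gb_all, cost_all = transfer_cost(include_logits=True)
gb_small, cost_small = transfer_cost(include_logits=False)
print(f"everything: {gb_all:.0f} GB, about ${cost_all:.0f}")
print(f"without logits: {gb_small:.0f} GB, about ${cost_small:.0f}")
```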

Loading data

The data matrices are in numpy array format (.npy). As some of these are quite large, you can read small segments without reading the entire file into memory by additionally specifying the mmap_mode argument in np.load:

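import numpy as np

# Memory-map the arrays so that only the slices you index are read from disk.
X = np.load('train_masks_10pct.npy', mmap_mode='r')   # [M x N_train] boolean
Y = np.load('test_margins_10pct.npy', mmap_mode='r')  # [M x N_test] np.float16
...
# Use segments, e.g., X[:100], as appropriate.
# Run regress(X, Y[:]) using your choice of estimation algorithm
# (regress is a placeholder, not a provided function).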
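
For statistics that need every row, one option is to stream over the memory-mapped array in row chunks so that only one chunk is in memory at a time. The sketch below shows the pattern on a small synthetic file standing in for a real margins matrix; the function name `column_means_in_chunks` and the chunk size are illustrative choices.

```python
import os
import tempfile

import numpy as np

def column_means_in_chunks(path, chunk_rows=1024):
    """Compute per-column means of a 2-D .npy file without loading it whole."""
    arr = np.load(path, mmap_mode="r")
    total = np.zeros(arr.shape[1], dtype=np.float64)
    for start in range(0, arr.shape[0], chunk_rows):
        # Only this slice is read from disk; upcast float16 to avoid overflow.
        chunk = np.asarray(arr[start:start + chunk_rows], dtype=np.float64)
        total += chunk.sum(axis=0)
    return total / arr.shape[0]

# Synthetic stand-in for a margins file: 4 "models" x 3 "examples".
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_margins.npy")
    np.save(path, np.arange(12, dtype=np.float16).reshape(4, 3))
    means = column_means_in_chunks(path, chunk_rows=2)

print(means)
```

For the real files, a larger chunk_rows (limited by available memory) reduces the number of disk reads.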

FMoW data

We use a customized version of the FMoW dataset from WILDS (derived from this original dataset) that restricts the year of the training set to 2012. Our code is adapted from here.

To use the dataset, first install WILDS using:

pip install wilds

(see here for more detailed instructions).

In our paper, we only use the in-distribution training and test splits in our analysis (the original version from WILDS also has out-of-distribution as well as validation splits). Our dataset splits can be constructed as follows and used like a PyTorch dataset:

from torchvision import transforms

from fmow import FMoWDataset

# Point root_dir at your local copy of the WILDS data.
ds = FMoWDataset(root_dir='/mnt/nfs/datasets/wilds/',
                 split_scheme='time_after_2016')

transform_steps = [
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
]
transform = transforms.Compose(transform_steps)

ds_train = ds.get_subset('train', transform=transform)
ds_test = ds.get_subset('id_test', transform=transform)

The columns of the data matrices described above are ordered according to the default ordering of examples given by the above constructors.
