Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy

Overview

lbs-data

Motivation

Location data is collected from the public by private firms via mobile devices. Can this data also be used to serve the public good while preserving privacy? Can we realize this goal by generating synthetic data for use instead of the real data? The synthetic data would need to balance utility and privacy.

Overview

What:

This project uses location based services (LBS) data provided by a location intelligence company in order to train a RNN model to generate synthetic location data. The goal is for the synthetic data to maintain the properties of the real data, at the individual and aggregate levels, in order to retain its utility. At the same time, the synthetic data should sufficiently differ from the real data at the individual level, in order to preserve user privacy.

Furthermore, the system uses home and work areas as labels and inputs in order to generate location data for synthetic users with the given home and work areas.
This addresses the issue of limited sample sizes. Population data, such as census data, can be used to create the input necessary to output a synthetic location dataset that represents the true population in size and distribution.

Data

/data/

ACS data

data/ACS/ma_acs_5_year_census_tract_2018/

Population data is sourced from the 2018 American Community Survey 5-year estimates.

LBS data

/data/mount/

Privately stored on a remote server.

Geography and time period

  • Geography: The region of study is limited to 3 counties surrounding Boston, MA.
  • Time period: The training and output data is for the first 5-day workweek of May 2018.

Data representation

The LBS data are provided as rows.

device ID, latitude, longitude, timestamp, dwelltime

The data are transformed into "stay trajectories", which are sequences where each index of a sequence represents a 1-hour time interval. Each stay trajectory represents the data for one user (device ID). The value at that index represents the location/area (census tract) where the user spent the most time during that 1-hour interval.

e.g.

[A,B,D,C,A,A,A,NULL,B...]

Where each letter represents a location. There are null values when no location data is reported in the time interval.

home and work locations are inferred for each user stay trajectory. stay trajectories are prefixed with the home and work locations. This home, work prefixes then serve as labels.

[home,work,A,B,D,C,A,A,A,NULL,B...]

Where home,work values are also elements (frequently) occuring in their associated stay trajectory (e.g. home=A).

These sequences are used to train the model and are also output by the model.

RNN

The RNN model developed in this work is meant to be simple and replicable. It was implemented via the open source textgenrnn library. https://github.com/minimaxir/textgenrnn.

Many models (>70) are trained with a variety of hyper parameter values. The models are each trained on the same training data and then use the same input (home, work labels) to generate output synthetic data. The output is evalued via a variety of utility and privacy metrics in order to determine the best model/parameters.

Pipeline

Preprocessing

Define geography / shapefiles

./shapefile_shaper.ipynb

Our study uses 3 counties surrounding Boston, MA: Middlesex, Norfolk, Suffolk counties.

shapefile_shaper prunes MA shapefiles for this geography.

Output files are in ./shapefiles/ma/

Census tracts are used as "areas"/locations in stay trajectories.

Data filtering

./preprocess_filtering.ipynb

The LBS data is sparse. Some users report just a few datapoints, while other users report many. In order to confidently infer home and work locations, and learn patterns, we only include data from devices with sufficient reporting.

./preprocess_filtering.ipynb filters the data accordingly. It pokes the data to try to determine what the right level of filtering is. It outputs saved files with filtered data. Namely, it saves a datafile with LBS data from devices that reported at least 3 days and 3 nights of data during the 1 workweek of the study period. This is the pruned dataset used in the following work.

Attach areas

/attach_areas.ipynb

Census areas are attached to LBS data rows.

Home, work inference

./infer_home_work.ipynb

Defines functions to infer home and work locations (census tracts ) for each device user, based on their LBS data. The home location is where the user spends most time in nighttime hours. The "work" location is where the user spends the most time in workday hours. These locations can be the same.

This file helps determine good hours to use for nighttime hours. Once the functions are defined, they are used to evaluate the data representativeness by comparing the inferred population statistics to ACS 2018 census data.

Saves a mapping of LBS user IDS to the inferred home,work locations.

Stay trajectories setup

./trajectory_synthesis/trajectory_synthesis_notebook.ipynb

Transforms preprocessed LBS data into prefixed stay trajectories.

And outputs files for model training, data generation, and comparison.

Note: for the purposes of model training and data generation, the area tokens within stay trajectories can be arbitrary. What is important for the model’s success is the relationship between them. In order to save the stay trajectories in this repository yet keep real data private, we do the following. We map real census areas to integers, and map areas in stay trajectories to the integers representing the areas. We use the transformed stay trajectories for model training and data generation. The mapping between real census areas and their integer representations is kept private. We can then map the integers in stay trajectories back to the real areas they represent when needed (such as when evaluating trip distance metrics).

Output files:

./data/relabeled_trajectories_1_workweek.txt: D: Full training set of 22704 trajectories

./data/relabeled_trajectories_1_workweek_prefixes_to_counts.json: Maps D home,work label prefixes to counts

./data/relabeled_trajectories_1_workweek_sample_2000.txt: S: Random sample of 2000 trajectories from D.

./data/relabeled_trajectories_1_workweek_prefixes_to_counts_sample_2000.json: Maps S home,work label prefixes to counts

  • This is used as the input for data generation so that the output sythetic sample, S', has a home,work label pair distribution that matches S.

Model training and data generation

./trajectory_synthesis/textgenrnn_generator/

Models with a variety of hyperparameter combinations were trained and then used to generate a synthetic sample.

The files model_trainer.py and generator.py are the templates for the scripts used to train and generate.

The model (hyper)parameter combinations were tracked in a spreadsheet. ./trajectory_synthesis/textgenrnn_generator/textgenrnn_model_parameters_.csv

Evaluation

./trajectory_synthesis/evaluation/evaluate_rnn.ipynb

A variety of utility and privacy evaluation tools and metrics were developed. Models were evaluated by their synthetic data outputs (S'). This was done in ./trajectory_synthesis/evaluation/evaluate_rnn.ipynb. The best model (i.e. best parameters) was determined by these evaluations. The results for this model are captured in trajectory_synthesis/evaluation/final_eval_plots.ipynb.

Owner
Alex
Systems Architect, product oriented Engineer, Hacker for the social good, Math Nerd that loves solving hard problems and working with great people.
Alex
Code artifacts for the submission "Mind the Gap! A Study on the Transferability of Virtual vs Physical-world Testing of Autonomous Driving Systems"

Code Artifacts Code artifacts for the submission "Mind the Gap! A Study on the Transferability of Virtual vs Physical-world Testing of Autonomous Driv

Andrea Stocco 2 Aug 24, 2022
Code to replicate the key results from Exploring the Limits of Out-of-Distribution Detection

Exploring the Limits of Out-of-Distribution Detection In this repository we're collecting replications for the key experiments in the Exploring the Li

Stanislav Fort 35 Jan 03, 2023
Semi-Supervised Semantic Segmentation with Pixel-Level Contrastive Learning from a Class-wise Memory Bank

This repository provides the official code for replicating experiments from the paper: Semi-Supervised Semantic Segmentation with Pixel-Level Contrast

Iñigo Alonso Ruiz 58 Dec 15, 2022
Churn prediction

Churn-prediction Churn-prediction Data preprocessing:: Label encoder is used to normalize the categorical variable Data Transformation:: For each data

1 Sep 28, 2022
The code of NeurIPS 2021 paper "Scalable Rule-Based Representation Learning for Interpretable Classification".

Rule-based Representation Learner This is a PyTorch implementation of Rule-based Representation Learner (RRL) as described in NeurIPS 2021 paper: Scal

Zhuo Wang 53 Dec 17, 2022
Implementation of "Generalizable Neural Performer: Learning Robust Radiance Fields for Human Novel View Synthesis"

Generalizable Neural Performer: Learning Robust Radiance Fields for Human Novel View Synthesis Abstract: This work targets at using a general deep lea

163 Dec 14, 2022
This repository is for Contrastive Embedding Distribution Refinement and Entropy-Aware Attention Network (CEDR)

CEDR This repository is for Contrastive Embedding Distribution Refinement and Entropy-Aware Attention Network (CEDR) introduced in the following paper

phoenix 3 Feb 27, 2022
Code for "Unsupervised Source Separation via Bayesian inference in the latent domain"

LQVAE-separation Code for "Unsupervised Source Separation via Bayesian inference in the latent domain" Paper Samples GT Compressed Separated Drums GT

Michele Mancusi 30 Oct 25, 2022
"Learning Free Gait Transition for Quadruped Robots vis Phase-Guided Controller"

PhaseGuidedControl The current version is developed based on the old version of RaiSim series, and possibly requires further modification. It will be

X-Mechanics 12 Oct 21, 2022
Contrastive Learning for Compact Single Image Dehazing, CVPR2021

AECR-Net Contrastive Learning for Compact Single Image Dehazing, CVPR2021. Official Pytorch based implementation. Paper arxiv Pytorch Version TODO: mo

glassy 253 Jan 01, 2023
Running AlphaFold2 (from ColabFold) in Azure Machine Learning

Running AlphaFold2 (from ColabFold) in Azure Machine Learning Colby T. Ford, Ph.D. Companion repository for Medium Post: How to predict many protein s

Colby T. Ford 3 Feb 18, 2022
Pytorch implementation of Nueral Style transfer

Nueral Style Transfer Pytorch implementation of Nueral style transfer algorithm , it is used to apply artistic styles to content images . Content is t

Abhinav 9 Oct 15, 2022
Fine-tuning StyleGAN2 for Cartoon Face Generation

Cartoon-StyleGAN 🙃 : Fine-tuning StyleGAN2 for Cartoon Face Generation Abstract Recent studies have shown remarkable success in the unsupervised imag

Jihye Back 520 Jan 04, 2023
Implementation of popular SOTA self-supervised learning algorithms as Fastai Callbacks.

Self Supervised Learning with Fastai Implementation of popular SOTA self-supervised learning algorithms as Fastai Callbacks. Install pip install self-

Kerem Turgutlu 276 Dec 23, 2022
Discord Multi Tool that focuses on design and easy usage

Multi-Tool-v1.0 Discord Multi Tool that focuses on design and easy usage Delete webhook Block all friends Spam webhook Modify webhook Webhook info Tok

Lodi#0001 24 May 23, 2022
Research using Cirq!

ReCirq Research using Cirq! This project contains modules for running quantum computing applications and experiments through Cirq and Quantum Engine.

quantumlib 230 Dec 29, 2022
The code for the NeurIPS 2021 paper "A Unified View of cGANs with and without Classifiers".

Energy-based Conditional Generative Adversarial Network (ECGAN) This is the code for the NeurIPS 2021 paper "A Unified View of cGANs with and without

sianchen 22 May 28, 2022
:fire: 2D and 3D Face alignment library build using pytorch

Face Recognition Detect facial landmarks from Python using the world's most accurate face alignment network, capable of detecting points in both 2D an

Adrian Bulat 6k Dec 31, 2022
Official code for paper Exemplar Based 3D Portrait Stylization.

3D-Portrait-Stylization This is the official code for the paper "Exemplar Based 3D Portrait Stylization". You can check the paper on our project websi

60 Dec 07, 2022
SpanNER: Named EntityRe-/Recognition as Span Prediction

SpanNER: Named EntityRe-/Recognition as Span Prediction Overview | Demo | Installation | Preprocessing | Prepare Models | Running | System Combination

NeuLab 104 Dec 17, 2022