Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data

Overview

SEDE

sede ci

SEDE (Stack Exchange Data Explorer) is new dataset for Text-to-SQL tasks with more than 12,000 SQL queries and their natural language description. It's based on a real usage of users from the Stack Exchange Data Explorer platform, which brings complexities and challenges never seen before in any other semantic parsing dataset like including complex nesting, dates manipulation, numeric and text manipulation, parameters, and most importantly: under-specification and hidden-assumptions.

Paper (NLP4Prog workshop at ACL2021): Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data.


sede sql

Setup Instructions

Create a new Python 3.7 virtual environment:

python3.7 -m venv .venv

Activate the virtual environment:

source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Add the project directory to python PATH:

export PYTHONPATH=/your/projects-directories/sede:$PYTHONPATH

One can run all commands by just running make command, or running them step by step by the following commands:

Run pylint:

make lint

Run black:

make black_check

Run tests (required JSQL running for this - please see "Running JSQLParser" chapter):

make unit_test

Add the virtual environment to Jupyter Notebook:

python3.7 -m ipykernel install --user --name=.venv

Now you can enter into Jupyter with the command jupyter notebook and when creating a new notebook you will need to choose the .venv environment.

Folders Navigation

  • src - source code
  • configs - contains configuration files for running experiments
  • data/sede - train/val/test sets of SEDE. Note - files with the _original suffix are the ones that we kept original as coming from SEDE without our fixes. See our paper for more details.
  • notebooks - some helper Jupyter notebooks.
  • stackexchange_schema - holds file that respresents the SEDE schema.

Running JSQLParser

Clone JSQLParser-as-a-Service project: git clone https://github.com/hirupert/jsqlparser-as-a-service.git

Enter the folder with cd jsqlparser-as-a-service

Build the JSQLParser-as-a-Service image using the following command: docker build -t jsqlparser-as-a-service .

Running the image inside a docker container in port 8079: docker run -d -p 8079:8079 jsqlparser-as-a-service

Test that the docker is running by running the following command:

curl --location --request POST 'http://localhost:8079/sqltojson' \
--header 'Content-Type: application/json' \
--data-raw '{
    "sql":"select salary from employees where salary < (select max(salary) from employees)"
}'

Training T5 model

Training SEDE:

python main_allennlp.py train configs/t5_text2sql_sede.jsonnet -s experiments/name_of_experiment --include-package src

Training Spider:

In order to run our model + Partial Components Match F1 metric on Spider dataset, one must download Spider dataset from here: https://yale-lily.github.io/spider and save it under data/spider folder inside the root project directory. After that, one can run the following command in order to train our model on Spider dataset:

python main_allennlp.py train configs/t5_text2sql_spider.jsonnet -s experiments/name_of_experiment --include-package src

Evaluation (SEDE)

Run evaluation on SEDE validation set with:

python main_allennlp.py evaluate experiments/name_of_experiment data/sede/val.jsonl --output-file experiments/name_of_experiment/val_predictions.sql --cuda-device 0 --batch-size 10 --include-package src

Run evaluation on SEDE test set with:

python main_allennlp.py evaluate experiments/name_of_experiment data/sede/test.jsonl --output-file experiments/name_of_experiment/test_predictions.sql --cuda-device 0 --batch-size 10 --include-package src

Note - In order to evaluate a trained model on Spider, one needs to replace the experiment name and the data path to: data/spider/dev.json.

Inference (SEDE)

Predict SQL queries on SEDE validation set with:

python main_allennlp.py predict experiments/name_of_experiment data/sede/val.jsonl --output-file experiments/name_of_experiment/val_predictions.sql --use-dataset-reader --predictor seq2seq2 --cuda-device 0 --batch-size 10 --include-package src

Predict SQL queries on SEDE test set with:

python main_allennlp.py predict experiments/name_of_experiment data/sede/test.jsonl --output-file experiments/name_of_experiment/val_predictions.sql --use-dataset-reader --predictor seq2seq2 --cuda-device 0 --batch-size 10 --include-package src

Note - In order to run inference with a trained model on Spider (validation set), one needs to replace the experiment name and the data path to: data/spider/dev.json.

Acknowledgements

We thank Kevin Montrose and the rest of the Stack Exchange team for providing the raw query log.

Owner
Rupert.
Rupert.
FishNet: One Stage to Detect, Segmentation and Pose Estimation

FishNet FishNet: One Stage to Detect, Segmentation and Pose Estimation Introduction In this project, we combine target detection, instance segmentatio

1 Oct 05, 2022
DRIFT is a tool for Diachronic Analysis of Scientific Literature.

About DRIFT is a tool for Diachronic Analysis of Scientific Literature. The application offers user-friendly and customizable utilities for two modes:

Rajaswa Patil 108 Dec 12, 2022
A Closer Look at Structured Pruning for Neural Network Compression

A Closer Look at Structured Pruning for Neural Network Compression Code used to reproduce experiments in https://arxiv.org/abs/1810.04622. To prune, w

Bayesian and Neural Systems Group 140 Dec 05, 2022
Tutorials and implementations for "Self-normalizing networks"

Self-Normalizing Networks Tutorials and implementations for "Self-normalizing networks"(SNNs) as suggested by Klambauer et al. (arXiv pre-print). Vers

Institute of Bioinformatics, Johannes Kepler University Linz 1.6k Jan 07, 2023
This is the official implement of paper "ActionCLIP: A New Paradigm for Action Recognition"

This is an official pytorch implementation of ActionCLIP: A New Paradigm for Video Action Recognition [arXiv] Overview Content Prerequisites Data Prep

268 Jan 09, 2023
BABEL: Bodies, Action and Behavior with English Labels [CVPR 2021]

BABEL is a large dataset with language labels describing the actions being performed in mocap sequences. BABEL labels about 43 hours of mocap sequences from AMASS [1] with action labels.

113 Dec 28, 2022
Accelerated NLP pipelines for fast inference on CPU and GPU. Built with Transformers, Optimum and ONNX Runtime.

Optimum Transformers Accelerated NLP pipelines for fast inference 🚀 on CPU and GPU. Built with 🤗 Transformers, Optimum and ONNX runtime. Installatio

Aleksey Korshuk 115 Dec 16, 2022
Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Learned Virtual View Visibility ICCV2021

Vis2Mesh This is the offical repository of the paper: Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Lear

71 Dec 25, 2022
Source code for the paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction" in ACL2021

PLOME:Pre-training with Misspelled Knowledge for Chinese Spelling Correction (ACL2021) This repository provides the code and data of the work in ACL20

197 Nov 26, 2022
PyTorch implementation of "VRT: A Video Restoration Transformer"

VRT: A Video Restoration Transformer Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, Luc Van Gool Computer

Jingyun Liang 837 Jan 09, 2023
Awesome Monocular 3D detection

Awesome Monocular 3D detection Paper list of 3D detetction, keep updating! Contents Paper List 2022 2021 2020 2019 2018 2017 2016 KITTI Results Paper

Zhikang Zou 184 Jan 04, 2023
Code for the Shortformer model, from the paper by Ofir Press, Noah A. Smith and Mike Lewis.

Shortformer This repository contains the code and the final checkpoint of the Shortformer model. This file explains how to run our experiments on the

Ofir Press 138 Apr 15, 2022
sense-py-AnishaBaishya created by GitHub Classroom

Compute Statistics Here we compute statistics for a bunch of numbers. This project uses the unittest framework to test functionality. Pass the tests T

1 Oct 21, 2021
[CVPR'21] DeepSurfels: Learning Online Appearance Fusion

DeepSurfels: Learning Online Appearance Fusion Paper | Video | Project Page This is the official implementation of the CVPR 2021 submission DeepSurfel

Online Reconstruction 52 Nov 14, 2022
Face Detection & Age Gender & Expression & Recognition

Face Detection & Age Gender & Expression & Recognition

Sajjad Ayobi 188 Dec 28, 2022
Equivariant CNNs for the sphere and SO(3) implemented in PyTorch

Equivariant CNNs for the sphere and SO(3) implemented in PyTorch

Jonas Köhler 893 Dec 28, 2022
mlpack: a scalable C++ machine learning library --

a fast, flexible machine learning library Home | Documentation | Doxygen | Community | Help | IRC Chat Download: current stable version (3.4.2) mlpack

mlpack 4.2k Jan 09, 2023
PyTorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision.

PyTorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision @misc{CV2018, author = {Donny You ( Donny You 40 Sep 14, 2022

A simple and extensible library to create Bayesian Neural Network layers on PyTorch.

Blitz - Bayesian Layers in Torch Zoo BLiTZ is a simple and extensible library to create Bayesian Neural Network Layers (based on whats proposed in Wei

Pi Esposito 722 Jan 08, 2023
PyTorch Implementation of Exploring Explicit Domain Supervision for Latent Space Disentanglement in Unpaired Image-to-Image Translation.

DosGAN-PyTorch PyTorch Implementation of Exploring Explicit Domain Supervision for Latent Space Disentanglement in Unpaired Image-to-Image Translation

40 Nov 30, 2022