The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

Last update: Jan 06, 2023

Overview

PRIMER

PRIMER is a pre-trained model for multi-document representation with a focus on summarization. It reduces the need for dataset-specific architectures and large amounts of labeled fine-tuning data. In extensive experiments on 6 multi-document summarization datasets from 3 different domains, covering zero-shot, few-shot and fully supervised settings, PRIMER outperforms current state-of-the-art models in most of these settings by large margins.

Set up

Create a new virtual environment by

conda create --name primer python=3.7
conda activate primer
conda install cudatoolkit=10.0

Install Longformer by

pip install git+https://github.com/allenai/longformer.git

Install the requirements for the summarization scripts and data generation scripts by

pip install -r requirements.txt

Usage of PRIMER

Download the pre-trained PRIMER model here to ./PRIMER_model
Load the tokenizer and model by

from transformers import AutoTokenizer
from longformer import LongformerEncoderDecoderForConditionalGeneration
from longformer import LongformerEncoderDecoderConfig

# Load the tokenizer, config and weights from the downloaded checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained('./PRIMER_model/')
config = LongformerEncoderDecoderConfig.from_pretrained('./PRIMER_model/')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(
    './PRIMER_model/', config=config)

Make sure the documents in the input are separated with <doc-sep>.
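As an illustration of the separator requirement, here is a minimal sketch that joins a cluster of documents into one input string. The helper name `join_documents`, the whitespace normalization, and the single spaces around the separator are assumptions made for this example, not part of the PRIMER code.

```python
DOC_SEP = "<doc-sep>"  # separator token named in this README


def join_documents(docs):
    """Join the documents of one cluster into a single input string.

    Hypothetical helper: collapses runs of whitespace inside each document,
    drops empty documents, and puts DOC_SEP between the remaining ones.
    """
    cleaned = [" ".join(doc.split()) for doc in docs if doc and doc.strip()]
    return f" {DOC_SEP} ".join(cleaned)


if __name__ == "__main__":
    cluster = ["First article text.", "Second article\ntext.", "   "]
    print(join_documents(cluster))
```

The resulting string can then be passed to the tokenizer loaded above.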

Summarization Scripts

You can use script/primer_main.py to pre-train, train and test PRIMER, and script/compared_model_main.py to train and test BART/PEGASUS/LED.

Pre-training Data Generation

Newshead: we crawled the Newshead dataset using the original code and cleaned up the crawled data. The final Newshead dataset can be found here.

You can use utils/pretrain_preprocess.py to generate pre-training data.

Generate data with scores and entities with --mode compute_all_scores
Generate pre-training data with --mode pretraining_data_with_score:
- Pegasus: --strategy greedy --metric pegasus_score
- Entity_Pyramid: --strategy greedy_entity_pyramid --metric pyramid_rouge
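When scripting both variants, the flag combinations listed above can be assembled programmatically. The sketch below only builds the argument lists and does not run anything. The helper name `build_pretrain_cmd` and the `VARIANTS` table are hypothetical, and the script may well require further arguments (for example input and output paths) that are not listed in this README.

```python
import sys

SCRIPT = "utils/pretrain_preprocess.py"

# Strategy/metric pairs exactly as listed in this README.
VARIANTS = {
    "pegasus": ("greedy", "pegasus_score"),
    "entity_pyramid": ("greedy_entity_pyramid", "pyramid_rouge"),
}


def build_pretrain_cmd(variant, python=sys.executable):
    """Return the argv list for one pre-training data variant.

    Hypothetical helper: any additional arguments the script needs
    must be appended by the caller.
    """
    strategy, metric = VARIANTS[variant]
    return [
        python, SCRIPT,
        "--mode", "pretraining_data_with_score",
        "--strategy", strategy,
        "--metric", metric,
    ]


if __name__ == "__main__":
    for name in VARIANTS:
        print(" ".join(build_pretrain_cmd(name, python="python")))
```

The returned list can be handed to `subprocess.run` once the remaining arguments are known.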

Datasets

Multi-News and Multi-XScience: the data is downloaded automatically from Hugging Face.
WCEP-10: the preprocessed version can be found here
Wikisum: we only use a small subset for few-shot training (10/100 examples) and testing (3200 examples). The subset we used can be found here. Note that train.pt and valid.pt contain significantly more examples than we use in a single run: in the few-shot setting we sample 10/100 examples multiple times, so we need a large pool to sample from.
DUC2003/2004: you need to apply for access by following the instructions.
arXiv: you can find the data we used in this repo
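The repeated few-shot sampling described for Wikisum can be sketched as drawing several fixed-size subsets from a larger pool, one per seeded run. This is not the authors' sampling code: the function name `sample_few_shot_subsets`, the number of runs, and the seeding scheme are assumptions chosen for illustration.

```python
import random


def sample_few_shot_subsets(pool, k, num_runs, base_seed=0):
    """Draw `num_runs` subsets of size `k` from `pool`, one per seed.

    Hypothetical helper: each run uses its own seeded generator, so the
    subsets are reproducible; sampling is without replacement within a run.
    """
    if k > len(pool):
        raise ValueError("pool is smaller than the requested subset size")
    subsets = []
    for run in range(num_runs):
        rng = random.Random(base_seed + run)
        subsets.append(rng.sample(pool, k))
    return subsets


if __name__ == "__main__":
    pool = list(range(1000))  # stand-in for the examples in train.pt
    for subset in sample_few_shot_subsets(pool, k=10, num_runs=5):
        print(sorted(subset))
```

A pool much larger than `k` keeps the subsets from different runs from overlapping heavily, which is why the released files hold more examples than any single run uses.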
