Benchmark for evaluating open-ended generation

Overview

OpenMEVA

Contributed by Jian Guan and Zhexin Zhang. Thanks to Jiaxin Wen for debugging.

OpenMEVA is a benchmark for evaluating open-ended story generation metrics, described in the paper OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics (ACL 2021 Long Paper). Please refer to the Paper List for more information about Open-eNded Language Generation tasks. OpenMEVA also provides an open-source and extensible toolkit for metric implementation, evaluation, comparison, and analysis, as well as data perturbation techniques that help generate large numbers of customized test cases. We expect the toolkit to enable fast development of automatic metrics.

Introduction for Language Generation Evaluation

Since human evaluation is time-consuming, expensive, and difficult to reproduce, the community commonly uses automatic metrics for evaluation. We roughly divide existing metrics as follows:

  • Previous studies in conditional language generation tasks (e.g., machine translation) have developed several successful referenced metrics, which roughly quantify the lexical overlap (e.g., BLEU) or semantic entailment (e.g., BERTScore) between a generated sample and the reference.
  • Referenced metrics correlate poorly with human judgments when evaluating open-ended language generation. Specifically, a generated sample can be reasonable if it is coherent with the given input and self-consistent within its own context, without necessarily being similar to the reference either literally or semantically. To address this one-to-many issue, unreferenced metrics (e.g., UNION) have been proposed to measure the quality of a generated sample without any reference.
  • Some researchers also propose hybrid metrics, which combine referenced and unreferenced metrics, usually by averaging two individual metric scores (e.g., RUBER) or by learning from human preference (e.g., ADEM). However, ADEM is reported to lack generalization and robustness when human annotation is limited.

Existing generation models still fall far short of the human ability to generate reasonable texts, particularly for open-ended language generation tasks such as story generation. One important factor that hinders this research is the lack of powerful metrics for measuring generation quality. Therefore, we propose OpenMEVA as the standard paradigm for measuring the progress of metrics.

Install

Clone the repository from our GitHub page (don't forget to star us!):

git clone https://github.com/thu-coai/OpenMEVA.git

Then install all the requirements:

pip install -r requirements.txt

Then install the package with

python setup.py install

If you also want to modify the code, run this:

python setup.py develop

Toolkit

I. Metrics Interface

1. Metric List

We publish the standard implementation for the following metrics:

2. Usage

It is handy to construct a metric object and use it to evaluate given examples:

from eva.bleu import BLEU
metric = BLEU()

# for more information about the metric
print(metric.info)

# data is a list of dictionaries [{"context": ..., "candidate":..., "reference": ...}]
print(metric.compute(data))

The Python file test.py demonstrates how to access the API.

These metrics are not exhaustive; they are a starting point for further metric research. We welcome pull requests for other metrics, which only require implementing three methods: __init__, info, and compute.
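As a rough illustration of the three-method interface, here is a minimal sketch of a custom metric. The class name LengthRatio, the shape of the info dictionary, and the returned {name: [scores]} structure are all assumptions made for this example; check the metrics shipped in the repository for the actual conventions before opening a pull request.

```python
class LengthRatio:
    """Hypothetical referenced metric: candidate length / reference length."""

    def __init__(self, name="length_ratio"):
        # Assumption: a metric only needs a name to be constructed.
        self.name = name

    @property
    def info(self):
        # Assumption: `info` is read like an attribute (print(metric.info)).
        return {
            "metric name": self.name,
            "description": "ratio of candidate to reference length in tokens",
        }

    def compute(self, data):
        # data: [{"context": ..., "candidate": ..., "reference": ...}, ...]
        scores = []
        for example in data:
            cand_len = len(example["candidate"].split())
            ref_len = len(example["reference"].split())
            # Guard against empty references.
            scores.append(cand_len / ref_len if ref_len else 0.0)
        # Assumption: one score per example, keyed by metric name.
        return {self.name: scores}


data = [
    {
        "context": "Tom went to the park.",
        "candidate": "He played football there.",
        "reference": "He met his friends and played football.",
    },
]
metric = LengthRatio()
result = metric.compute(data)
print(metric.info)
print(result)
```

A toy metric like this is only useful for checking that the surrounding evaluation code runs end to end.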

3. Training Learnable Metrics

Execute the following commands to train the learnable metrics:

cd ./eva/model

# training language model for computing forward perplexity
bash ./run_language_modeling.sh

# training the unreferenced model for computing RUBER (RNN version)
bash ./run_ruber_unrefer.sh

# training the unreferenced model for computing RUBER (BERT version)
bash ./run_ruber_unrefer_bert.sh

# training the model for computing UNION
bash ./run_union.sh

II. Evaluating Human Scores

The Python file test.py also includes detailed instructions for accessing the API for evaluating human scores.

1. Constructing

from eva.heva import Heva

# list of all possible human scores (int/float/str).
all_possible_score_list = [1,2,3,4,5]

# construct an object for following evaluation
heva = Heva(all_possible_score_list)

2. Consistency of human scores

# list of human score list, each row includes all the human scores for an example
human_score_list = [[1,3,2], [1,3,3], [2,3,1], ...]

print(heva.consistency(human_score_list))
# {"Fleiss's kappa": ..., "ICC correlation": ..., "Kendall-w":..., "krippendorff's alpha":...}
# the results include the correlation and the p-value of the significance test.
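For readers unfamiliar with the agreement statistics listed above, the following is a minimal, dependency-free sketch of the textbook Fleiss' kappa formula. It is not taken from the Heva implementation, which may differ in details; the function name fleiss_kappa is chosen for this example only.

```python
from collections import Counter


def fleiss_kappa(human_score_list, all_possible_score_list):
    """Textbook Fleiss' kappa; every row must hold the same number of ratings."""
    n_items = len(human_score_list)
    n_raters = len(human_score_list[0])
    counts = [Counter(row) for row in human_score_list]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[s] ** 2 for s in all_possible_score_list) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each score.
    proportions = [
        sum(c[s] for c in counts) / (n_items * n_raters)
        for s in all_possible_score_list
    ]
    p_expected = sum(p ** 2 for p in proportions)

    # Undefined (division by zero) if every rating falls in a single category.
    return (p_observed - p_expected) / (1 - p_expected)


scores = [[1, 3, 2], [1, 3, 3], [2, 3, 1]]
kappa = fleiss_kappa(scores, [1, 2, 3, 4, 5])
print(kappa)
```

Values near 1 indicate strong agreement, values near 0 indicate agreement at chance level, and negative values indicate agreement below chance.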

3. Mean test for scores of examples from different sources

# list of metric scores (float)
metric_score_1, metric_score_2 = [3.2, 2.4, 3.1,...], [3.5, 1.2, 2.3, ...]

# T-test for the means of two independent samples of scores.
print(heva.mean_test(metric_score_1, metric_score_2))
# {"t-statistic": ..., "p-value": ...}
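To make the reported t-statistic concrete, here is a small sketch of the pooled-variance (Student's) two-sample t-statistic using only the standard library. Whether mean_test uses the pooled or the unequal-variance (Welch) variant is not stated here, so treat the choice below as an assumption; the p-value is omitted because it needs the t-distribution.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Student's t-statistic for two independent samples (equal variances assumed)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled estimate of the common variance (sample variances use n - 1).
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


metric_score_1 = [3.2, 2.4, 3.1]
metric_score_2 = [3.5, 1.2, 2.3]
t_stat = pooled_t_statistic(metric_score_1, metric_score_2)
print(t_stat)
```

A t-statistic close to zero means the two sample means are hard to tell apart relative to their spread.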

4. Distribution of human scores

# list of human scores (float)
human_score = [2.0, 4.2, 1.2, 4.9, 2.6, 3.1, 4.0, 1.5,...]

# path for saving the figure of distribution
figure_path = "./figure"

# indicating the source of the annotated examples. default: ""
model_name = "gpt"

# plot the figure of distribution of human scores
heva.save_distribution_figure(score=human_score, save_path=figure_path, model_name=model_name, ymin=0, ymax=50)

5. Correlation between human and metric scores

# list of human scores (float)
human_score = [2.0, 4.2, 1.2, 4.9, 2.6, 3.1, 4.0, 1.5,...]

# list of metric scores (float)
metric_score = [3.2, 2.4, 3.1, 3.5, 1.2, 2.3, 3.5, 1.1,...]

# computing correlation
print(heva.correlation(metric_score, human_score))

# path for saving the correlation figure
figure_path = "./figure"

# indicating the source of the metric scores. default: ""
metric_name = "bleu"

# plot the figure of metric score vs. human scores
heva.save_correlation_figure(human_score, metric_score, save_path=figure_path, metric_name=metric_name)
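As background for the correlation step, the sketch below computes Pearson and Spearman coefficients in plain Python. Which coefficients heva.correlation actually reports is not shown here, so this only illustrates the standard definitions; the helper names are made up for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; undefined if either list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human_score = [2.0, 4.2, 1.2, 4.9, 2.6, 3.1, 4.0, 1.5]
metric_score = [3.2, 2.4, 3.1, 3.5, 1.2, 2.3, 3.5, 1.1]
print(pearson(metric_score, human_score), spearman(metric_score, human_score))
```

Spearman only depends on the ordering of the scores, which makes it less sensitive than Pearson to the scale a metric happens to use.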

III. Perturbation Techniques

1. Perturbation List

We provide perturbation techniques in the following aspects to create large-scale test cases for evaluating the comprehensive capabilities of metrics:

  • Lexical repetition

    • Repeating n-grams or sentences:

      He stepped on the stage and stepped on the stage.
  • Semantic repetition:

    • Repeating sentences with paraphrases by back translation:

      He has been from Chicago to Florida. He moved to Florida from Chicago.

  • Character behavior:

    • Reordering the subject and object of a sentence:

      Lars looked at the girl with desire. → the girl looked at Lars with desire.
    • Substituting the personal pronouns referring to other characters:

      her mother took them to ... → their mother took her to ...
  • Common sense:

    • Substituting the head or tail entities in a commonsense triple of ConceptNet:

      Martha puts her dinner into the oven. She lays down for a quick nap. She oversleeps and runs into the kitchen (→ garden) to take out her burnt dinner.
  • Consistency:

    • Inserting or Deleting negated words or prefixes:

      She had (→ did not have) money to get vaccinated. She had a flu shot ...
      She agreed (→ disagreed) to get vaccinated.
    • Substituting words with antonyms:

      She is happy (→ upset) that she had a great time ...
  • Coherence:

    • Substituting words, phrases or sentences:

      Christmas was very soon. Kelly wanted to put up the Christmas tree. (→ Eventually it went into remission.)
  • Causal Relationship:

    • Reordering the cause and effect:

      the sky was clear so he could see clearly the boat. → he could see clearly the boat so the sky was clear.
    • Substituting the causality-related words randomly:

      the sky was clear so (→ because) he could see clearly the boat.
  • Temporal Relationship:

    • Reordering two sequential events:

      I eat one bite. Then I was no longer hungry. → I was no longer hungry. Then I eat one bite.
    • Substituting the time-related words:

      After (→ Before) eating one bite I was no longer hungry.
  • Synonym:

    • Substituting a word with its synonym:

      I just purchased (→ bought) my uniforms.
  • Paraphrase:

    • Substituting a sentence with its paraphrase by back translation:

      Her dog doesn't shiver anymore. (→ Her dog stops shaking.)
  • Punctuation:

    • Inserting or Deleting inessential punctuation mark:

      Eventually, (→ Eventually) he became very hungry.
  • Contraction:

    • Contracting or Expanding contraction:

      I’ll (→ I will) have to keep waiting.
  • Typo:

    • Swapping two adjacent characters:

      that orange (→ ornage) broke her nose.
    • Repeating or Deleting a character:

      that orange (→ orannge) broke her nose.
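The two typo perturbations listed last can be sketched in a few lines. The helper names below are hypothetical and the positions are passed explicitly so the output is reproducible; the toolkit's own typo perturbation may choose positions differently.

```python
def swap_adjacent_chars(word, index):
    """Swap the characters at positions index and index + 1."""
    if not 0 <= index < len(word) - 1:
        raise ValueError("index must point at a character that has a successor")
    chars = list(word)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def repeat_char(word, index):
    """Duplicate the character at position index."""
    if not 0 <= index < len(word):
        raise ValueError("index out of range")
    return word[: index + 1] + word[index] + word[index + 1:]


def replace_word(sentence, target, perturbed):
    """Replace every whitespace-separated token equal to target."""
    return " ".join(perturbed if tok == target else tok for tok in sentence.split())


sentence = "that orange broke her nose."
swapped = replace_word(sentence, "orange", swap_adjacent_chars("orange", 2))
repeated = replace_word(sentence, "orange", repeat_char("orange", 3))
print(swapped)   # that ornage broke her nose.
print(repeated)  # that orannge broke her nose.
```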

2. Usage

It is handy to construct a perturbation object and use it to perturb given examples:

from eva.perturb.perturb import *
custom_name = "story"
method = add_typos(custom_name)

# data is a list of dictionaries [{"id":0, "ipt": ..., "truth":...}, ...]
print(method.construct(data))
# the perturbed examples can be found under the directory "custom_name"

The Python file test_perturb.py demonstrates how to access the API.

You can download dependent files for some perturbation techniques by executing the following command:

cd ./eva/perturb
bash ./download.sh

You can also download them from THUCloud or Google Drive.

These perturbation techniques are not exhaustive; they are a starting point for further evaluation research. We welcome pull requests for other perturbation techniques, which only require implementing two methods: __init__ and construct.
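A new perturbation might look roughly like the sketch below. The class name ExpandContraction, the small contraction table, and the decision to perturb the "truth" field and return the results (rather than write them under the custom_name directory) are assumptions made to keep the example self-contained.

```python
class ExpandContraction:
    """Hypothetical perturbation that expands a few English contractions."""

    # Tiny illustrative table; a real perturbation would need a fuller list.
    CONTRACTIONS = {"I'll": "I will", "don't": "do not", "can't": "cannot"}

    def __init__(self, custom_name):
        # Assumption: the name identifies where results would be stored.
        self.custom_name = custom_name

    def construct(self, data):
        # data: [{"id": 0, "ipt": ..., "truth": ...}, ...]
        perturbed = []
        for example in data:
            tokens = example["truth"].split()
            new_tokens = [self.CONTRACTIONS.get(tok, tok) for tok in tokens]
            # Keep only the examples that were actually changed.
            if new_tokens != tokens:
                perturbed.append({**example, "truth": " ".join(new_tokens)})
        return perturbed


data = [
    {"id": 0, "ipt": "The bus was late.", "truth": "I'll have to keep waiting."},
    {"id": 1, "ipt": "The bus was late.", "truth": "She waited."},
]
method = ExpandContraction("story")
perturbed_examples = method.construct(data)
print(perturbed_examples)
```

Returning only the changed examples keeps original and perturbed stories paired by id, which is convenient when comparing metric scores before and after the perturbation.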

Note 📑 We adopt uda for back translation. We provide an example, eva/perturb/back_trans_data/story_bt.json, to indicate the format of the back translation result. You can also download the results for ROCStories and WritingPrompts from THUCloud or Google Drive.

Benchmark

I. Datasets

1. Machine-Generated Stories (MAGS) with manual annotation

We provide annotated stories from ROCStories (ROC) and WritingPrompts (WP). Some statistics are as follows:

Boxplot of annotation scores for each story source (Left: ROC, Right: WP):

2. Auto-Constructed Stories (ACTS)

We create large-scale test examples based on ROC and WP using the aforementioned perturbation techniques. ACTS includes examples for two test types: the discrimination test and the invariance test.

  • The discrimination test requires metrics to distinguish human-written positive examples from negative ones. We create each negative example by applying a perturbation within an individual aspect. We also select different positive examples targeted at the corresponding aspects. The table below shows the numbers of positive and negative examples in different aspects.

  • The invariance test expects the metric judgments to remain the same when we apply rationality-preserving perturbations, i.e., perturbations that have almost no influence on the quality of examples. The original examples can be either the human-written stories or the negative examples created in the discrimination test. The table below shows the numbers of original (also perturbed) positive and negative examples in different aspects.

3. Download & Data Instruction

You can download the whole dataset from THUCloud or Google Drive.

├── data
   └── `mags_data`
       ├── `mags_roc.json`	# sampled stories and corresponding human annotation.   
       ├── `mags_wp.json`		# sampled stories and corresponding human annotation.       
   └── `acts_data`
       ├── `roc`
              └── `roc_train_ipt.txt`	# input for training set
              └── `roc_train_opt.txt`	# output for training set
              └── `roc_valid_ipt.txt`	# input for validation set
              └── `roc_valid_opt.txt`	# output for validation set
              └── `roc_test_ipt.txt`	# input for test set
              └── `roc_test_opt.txt`	# output for test set
              └── `discrimination_test`                        
                 ├── `roc_lexical_rept.txt`
                 ├── `roc_lexical_rept_perturb.txt`										
                 ├── `roc_semantic_rept.txt`
                 ├── `roc_semantic_rept_perturb.txt`
                 ├── `roc_character.txt`
                 ├── `roc_character_perturb.txt`
                 ├── `roc_commonsense.txt`
                 ├── `roc_commonsense_perturb.txt`												
                 ├── `roc_coherence.txt`
                 ├── `roc_coherence_perturb.txt`
                 ├── `roc_consistency.txt`
                 ├── `roc_consistency_perturb.txt`								
                 ├── `roc_cause.txt`
                 ├── `roc_cause_perturb.txt`       										
                 ├── `roc_time.txt`
                 ├── `roc_time_perturb.txt`                    
              └── `invariance_test`
                 ├── `roc_synonym_substitute_perturb.txt`
                 ├── `roc_semantic_substitute_perturb.txt`
                 ├── `roc_contraction_perturb.txt`
                 ├── `roc_delete_punct_perturb.txt`
                 ├── `roc_typos_perturb.txt`
                 ├── `roc_negative_sample.txt`	# sampled negative samples from the discrimination test.	
                 ├── `roc_negative_sample_synonym_substitute_perturb.txt`
                 ├── `roc_negative_sample_semantic_substitute_perturb.txt`
                 ├── `roc_negative_sample_contraction_perturb.txt`
                 ├── `roc_negative_sample_delete_punct_perturb.txt`
                 ├── `roc_negative_sample_typos_perturb.txt`
       ├── `wp`
              └── ...
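The naming pattern in the tree above makes it easy to locate the original/perturbed file pair for an aspect. The helper below is a hypothetical convenience, not part of the toolkit; it only builds paths and assumes nothing about the contents of the files. The wp directory is elided in the tree, so the sketch is exercised with roc only.

```python
from pathlib import Path

# Aspect names as they appear in the roc discrimination_test file names.
DISCRIMINATION_ASPECTS = [
    "lexical_rept", "semantic_rept", "character", "commonsense",
    "coherence", "consistency", "cause", "time",
]


def discrimination_files(data_root, aspect, dataset="roc"):
    """Return (original_path, perturbed_path) for one discrimination aspect."""
    if aspect not in DISCRIMINATION_ASPECTS:
        raise ValueError(f"unknown aspect: {aspect}")
    base = Path(data_root) / "acts_data" / dataset / "discrimination_test"
    return (
        base / f"{dataset}_{aspect}.txt",
        base / f"{dataset}_{aspect}_perturb.txt",
    )


original_path, perturbed_path = discrimination_files("data", "coherence")
print(original_path.as_posix())
print(perturbed_path.as_posix())
```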

II. Tasks

OpenMEVA includes a suite of tasks to test comprehensive capabilities of metrics:

1. Correlation with human scores (based on MAGS)

2. Generalization across generation models and datasets (for learnable metrics, based on MAGS)

3. Judgment in general linguistic features (based on the discrimination test set of ACTS)

4. Robustness to rationality-preserving perturbations (based on the invariance test set of ACTS)

Note: A smaller absolute value of correlation is better.
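One plausible reading of this note, offered here as an assumption rather than a description of the benchmark scripts, is that robustness is measured by correlating metric scores with a 0/1 indicator of whether an example is original or perturbed. Under that reading, a metric that ignores rationality-preserving perturbations yields a correlation near zero. The function names below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; undefined if either list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def invariance_correlation(original_scores, perturbed_scores):
    """Correlate metric scores with labels (1 = original, 0 = perturbed)."""
    scores = list(original_scores) + list(perturbed_scores)
    labels = [1.0] * len(original_scores) + [0.0] * len(perturbed_scores)
    return pearson(labels, scores)


# A metric whose scores do not change under perturbation.
r_invariant = invariance_correlation([0.2, 0.8, 0.5], [0.2, 0.8, 0.5])
# A metric that systematically lowers its score after perturbation.
r_sensitive = invariance_correlation([0.2, 0.8, 0.5], [0.1, 0.6, 0.3])
print(r_invariant, r_sensitive)
```

The first value is zero up to rounding error, while the second is clearly positive, matching the intuition that a smaller absolute correlation signals a more robust metric.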

5. Fast Test

You can test these capabilities of new metrics with the following commands:

cd ./benchmark

# test correlation with human scores and generalization
python ./corr_gen.py

# test judgment
python ./judge.py

# test robustness
python ./robust.py

We take BLEU and Forward Perplexity as examples in the Python files. You can test your own metrics with minor modifications.

How to Cite

@misc{guan2021openmeva,
      title={OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics}, 
      author={Jian Guan and Zhexin Zhang and Zhuoer Feng and Zitao Liu and Wenbiao Ding and Xiaoxi Mao and Changjie Fan and Minlie Huang},
      year={2021},
      eprint={2105.08920},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

It's our honor to help you better explore language generation evaluation with our toolkit and benchmark.

Owner
Conversational AI groups from Tsinghua University