Amazon Multilingual Counterfactual Dataset (AMCD)

Last update: Sep 20, 2022

Overview

This repository contains a dataset described in the paper:

"I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Reviews". James O’Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, Danushka Bollegala. EMNLP'21 (arXiv version).

The dataset contains sentences from Amazon customer reviews, sampled from the Amazon product review dataset and annotated for counterfactual detection (CFD), a binary classification task. Counterfactual statements describe events that did not or cannot take place. They may be identified as statements of the form "If p was true, then q would be true", that is, assertions whose antecedent (p) and consequent (q) are known or assumed to be false.

The key features of this dataset are:

The dataset is multilingual and contains sentences in English, German, and Japanese.
The labeling was done by professional linguists and high quality was ensured.
The dataset comes with the annotation guidelines and definitions, which were developed by professional linguists. We also provide the clue word lists, likewise compiled by professional linguists. These lists contain words that are typical of counterfactual sentences and were used for initial data filtering.
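
The initial filtering by clue words can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the phrases in CLUE_PHRASES are made-up examples rather than entries from the provided lists, and the function names are hypothetical.

```python
import re

# Hypothetical clue phrases for illustration only; the real lists ship with the dataset.
CLUE_PHRASES = ["wish", "if only", "would have", "should have", "could have"]


def build_clue_pattern(phrases):
    """Compile one case-insensitive regex that matches any clue phrase as whole words."""
    # Put longer phrases first so multi-word phrases are tried before shorter ones.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def filter_candidates(sentences, pattern):
    """Keep only the sentences that contain at least one clue phrase."""
    return [s for s in sentences if pattern.search(s)]


sentences = [
    "I wish I would have loved this one, but I didn't.",
    "The battery lasts all day.",
    "If only the strap were longer.",
]
pattern = build_clue_pattern(CLUE_PHRASES)
candidates = filter_candidates(sentences, pattern)
```

A clue-word match only marks a sentence as a candidate; the label itself still has to come from annotation. The word-boundary matching used here suits space-delimited text such as English or German. For Japanese, plain substring matching would be a more natural starting point.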

Please see the paper for data statistics and a detailed description of data collection and annotation.

For the dataset format, please see README.txt.

Cite

If you use this dataset in your research, please cite the paper.

License Summary

The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.

