A cross-lingual COVID-19 fake news dataset

Overview

CrossFake

An English-Chinese COVID-19 fake and real news dataset from the ICDMW 2021 paper below:
Cross-lingual COVID-19 Fake News Detection.
Jiangshu Du, Yingtong Dou, Congying Xia, Limeng Cui, Jing Ma, Philip S. Yu.

Introduction

The COVID-19 pandemic poses a significant threat to global public health. Meanwhile, a massive amount of misinformation associated with the pandemic advocates unfounded or unscientific claims. Although major social media platforms and news outlets have made an extra effort to debunk COVID-19 misinformation, most of the fact-checking information is in English, whereas some unmoderated COVID-19 misinformation is still circulating in other languages, threatening the health of less informed people in immigrant communities and developing countries (The Vox, New York Times).

In the above paper, we make the first attempt to detect COVID-19 misinformation in a low-resource language (Chinese) using only the fact-checked news in a high-resource language (English).

This repo contains a Chinese-English real & fake news dataset annotated according to existing English fact-checking information. Details of this dataset are described in Dataset Detail.

The highlights of our dataset are as follows:

  • Bilingual news pieces for the same event (fact).
  • Multiple Chinese news pieces for the same event (fact).
  • Comprehensive metadata for each news piece (see below).

Dataset Detail

The table below shows the number of annotated news pieces in each language:

Lang.  Fake  Real  Total
ENG      55    82    137
CHN     101   118    219
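
As a quick sanity check on the table above, the per-language totals and the share of fake news can be recomputed from the fake/real counts. This is only an illustrative sketch: the counts are copied from the table, while the `counts` dictionary and the `summarize` helper are hypothetical names that are not part of the dataset or its code.

```python
# Fake/real counts copied from the table above.
counts = {
    "ENG": {"fake": 55, "real": 82},
    "CHN": {"fake": 101, "real": 118},
}


def summarize(counts):
    """Return the total and the fraction of fake news for each language."""
    summary = {}
    for lang, c in counts.items():
        total = c["fake"] + c["real"]
        summary[lang] = {"total": total, "fake_ratio": c["fake"] / total}
    return summary


summary = summarize(counts)
for lang, stats in summary.items():
    print(f"{lang}: total={stats['total']}, fake ratio={stats['fake_ratio']:.3f}")
```

The recomputed totals should match the Total column (137 and 219).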

The metadata of our dataset is in CrossFake_metadata.xlsx, which includes two sheets (news_fake and news_real). Given a news id, you can find the corresponding news body text in the body_text directory. The meaning of each column of the metadata is given below:

  • Column A (id):

    News id. Chinese real & fake news is annotated according to existing English fact-checking information. Thus, each piece of English news may correspond to multiple pieces of Chinese news from different sources. For example, in the news_fake sheet, the ids 1_1 and 1_2 indicate one piece of English news, corresponding to two pieces of Chinese news.

  • Column B (fact_check_url):

    The fact-checking source of the corresponding English news.

  • Column C (type):

    The news type. Post and Article indicate that the news comes from a social media post or an online article, respectively. Note that we also annotated some clickbait news whose title and body text present contradictory information.

  • Column D (source):

    The news source. Personal and Professional indicate that the news comes from a personal account or a professional source (WHO, NIH, etc.), respectively.

  • Column E (mixed?):

    Whether the news includes mixed content. If the body text of a news piece only contains content related to the checked fact, the piece is annotated as not mixed. Accordingly, news whose content includes events/facts besides the checked fact is regarded as mixed news.

  • Column F (platform):

    The platform where the news is published.

  • Column G (news_url):

    The news source URL. Note that some of the links are invalid because the news has been deleted or removed. We archived the news that was still accessible (see Column H) while curating the dataset.

  • Column H (archive):

    The archived news link. To permanently store the original news, we archived the news source URL.

  • Column I (newstitle):

    The news title.

  • Column J (publish_date):

    The news publishing date.

  • Columns K to R have the same meanings as Columns C to J, but they describe the Chinese news.
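
The id convention described for Column A can be sketched in code. The example below assumes, based on the 1_1 / 1_2 example above, that the part before the underscore identifies the English news piece and the part after it indexes the corresponding Chinese pieces. The function name `group_by_english_news` and the id 2_1 are hypothetical and used only for illustration; this is not code from the repository.

```python
from collections import defaultdict


def group_by_english_news(news_ids):
    """Group ids such as '1_1' and '1_2' under their shared English news index.

    Assumption: the prefix before '_' identifies the English news piece and
    the suffix indexes its Chinese counterparts.
    """
    groups = defaultdict(list)
    for news_id in news_ids:
        english_index, _, _chinese_index = news_id.partition("_")
        groups[english_index].append(news_id)
    return dict(groups)


# '1_1' and '1_2' come from the example above; '2_1' is a made-up id.
example_ids = ["1_1", "1_2", "2_1"]
groups = group_by_english_news(example_ids)
print(groups)
```

Each id can then be used to look up the matching body text in the body_text directory, whose exact file naming is not specified here.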

Case Study

Besides the findings and conclusions presented in our paper, we made some additional interesting observations while collecting the data:

  1. Mixed Fact. For some fake news, the corresponding Chinese news articles present it in the form of a news digest together with other news events. This adds an extra hurdle to fact-checking those news pieces, since only part of the content contains misinformation. A typical example is news_id 8_3 in the news_fake sheet. You can check out the other news pieces whose mixed? column is annotated as Yes.

  2. Misused Fact. For news_real id 9_2, we find a Chinese social post leveraging the fact that "coronavirus can live for up to 4 hours on copper" to promote a copper pot. In this case, even though the title and most of the news content seem legitimate, the connection between "copper kills coronavirus" and "copper pot is good" is still questionable.

  3. Fake News Type. While annotating the Chinese news based on the fact-checked English news, we find that most of the fact-checked fake news from Politifact has no corresponding Chinese news. Those news pieces are usually local news in the United States.

  4. Cross-lingual Fact-checking. For news_real id 9_1, we find a Chinese news piece from a professional news outlet published five days earlier than the fact-checked English Facebook post. This suggests that we could leverage fact information from another language to help fact-check the news. Note that most of the Chinese news in our dataset is published later than the source English news, since most of the checked news events originated in English media.
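
The timing comparison in the last finding can be expressed as a signed lag between the English and Chinese publishing dates (Columns J and R). The sketch below is a minimal illustration under stated assumptions: the dates are invented and are not the actual dates of news_real id 9_1, and `publication_lag_days` is a hypothetical helper. The date format stored in the metadata is not specified here, so dates would first need to be parsed into `datetime.date` objects.

```python
from datetime import date


def publication_lag_days(english_date, chinese_date):
    """Days from the English publication to the Chinese publication.

    A negative value means the Chinese piece was published first.
    """
    return (chinese_date - english_date).days


# Hypothetical dates for illustration only (not taken from the dataset).
english_post = date(2020, 3, 20)
chinese_article = date(2020, 3, 15)

lag = publication_lag_days(english_post, chinese_article)
print(f"Chinese piece lag relative to English piece: {lag} days")
```

A negative lag marks cases like the one above, where the Chinese piece precedes the fact-checked English post.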

Future Directions

Given the current dataset, some future research directions include:

  • The writing style/sentiment/stance differences between fake news and real news.
  • The writing style/sentiment/stance differences between professional news outlets and personal accounts.
  • The information distortion/loss from English news to Chinese news.
  • The temporal patterns of cross-lingual news migration.
  • The title patterns of different news.

Citation

If you use our dataset or code, please cite the paper below:

@inproceedings{du2021cross,
  title={Cross-lingual COVID-19 Fake News Detection},
  author={Du, Jiangshu and Dou, Yingtong and Xia, Congying and Cui, Limeng and Ma, Jing and Yu, Philip S},
  booktitle={Proceedings of the 21st IEEE International Conference on Data Mining Workshops (ICDMW'21)},
  year={2021}
}