GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training @ KDD 2020

Last update: Dec 27, 2022

Overview

Original implementation of the paper GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training.

GCC is a contrastive learning framework for unsupervised pre-training of structural graph representations. It achieves state-of-the-art results on 10 datasets across 3 graph mining tasks.

Installation

Requirements

Linux with Python ≥ 3.6
PyTorch ≥ 1.4.0
DGL ≥ 0.4.3 and < 0.5
Install the remaining Python dependencies with pip install -r requirements.txt.
Install RDKit with conda install -c conda-forge rdkit=2019.09.2.

Quick Start

Pretraining

Pre-training datasets

python scripts/download.py --url https://drive.google.com/open?id=1JCHm39rf7HAJSp-1755wa32ToHCn2Twz --path data --fname small.bin
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/b37eed70207c468ba367/?dl=1 --path data --fname small.bin

E2E

Pretrain E2E with K = 255:

bash scripts/pretrain.sh <gpu> --batch-size 256
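
The link between `--batch-size 256` and K = 255 comes from in-batch negatives: each query is contrasted with its own positive key and with the other 255 keys in the same batch. Below is a minimal plain-Python sketch of that idea, assuming an InfoNCE-style loss over dot-product similarities. The function names and the default temperature of 0.07 are assumptions for illustration, not the repository's code.

```python
import math


def num_in_batch_negatives(batch_size):
    """With in-batch negatives, every other key in the batch is a negative."""
    return batch_size - 1


def info_nce_in_batch(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where keys[i] is the positive for queries[i].

    queries and keys are equal-length lists of equal-length float vectors.
    (Hypothetical helper; temperature default is an assumption.)
    """
    total = 0.0
    for i, query in enumerate(queries):
        # Similarity of this query to every key in the batch.
        logits = [
            sum(a * b for a, b in zip(query, key)) / temperature
            for key in keys
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Negative log-probability of picking the positive key.
        total += log_denominator - logits[i]
    return total / len(queries)
```

Calling `num_in_batch_negatives(256)` gives 255, which matches the K quoted above.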

MoCo

Pretrain MoCo with K = 16384; m = 0.999:

bash scripts/pretrain.sh <gpu> --moco --nce-k 16384
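
For readers unfamiliar with the two MoCo settings, K is the size of a queue of past keys used as negatives, and m is the momentum with which the key encoder follows the query encoder. The following plain-Python sketch illustrates both mechanisms under those assumptions. `momentum_update` and `NegativeQueue` are hypothetical names, and the code stands in for tensors with plain lists rather than reproducing the repository's implementation.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter slowly toward the query encoder.

    key <- m * key + (1 - m) * query, applied element-wise.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-size FIFO queue of past keys that serve as negatives."""

    def __init__(self, max_size=16384):
        # A deque with maxlen discards the oldest entries automatically.
        self._keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        """Append the newest batch of keys, evicting the oldest if full."""
        self._keys.extend(batch_keys)

    def negatives(self):
        """Return the current negatives, oldest first."""
        return list(self._keys)

    def __len__(self):
        return len(self._keys)
```

Because the queue decouples the number of negatives from the batch size, K can be much larger than in the E2E setting without increasing the batch.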

Download Pretrained Models

Instead of pretraining from scratch, you can download our pretrained models.

python scripts/download.py --url https://drive.google.com/open?id=1lYW_idy9PwSdPEC7j9IH5I5Hc7Qv-22- --path saved --fname pretrained.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/cabec37002a9446d9b20/?dl=1 --path saved --fname pretrained.tar.gz

Downstream Tasks

Downstream datasets

python scripts/download.py --url https://drive.google.com/open?id=12kmPV3XjVufxbIVNx5BQr-CFM9SmaFvM --path data --fname downstream.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/2535437e896c4b73b6bb/?dl=1 --path data --fname downstream.tar.gz

Generate embeddings on multiple datasets with

bash scripts/generate.sh <gpu> <load_path> <dataset_1> <dataset_2> ...

For example:

bash scripts/generate.sh 0 saved/Pretrain_moco_True_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_32_hid_64_samples_2000_nce_t_0.07_nce_k_16384_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999/current.pth usa_airport kdd imdb-binary
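
The checkpoint directory name in the example above appears to encode the pretraining hyperparameters as `key_value` pairs. When comparing several checkpoints, it can help to read those values back programmatically. The sketch below is a hypothetical helper, not part of the repository: it assumes the naming pattern seen in this one example and only extracts the keys listed in `NUMERIC_KEYS`.

```python
import os
import re

# Keys taken from the example directory name; their meaning is inferred
# from the name alone and is an assumption.
NUMERIC_KEYS = (
    "layer", "lr", "decay", "bsz", "hid", "samples", "nce_t", "nce_k",
    "rw_hops", "restart_prob", "deg", "pos", "momentum",
)


def _to_number(text):
    """Convert text to int or float when possible, else return it unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_run_name(load_path):
    """Extract hyperparameters from the directory that holds a checkpoint."""
    run_name = os.path.basename(os.path.dirname(load_path))
    parsed = {}
    for key in NUMERIC_KEYS:
        # Match "_<key>_<value>" where the value runs to the next underscore.
        match = re.search(r"(?:^|_)" + re.escape(key) + r"_([^_]+)", run_name)
        if match is not None:
            parsed[key] = _to_number(match.group(1))
    return parsed
```

For the path shown above, `parse_run_name(...)["nce_k"]` would be 16384 and `["momentum"]` would be 0.999, matching the MoCo settings used for pretraining.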

Node Classification

Unsupervised (Table 2 freeze)

Run baselines on multiple datasets with bash scripts/node_classification/baseline.sh <hidden_size> <baseline:prone/graphwave> usa_airport h-index.

Evaluate GCC on multiple datasets:

bash scripts/generate.sh <gpu> <load_path> usa_airport h-index
bash scripts/node_classification/ours.sh <load_path> <hidden_size> usa_airport h-index

Supervised (Table 2 full)

Finetune GCC on multiple datasets:

bash scripts/finetune.sh <load_path> <gpu> usa_airport

Note that this finetunes the whole network and will take much longer than the frozen experiments above.

Graph Classification

Unsupervised (Table 3 freeze)

bash scripts/generate.sh <gpu> <load_path> imdb-binary imdb-multi collab rdt-b rdt-5k
bash scripts/graph_classification/ours.sh <load_path> <hidden_size> imdb-binary imdb-multi collab rdt-b rdt-5k

Supervised (Table 3 full)

bash scripts/finetune.sh <load_path> <gpu> imdb-binary

Similarity Search (Table 4)

Run baseline (graphwave) on multiple datasets with bash scripts/similarity_search/baseline.sh <hidden_size> graphwave kdd_icdm sigir_cikm sigmod_icde.

Run GCC:

bash scripts/generate.sh <gpu> <load_path> kdd icdm sigir cikm sigmod icde
bash scripts/similarity_search/ours.sh <load_path> <hidden_size> kdd_icdm sigir_cikm sigmod_icde
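
Conceptually, the two commands above first embed the nodes of each graph and then match nodes across a pair of graphs such as `kdd_icdm`. The sketch below shows one common way to score such a search: rank candidates in the second graph by cosine similarity and measure how often the known counterpart lands in the top k. It is an illustrative assumption, not the logic of `ours.sh`; all function names and the `ground_truth` mapping are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top_k_matches(query, candidates, k):
    """Indices of the k candidates most similar to the query."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda j: cosine(query, candidates[j]),
        reverse=True,
    )
    return ranked[:k]


def hit_rate_at_k(emb_a, emb_b, ground_truth, k):
    """Fraction of nodes in graph A whose true match in B is in the top k.

    ground_truth maps an index in emb_a to its counterpart index in emb_b.
    """
    hits = sum(
        1
        for i, j in ground_truth.items()
        if j in top_k_matches(emb_a[i], emb_b, k)
    )
    return hits / len(ground_truth)
```

A brute-force ranking like this is quadratic in the number of nodes, which is acceptable for a sketch but may need an approximate nearest-neighbor index on larger graphs.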

❗ Common Issues

"XXX file not found" when running pretraining/downstream tasks.

Please make sure you've downloaded the pretraining dataset or downstream task datasets according to GETTING_STARTED.md.

Server crashes/hangs after launching pretraining experiments.

In addition to a GPU, the pretraining stage requires substantial CPU and RAM. A crash or hang usually means that the CPU or RAM on your machine is exhausted. You can decrease `--num-workers` (the number of dataloader workers running on CPU) and `--num-copies` (the number of dataset copies held in RAM). For the lowest-resource configuration, try `--num-workers 1 --num-copies 1`.

If this still fails, please upgrade your machine :). In the meantime, you can still download our pretrained model and evaluate it on downstream tasks.

Having difficulty installing RDKit.

See the P.S. section in [this](https://github.com/THUDM/GCC/issues/12#issue-752080014) post.

Citing GCC

If you use GCC in your research or wish to refer to the baseline results, please use the following BibTeX.

@article{qiu2020gcc,
  title={GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training},
  author={Qiu, Jiezhong and Chen, Qibin and Dong, Yuxiao and Zhang, Jing and Yang, Hongxia and Ding, Ming and Wang, Kuansan and Tang, Jie},
  journal={arXiv preprint arXiv:2006.09963},
  year={2020}
}

Acknowledgements

Part of this code is inspired by Yonglong Tian et al.'s CMC: Contrastive Multiview Coding.

Owner

THUDM
