An Unsupervised Detection Framework for Chinese Jargons in the Darknet

This repo is the Python 3 implementation of "An Unsupervised Detection Framework for Chinese Jargons in the Darknet" (Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22).

Introduction

This project proposes a Chinese jargon detection framework based on unsupervised learning.

Requirements

pip install -r requirements.txt

Data

Due to the sensitivity of darknet information, we do not distribute the dataset directly. Some samples of the dataset are shown in /dataset/sample.csv, and contact information is given below for readers who want to request the raw corpus.

Please contact Liang Ke ([email protected]) for the Darknet corpus dataset.
The Modern Chinese Dictionary (the 7th edition) that we used for cross-corpus comparison is from here.

Code

Preprocess the raw corpus with preprocess.py to obtain the clean corpus.
Find out-of-vocabulary words using newWordsDiscovey.py and add them to the tokenizer dictionary.
Pretrain the word-based DC-BERT model on the clean corpus using pretrain.py.
Generate word embeddings with pretrained DC-BERT using genEmbedding.py.
Construct seed criminal keywords with findSeedKeywords.py. We provide an example list of seed criminal keywords for reference; you can add or delete words to suit your task.
Find jargon candidates (words that are related to the relevant cybercrimes and are very likely to be jargons) with findCandidate.py.
Finally, you can obtain real darknet Chinese jargons detected by our framework using findJargon.py.
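
The steps above are meant to be run in the listed order. As a rough sketch, the small driver below chains them. It assumes that each script can be started from the repository root without command-line arguments, which this README does not confirm; check each script for required paths or options first. The names PIPELINE, build_commands and run_pipeline are hypothetical helpers and are not part of this repo.

```python
import subprocess
import sys

# Script names are copied from the steps above, in the order they are listed.
PIPELINE = [
    ("preprocess.py", "clean the raw corpus"),
    ("newWordsDiscovey.py", "find out-of-vocabulary words"),
    ("pretrain.py", "pretrain the word-based DC-BERT model"),
    ("genEmbedding.py", "generate word embeddings"),
    ("findSeedKeywords.py", "construct seed criminal keywords"),
    ("findCandidate.py", "find jargon candidates"),
    ("findJargon.py", "detect jargons"),
]


def build_commands(python=sys.executable):
    """Return one command (argument list) per step, in pipeline order."""
    # Assumption: every script runs without extra arguments.
    return [[python, script] for script, _ in PIPELINE]


def run_pipeline(repo_root="."):
    """Run every step in order from repo_root and stop at the first failure."""
    for (script, purpose), cmd in zip(PIPELINE, build_commands()):
        print(f"[{script}] {purpose}")
        # check=True raises CalledProcessError if a step exits with an error,
        # so later steps never run on incomplete output.
        subprocess.run(cmd, check=True, cwd=repo_root)
```

Calling run_pipeline() from the repository root would then execute all seven steps. Since pretraining is likely the slowest step, you may prefer to run the scripts one by one the first time.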

Citation

waiting for camera-ready
