Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

Overview

This repository contains code and data for evaluating model performance in crosslinguistic low-resource settings, using morphological segmentation as the test case. For more information, see the paper "Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation", to appear in Transactions of the Association for Computational Linguistics.

An arXiv version is available as arXiv:2201.01845.

@misc{liu2022datadriven,
      title={Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation}, 
      author={Zoey Liu and Emily Prud'hommeaux},
      year={2022},
      eprint={2201.01845},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Prerequisites

Install the following:

(1) Python 3

(2) Morfessor

(3) CRFsuite

(4) OpenNMT

Code

The code directory contains the code used to run the experiments.

Collect initial data

Create a resources folder to hold the initial data for each language included in the experiments. The experiments were performed at different stages, so the initial data for different languages sit in different subdirectories within resources (please excuse this).

The data for three Mexican languages came from this paper.

(1) download the data from the public repository

(2) for each language, combine all the data from the training, development, and test set; this applies to both the *src files and the *tgt files.

(3) rename the combined data files after the language, e.g., mayo_src and mayo_tgt for Yorem Nokki, or nahuatl_src and nahuatl_tgt for Nahuatl.

(4) put the data files within resources
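Step (2) above can be sketched in Python as follows. This is a minimal illustration and not part of the repository; the function name combine_splits and the split file names in the usage note are hypothetical, since the downloaded data may name its files differently.

```python
from pathlib import Path


def combine_splits(split_files, combined_path):
    """Concatenate the given split files, in order, into one file."""
    combined_path = Path(combined_path)
    with combined_path.open("w", encoding="utf-8") as out:
        for split_file in split_files:
            text = Path(split_file).read_text(encoding="utf-8")
            # Make sure each split ends with a newline so the last line of
            # one file is not glued to the first line of the next.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return combined_path
```

For Yorem Nokki, for instance, one might call combine_splits(["train.src", "dev.src", "test.src"], "resources/mayo_src") and then repeat the call for the *tgt files. Passing the splits in the same order for both calls keeps line i of mayo_src paired with line i of mayo_tgt.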

The data for Persian came from here.

(1) download the data from the public repository

(2) combine the training, development, and test sets into one data file

(3) rename the combined data file as persian

(4) put the single data file within resources

The data for German, Zulu and Indonesian came from this paper.

(1) download the data from the public repository

(2) put the downloaded supplement folder within resources

The data for English, Russian, Turkish and Finnish came from this repo.

(1) download the git repo

(2) put the downloaded NeuralMorphemeSegmentation folder within resources

Summary of (alternative) language codes and data directories for running experiments

Each entry gives the language, its code(s), and its data directory.

Yorem Nokki: mayo; resources/

Nahuatl: nahuatl; resources/

Wixarika: wixarika; resources/

English: english or eng; resources/NeuralMorphemeSegmentation/morphochal10data/

German: german or ger; resources/supplement/seg/ger

Persian: persian; resources/

Russian: russian or ru; resources/NeuralMorphemeSegmentation/data/

Turkish: turkish or tur; resources/NeuralMorphemeSegmentation/morphochal10data/

Finnish: finnish or fin; resources/NeuralMorphemeSegmentation/morphochal10data/

Zulu: zulu or zul; resources/supplement/seg/zul

Indonesian: indonesian or ind; resources/supplement/seg/ind

Basic running of the code

Create an experiments folder and a subfolder for each language, e.g., Zulu:

mkdir experiments

mkdir experiments/zulu

Generate data (an example)

with replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r with --k 500

without replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r without --k 500
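The difference between the two sampling modes can be illustrated with a short Python sketch. It is an assumption that segmentation_data.py samples in this way; only the --r with / --r without and --k options come from the commands above, and the function name sample_words and its seed parameter are hypothetical.

```python
import random


def sample_words(words, k, replacement, seed=None):
    """Draw k items from words, with or without replacement."""
    rng = random.Random(seed)
    if replacement:
        # Each draw is independent, so the same item can appear more than once.
        return rng.choices(words, k=k)
    # Every item is drawn at most once, which requires k <= len(words).
    return rng.sample(words, k)
```

With replacement, a data set of size k can be drawn even from fewer than k items; without replacement, asking for more items than are available raises a ValueError.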

Training models: Morfessor

Train Morfessor models

python3 code/morfessor/morfessor.py --input experiments/zulu/500/with/ --lang zul

python3 code/morfessor/morfessor.py --input experiments/zulu/500/without/ --lang zul

Generate evaluation scripts for Morfessor model results

python3 code/morf_shell.py --input experiments/zulu/500/ --lang zul

Evaluate Morfessor model results

bash zulu_500_morf_eval.sh

Training models: CRF

Generate CRF shell script

e.g., generating 3-CRF shell script

python3 code/crf_order.py --input experiments/zulu/500/ --lang zul --r with --order 3

Training models: Seq2seq

Generate configuration .yaml files

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r with

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r without

Generate a PBS file (which also contains the code to train the Seq2seq model)

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r with

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r without

Gather training results for a given language

Again, take Zulu as an example. For a given data set size (e.g., 500) and sampling method (e.g., with replacement), make sure that the folder experiments/zulu/500/with contains three subfolders:

(1) morfessor for all *eval* files from Morfessor;

(2) higher_orders for all *eval* files from k-CRF;

(3) seq2seq for all *eval* files from Seq2seq
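Before running the gathering step, the folder layout described above can be checked with a small Python helper. This is a sketch rather than part of the repository; the function name missing_result_folders is hypothetical, and it checks only that the three subfolders exist, not that they contain *eval* files.

```python
from pathlib import Path

# Subfolder names taken from the list above.
EXPECTED_SUBFOLDERS = ("morfessor", "higher_orders", "seq2seq")


def missing_result_folders(run_dir):
    """Return the expected subfolders that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (run_dir / name).is_dir()]
```

For example, missing_result_folders("experiments/zulu/500/with") returns an empty list when all three subfolders are in place.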

Then run:

python3 code/gather.py --input experiments/zulu/ --lang zul --short zulu.txt --full zulu_full.txt --long zulu_details.txt

Testing

Testing the best CRF

e.g., 4-CRFs trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_crf.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --order 4 --r with --k 50

Testing the best Seq2seq

e.g., trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_seq2seq.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --r with --k 50

Do the same for every language
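One way to repeat a step across languages is to build the command lines from a table of language codes and data directories. The sketch below only prints the CRF testing commands and does not run them. The flags are copied from the Zulu example above; the function name crf_test_command, the assumption that each experiments subfolder is named after the long language name, and the trailing slashes on the data directories are illustrative choices. The CRF order should be set to whichever order performed best for each language; 4 is just the value from the Zulu example.

```python
# Experiments subfolder -> (language code, data directory), for three of the
# languages listed in the summary above.
LANGUAGES = {
    "zulu": ("zul", "resources/supplement/seg/zul/"),
    "german": ("ger", "resources/supplement/seg/ger/"),
    "indonesian": ("ind", "resources/supplement/seg/ind/"),
}


def crf_test_command(folder, code, data_dir, size=500, n=100, order=4, r="with", k=50):
    """Build the argument list for one testing_crf.py call."""
    return [
        "python3", "code/testing_crf.py",
        "--input", f"experiments/{folder}/{size}/",
        "--data", data_dir,
        "--lang", code,
        "--n", str(n),
        "--order", str(order),
        "--r", r,
        "--k", str(k),
    ]


for folder, (code, data_dir) in LANGUAGES.items():
    print(" ".join(crf_test_command(folder, code, data_dir)))
```

Each argument list can also be passed to subprocess.run once the printed commands look right.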

Generating alternative splits

Gather features of the data sets and generate heuristic/adversarial data splits

python3 code/heuristics.py --input experiments/zulu/ --lang zul --output yayyy/ --split A --generate

Gather features of new unseen test sets

python3 code/new_test_heuristics.py --input experiments/zulu/ --output yayyy/ --lang zul

Yayyy: Full Results

Get them here

Running analyses and making plots

See code/plot.R for the analyses and for making fun plots
