Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

Overview

This repository contains code and data for evaluating model performance in crosslinguistic low-resource settings, using morphological segmentation as the test case. For details, see the paper "Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation", to appear in Transactions of the Association for Computational Linguistics.

The arXiv version is available as arXiv:2201.01845:

@misc{liu2022datadriven,
      title={Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation}, 
      author={Zoey Liu and Emily Prud'hommeaux},
      year={2022},
      eprint={2201.01845},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Prerequisites

Install the following:

(1) Python 3

(2) Morfessor

(3) CRFsuite

(4) OpenNMT

Code

The code directory contains the scripts used to run the experiments.

Collect initial data

Create a resources folder to hold the initial data for each language included in the experiments. Because the experiments were run in several stages, the initial data for different languages sit in different subdirectories within resources (please excuse this).

The data for the three Mexican languages (Yorem Nokki, Nahuatl and Wixarika) came from this paper.

(1) download the data from the public repository

(2) for each language, combine all the data from the training, development, and test sets; do this separately for the *src files and the *tgt files, keeping the same order for both

(3) rename the combined data files by language, e.g., mayo_src and mayo_tgt for Yorem Nokki, nahuatl_src and nahuatl_tgt for Nahuatl

(4) put the data files within resources
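The combining step in (2) can be sketched as follows. This is a minimal illustration, not part of the repository: the function name combine_splits and the split file names in the usage comment (train_src, dev_src, test_src) are assumptions, so substitute the names used in the downloaded data.

```python
from pathlib import Path


def combine_splits(parts, combined):
    """Concatenate the files in `parts`, in order, into `combined`.

    Returns the number of lines written.
    """
    lines = []
    for part in parts:
        # splitlines() drops line endings, so a file without a final
        # newline cannot glue its last word onto the next file's first.
        lines.extend(Path(part).read_text(encoding="utf-8").splitlines())
    Path(combined).write_text(
        "".join(line + "\n" for line in lines), encoding="utf-8"
    )
    return len(lines)


# Hypothetical usage (file names are assumptions):
# combine_splits(["train_src", "dev_src", "test_src"], "resources/mayo_src")
# combine_splits(["train_tgt", "dev_tgt", "test_tgt"], "resources/mayo_tgt")
```

Passing the splits in the same order for the src and tgt files keeps each word on the same line number as its segmentation.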

The data for Persian came from here.

(1) download the data from the public repository

(2) combine the training, development, and test sets into one data file

(3) rename the combined data file as persian

(4) put the single data file within resources

The data for German, Zulu and Indonesian came from this paper.

(1) download the data from the public repository

(2) put the downloaded supplement folder within resources

The data for English, Russian, Turkish and Finnish came from this repo.

(1) download the git repo

(2) put the downloaded NeuralMorphemeSegmentation folder within resources

Summary of language codes (with alternatives) and data directories for running experiments

Yorem Nokki: mayo resources/

Nahuatl: nahuatl resources/

Wixarika: wixarika resources/

English: english/eng resources/NeuralMorphemeSegmentation/morphochal10data/

German: german/ger resources/supplement/seg/ger

Persian: persian resources/

Russian: russian/ru resources/NeuralMorphemeSegmentation/data/

Turkish: turkish/tur resources/NeuralMorphemeSegmentation/morphochal10data/

Finnish: finnish/fin resources/NeuralMorphemeSegmentation/morphochal10data/

Zulu: zulu/zul resources/supplement/seg/zul

Indonesian: indonesian/ind resources/supplement/seg/ind
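For scripting over several languages, the summary above can be written as a lookup table. This is only a convenience sketch: the names LANGUAGES and data_dir are hypothetical and do not exist in the repository, and the entries are copied as listed (including the missing trailing slash on some directories).

```python
# Language -> (language codes, data directory), copied from the summary above.
LANGUAGES = {
    "Yorem Nokki": (("mayo",), "resources/"),
    "Nahuatl": (("nahuatl",), "resources/"),
    "Wixarika": (("wixarika",), "resources/"),
    "English": (("english", "eng"), "resources/NeuralMorphemeSegmentation/morphochal10data/"),
    "German": (("german", "ger"), "resources/supplement/seg/ger"),
    "Persian": (("persian",), "resources/"),
    "Russian": (("russian", "ru"), "resources/NeuralMorphemeSegmentation/data/"),
    "Turkish": (("turkish", "tur"), "resources/NeuralMorphemeSegmentation/morphochal10data/"),
    "Finnish": (("finnish", "fin"), "resources/NeuralMorphemeSegmentation/morphochal10data/"),
    "Zulu": (("zulu", "zul"), "resources/supplement/seg/zul"),
    "Indonesian": (("indonesian", "ind"), "resources/supplement/seg/ind"),
}


def data_dir(code):
    """Return the data directory listed for a language code."""
    for codes, directory in LANGUAGES.values():
        if code in codes:
            return directory
    raise KeyError(f"unknown language code: {code}")
```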

Basic running of the code

Create an experiments folder with a subfolder for each language, e.g., Zulu:

mkdir experiments

mkdir experiments/zulu

Generate data (an example)

with replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r with --k 500

without replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r without --k 500
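The difference between the two sampling modes can be illustrated with a short sketch. It is not taken from segmentation_data.py, whose internals are not described here; the function sample_pairs and its parameters are assumptions made for illustration.

```python
import random


def sample_pairs(pairs, k, replacement, seed=0):
    """Draw k (word, segmentation) pairs from `pairs`.

    With replacement, the same pair may be drawn more than once and k may
    exceed len(pairs). Without replacement, every drawn pair comes from a
    different position in `pairs`, and k must not exceed len(pairs).
    """
    rng = random.Random(seed)  # fixed seed so a draw can be reproduced
    if replacement:
        return rng.choices(pairs, k=k)
    return rng.sample(pairs, k)  # raises ValueError if k > len(pairs)
```

With a data size of 500, sampling with replacement can therefore yield fewer than 500 distinct words, while sampling without replacement needs at least 500 items in the source data.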

Training models: Morfessor

Train Morfessor models

python3 code/morfessor/morfessor.py --input experiments/zulu/500/with/ --lang zul

python3 code/morfessor/morfessor.py --input experiments/zulu/500/without/ --lang zul

Generate evaluation scripts for Morfessor model results

python3 code/morf_shell.py --input experiments/zulu/500/ --lang zul

Evaluate Morfessor model results

bash zulu_500_morf_eval.sh

Training models: CRF

Generate CRF shell script

e.g., generating the shell script for a CRF of order 3 (3-CRF)

python3 code/crf_order.py --input experiments/zulu/500/ --lang zul --r with --order 3

Training models: Seq2seq

Generate configuration .yaml files

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r with

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r without

Generate the PBS file (which also contains the commands to train the Seq2seq model)

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r with

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r without

Gather training results for a given language

Again take Zulu as an example. Make sure that for a given data set size (e.g., 500) and sampling method (e.g., with replacement), the folder experiments/zulu/500/with contains three subfolders:

(1) morfessor for all *eval* files from Morfessor;

(2) higher_orders for all *eval* files from k-CRF;

(3) seq2seq for all *eval* files from Seq2seq

Then run:

python3 code/gather.py --input experiments/zulu/ --lang zul --short zulu.txt --full zulu_full.txt --long zulu_details.txt
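Before running gather.py, the folder layout described above can be checked with a small sketch like the one below. The helper missing_result_folders is hypothetical and not part of the repository; it only tests that the three subfolders exist, not that they contain *eval* files.

```python
from pathlib import Path

# Subfolder names as listed above.
EXPECTED = ("morfessor", "higher_orders", "seq2seq")


def missing_result_folders(run_dir):
    """Return the expected subfolders that are absent from `run_dir`."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED if not (run_dir / name).is_dir()]


# Hypothetical usage:
# missing = missing_result_folders("experiments/zulu/500/with")
# if missing:
#     print("missing subfolders:", ", ".join(missing))
```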

Testing

Testing the best CRF

e.g., 4-CRFs trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_crf.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --order 4 --r with --k 50

Testing the best Seq2seq

e.g., trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_seq2seq.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --r with --k 50

Do the same for every language
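To repeat the CRF testing step across languages, the command line can be built programmatically. The sketch below is an assumption-laden convenience, not repository code: the function crf_test_command is hypothetical, it assumes each language has an experiments/<folder>/ directory laid out like the Zulu example, and the best CRF order may differ by language, so pass the order found for each one.

```python
import shlex


def crf_test_command(folder, code, data, size=500, n=100, order=4, r="with", k=50):
    """Build the testing_crf.py command line, using only the flags shown above."""
    return shlex.join([
        "python3", "code/testing_crf.py",
        "--input", f"experiments/{folder}/{size}/",
        "--data", data,
        "--lang", code,
        "--n", str(n),
        "--order", str(order),
        "--r", r,
        "--k", str(k),
    ])


# Reproduces the Zulu example; other languages need their own folder,
# code, data directory and CRF order.
print(crf_test_command("zulu", "zul", "resources/supplement/seg/zul/"))
```

The printed command can be pasted into a shell or collected into a script, one line per language.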

Generating alternative splits

Gather features of the data sets and generate heuristic/adversarial data splits

python3 code/heuristics.py --input experiments/zulu/ --lang zul --output yayyy/ --split A --generate

Gather features of new unseen test sets

python3 code/new_test_heuristics.py --input experiments/zulu/ --output yayyy/ --lang zul

Yayyy: Full Results

Get them here

Running analyses and making plots

See code/plot.R for analysis and making fun plots
