OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

Last update: Sep 09, 2022

OceanScript Esoteric Language

Overview

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters - where the final result looks like waves in the ocean.

Unlike its prior versions, OceanScript supports any character, as well as capitalization. A decoded string should look exactly the same as the original text. Note, however, that outlying whitespace characters are stripped from the edges of the text.

How does it work?

OceanScript isn't just a random jumble of characters. These characters have very specific meanings which, once understood, can be used to write oceanscript without the use of the encoder. Take a look at the four tables below:

ㅤ	`<`	`-`	`>`	ㅤ
`^`	a	b	c	`.`
`~`	d	e	f	`.`
`_`	g	h	i	`.`
ㅤ	`<`	`-`	`>`
`^`	j	k	l	`..`
`~`	m	n	o	`..`
`_`	p	q	r	`..`
ㅤ	`<`	`-`	`>`	ㅤ
`^`	s	t	u	`...`
`~`	v	w	x	`...`
`_`	y	z	0	`...`
ㅤ	`<`	`-`	`>`	ㅤ
`^`	1	2	3	`....`
`~`	4	5	6	`....`
`_`	7	8	9	`....`
ㅤ	`<`	`-`	`>`	ㅤ

When typing a character, you need to check the following in order:

What row is my character in? (The rows are denoted by the following characters: ^, ~, _.)
What column is my character in? (The columns are denoted by the following indicators: <, -, >.)
What box is my character in? (The boxes are denoted by ., multiplied by n, where n is the box number. There are 4 boxes.)

Our final product will be known as a "wave". It will contain from 3 to 6 characters. Have a look at some examples below to understand how to write these waves.

Exemplar `a`

Here is an example of typing the character a. It is the easiest character to type from memory, and is great to use as a first example. Let's zoom in on a's box below:

ㅤ	`<`	`-`	`>`	ㅤ
`^`	a	b	c	`.`
`~`	d	e	f	`.`
`_`	g	h	i	`.`
ㅤ	`<`	`-`	`>`	ㅤ

Using the table above, the character a is in the top row of its box, so our first character is ^. Next, you need to check the column. a is in the first column, so our second character is < (pointing to the left). Finally, we need to check which box a is in, and add . according to the table above. The number of dots corresponds to the box number (1-4), and a is in box 1, so we add a single dot.

In oceanscript, a is written as `^<.`

Exemplar `x`

Here is an example of typing the character x. Let's zoom in on x's box below:

ㅤ	`<`	`-`	`>`	ㅤ
`^`	s	t	u	`...`
`~`	v	w	x	`...`
`_`	y	z	0	`...`
ㅤ	`<`	`-`	`>`	ㅤ

Using the table above, the character x is in the second row of its box, so our first character is ~. Next, you need to check the column. x is in the right-hand column, so our second character is > (pointing to the right). Finally, we need to check which box x is in. We will add three dots, because this is box 3 of 4.

In oceanscript, x is written as `~>...`
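The row, column and box checks used in the two exemplars can be sketched in a few lines of Python. This is only an illustration of the lookup, not the library's own code: `encode_char`, `ALPHABET`, `ROWS` and `COLUMNS` are hypothetical names, and the sketch assumes the tables are filled in the order shown above (a to z, then 0, then 1 to 9).

```python
# Hypothetical helper for illustration; not part of the oceanscript package.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # the four boxes, read row by row
ROWS = "^~_"      # top, middle, bottom
COLUMNS = "<->"   # left, centre, right


def encode_char(char: str) -> str:
    """Encode one lowercase letter or digit as a wave."""
    if len(char) != 1 or char not in ALPHABET:
        raise ValueError(f"{char!r} is not in the tables")
    box, cell = divmod(ALPHABET.index(char), 9)  # nine characters per box
    row, column = divmod(cell, 3)                # three characters per row
    return ROWS[row] + COLUMNS[column] + "." * (box + 1)


print(encode_char("a"))  # ^<.
print(encode_char("x"))  # ~>...
```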

Joining waves

Wave is the name used for a single encoded character (z -> _-...). Waves can be freely joined together. Remember that every wave ends with one or more dots, so you can easily work out where each wave ends.

hello -> _-.~-.^>..^>..~>..
foobar -> ~>.~>..~>..^-.^<._>..
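Because every wave is a row indicator, a column indicator and then one to four dots, a joined string of plain waves can be pulled apart with a regular expression. The sketch below is an assumption about how one might do that; `split_plain_waves` is a hypothetical name, and it deliberately ignores rafts, splashes and separators.

```python
import re

# One row indicator, one column indicator, then one to four dots.
WAVE = re.compile(r"[\^~_][<\->]\.{1,4}")


def split_plain_waves(tide: str) -> list[str]:
    """Return the plain waves found in a tide, in order (hypothetical helper)."""
    return WAVE.findall(tide)


print(split_plain_waves("_-.~-.^>..^>..~>.."))
# ['_-.', '~-.', '^>..', '^>..', '~>..']
```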

Joining tides

Tide is the name used for a collection of "waves", so essentially a word converted into oceanscript. For example, _-.~-.^>..^>..~>.. is a tide, meaning "hello". Tides can be joined using either commas, or line breaks. For pretty formatting, or if you have a lot of text, you should use line breaks - but otherwise go with commas.

hello foobar -> _-.~-.^>..^>..~>..,~>.~>..~>..^-.^<._>.. (notice the comma separating these two tides)

OR...

hello foobar ->

_-.~-.^>..^>..~>..

~>.~>..~>..^-.^<._>..
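As a rough sketch of the two joining styles, the snippet below encodes each word into a tide and joins the tides with either a comma or a line break. The names `encode_word` and `join_tides` are hypothetical, the `mode` values simply mirror the "squash" and "stretch" names used by the Python library, and only lowercase letters and digits are handled here.

```python
# Hypothetical sketch; only lowercase a-z and 0-9 are supported here.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
WAVES = {
    char: "^~_"[(i % 9) // 3] + "<->"[i % 3] + "." * (i // 9 + 1)
    for i, char in enumerate(ALPHABET)
}


def encode_word(word: str) -> str:
    """Turn one word into a tide by joining its waves."""
    return "".join(WAVES[char] for char in word)


def join_tides(text: str, mode: str = "squash") -> str:
    """Join tides with commas ("squash") or line breaks ("stretch")."""
    separators = {"squash": ",", "stretch": "\n"}
    if mode not in separators:
        raise ValueError(f"unknown mode {mode!r}")
    return separators[mode].join(encode_word(word) for word in text.split(" "))


print(join_tides("hello foobar"))
print(join_tides("hello foobar", mode="stretch"))
```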

Special characters

Don't fear, oceanscript isn't just limited to a-z and 0-9. Well, it used to be - but a new special character indicator has been added to support any other character.

It is best to keep these special characters out of the ocean, so they need to use a raft (=). Simply put the raft before the character. If you wanted to write the Greek lambda character (λ), it would need a raft, seeing as it's not in the tables above, so it would simply be written as =λ. More common characters (., !, ?) are more likely to appear, and they need rafts too.

? -> =?
^ -> =^
... -> =.=.=.
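A minimal sketch of the raft rule: anything found in the tables becomes a wave, and anything else gets a raft. `encode_with_rafts` is a hypothetical name, and capital letters and whitespace are rejected here because they follow their own rules rather than the raft rule.

```python
# Hypothetical sketch of the raft rule only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
WAVES = {
    char: "^~_"[(i % 9) // 3] + "<->"[i % 3] + "." * (i // 9 + 1)
    for i, char in enumerate(ALPHABET)
}


def encode_with_rafts(text: str) -> str:
    pieces = []
    for char in text:
        if char in WAVES:
            pieces.append(WAVES[char])   # ordinary wave
        elif "A" <= char <= "Z" or char.isspace():
            raise ValueError(f"{char!r} is outside the scope of this sketch")
        else:
            pieces.append("=" + char)    # special character on a raft
    return "".join(pieces)


print(encode_with_rafts("hi?"))  # _-._>.=?
print(encode_with_rafts("..."))  # =.=.=.
```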

Despite not appearing in the table, capitalized a-z characters DO NOT need to use rafts. See below about capitalization.

Capitalization

Capitalization wasn't supported for months following the initial release of oceanscript, but it's now available. To make a character capital, use a splash (*) before the wave.

a -> ^<.
A -> *^<.

For each capital wave you have, you will have to add a splash before each one. Yes, these waves are choppy...

hello -> _-.~-.^>..^>..~>..
HELLO -> *_-.*~-.*^>..*^>..*~>..

Note that you should not use splashes for numeric characters, or any characters that are not alphabetic. This is because these characters do not have case forms (and therefore don't produce splashes! You could call them subtle waves...)
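The splash rule can be sketched as follows, assuming ASCII letters only: a capital letter is encoded as the wave of its lowercase form with a `*` in front, and digits never receive a splash. `encode_cased` is a hypothetical name used for illustration.

```python
# Hypothetical sketch of the splash rule for ASCII letters and digits.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
WAVES = {
    char: "^~_"[(i % 9) // 3] + "<->"[i % 3] + "." * (i // 9 + 1)
    for i, char in enumerate(ALPHABET)
}


def encode_cased(word: str) -> str:
    pieces = []
    for char in word:
        if "A" <= char <= "Z":
            pieces.append("*" + WAVES[char.lower()])  # splash, then the wave
        else:
            pieces.append(WAVES[char])  # KeyError for anything not in the tables
    return "".join(pieces)


print(encode_cased("hello"))  # _-.~-.^>..^>..~>..
print(encode_cased("HELLO"))  # *_-.*~-.*^>..*^>..*~>..
```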

Line breaks

For line breaks in oceanscript, use %. For example, presenting an acronym:

N
A
S
A

In oceanscript, the above acronym would be encoded into *~-..%*^<.%*^<...%*^<..
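Putting line breaks together with the rules from the earlier sections gives a small end-to-end encoder. This is a sketch under stated assumptions, not the library's implementation: `encode_sketch` is a hypothetical name, spaces are always written as commas, and no whitespace stripping is attempted.

```python
# Hypothetical end-to-end encoder sketch.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
WAVES = {
    char: "^~_"[(i % 9) // 3] + "<->"[i % 3] + "." * (i // 9 + 1)
    for i, char in enumerate(ALPHABET)
}


def encode_sketch(text: str) -> str:
    pieces = []
    for char in text:
        if char == "\n":
            pieces.append("%")                        # line break
        elif char == " ":
            pieces.append(",")                        # space between tides
        elif char in WAVES:
            pieces.append(WAVES[char])                # plain wave
        elif "A" <= char <= "Z":
            pieces.append("*" + WAVES[char.lower()])  # splash for capitals
        else:
            pieces.append("=" + char)                 # raft for everything else
    return "".join(pieces)


print(encode_sketch("N\nA\nS\nA"))      # *~-..%*^<.%*^<...%*^<.
print(encode_sketch("Hello world!"))    # *_-.~-.^>..^>..~>..,~-...~>.._>..^>..~<.=!
```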

All identifiers

Identifier	Description
`,`	Represents a space
`\n`	Represents a space
`%`	Represents a line break
`=`	Creates a raft for a single character (the following character will be ignored by the encoder)
`*`	Creates a splash for a wave (the following wave will be capitalized)
`^`	Denotes the top row of a box for a single wave.
`~`	Denotes the middle row of a box for a single wave.
`_`	Denotes the bottom row of a box for a single wave.
`<`	Denotes the left-hand column of a box for a single wave.
`-`	Denotes the central column of a box for a single wave.
`>`	Denotes the right-hand column of a box for a single wave.
`.`	Denotes the box number based on the count of ".".
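To show how the identifiers in the table fit together, here is a minimal decoder sketch that walks an oceanscript string from left to right. It is an assumption-laden illustration rather than the library's parser: `decode_sketch` is a hypothetical name, errors are plain `ValueError`s instead of `OceanScriptError`, and leading or trailing whitespace is not stripped.

```python
# Hypothetical decoder sketch covering every identifier in the table above.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
ROWS, COLUMNS = "^~_", "<->"


def decode_sketch(script: str) -> str:
    out = []
    splash = False  # True while a "*" is waiting for its wave
    i = 0
    while i < len(script):
        char = script[i]
        if splash and char not in ROWS:
            raise ValueError(f"splash must be followed by a wave (index {i})")
        if char in ROWS:
            column = script[i + 1 : i + 2]
            if column == "" or column not in COLUMNS:
                raise ValueError(f"expected a column indicator (index {i + 1})")
            end = i + 2
            while end < len(script) and script[end] == "." and end - i - 2 < 4:
                end += 1
            dots = end - i - 2
            if dots == 0:
                raise ValueError(f"wave has no box dots (index {i})")
            index = (dots - 1) * 9 + ROWS.index(char) * 3 + COLUMNS.index(column)
            letter = ALPHABET[index]
            if splash and not letter.isalpha():
                raise ValueError(f"splash used on a digit (index {i})")
            out.append(letter.upper() if splash else letter)
            splash = False
            i = end
        elif char == "=":  # raft: take the next character literally
            if i + 1 >= len(script):
                raise ValueError("raft with nothing on it")
            out.append(script[i + 1])
            i += 2
        elif char == "*":
            splash = True
            i += 1
        elif char in (",", "\n"):
            out.append(" ")
            i += 1
        elif char == "%":
            out.append("\n")
            i += 1
        else:
            raise ValueError(f"unexpected character {char!r} (index {i})")
    if splash:
        raise ValueError("splash at end of input")
    return "".join(out)


print(decode_sketch("*_-.~-.^>..^>..~>..,~-...~>.._>..^>..~<.=!"))  # Hello world!
```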

Terminology glossary

Word	Description	Example
`raft`	A character (`=`) used to prefix a special character (not a-z, A-Z or 0-9).	`=.`
`ripple`	A character that would make up a wave. Does not include `=`, `*` or `%`.	`^`
`splash`	A character (`*`) used to capitalize the following wave.	`*_<.`
`tide`	A collection of waves to form a word, where the word is the tide.	`*_-.~-.^>..^>..~>..`
`wave`	A single character encoded into oceanscript.	`^<..`

Python Implementation

As a programmer, I just had to make a Python library for this. Once upon a time, it was all just a looming thought in my mind with so much potential. It's been great to be able to create a working, usable program for it, for anyone to use and play around with.

Start by importing the module:

import oceanscript

Encoding into oceanscript

Use the oceanscript.encode method. This method also takes an optional keyword-only argument "mode", which decides whether the encoder uses commas or line breaks to replace spaces. Specify stretch for line breaks, or squash for commas. Defaults to squash.

>>> oceanscript.encode("hello")
'_-.~-.^>..^>..~>..'

>>> oceanscript.encode("Hello world!")
'*_-.~-.^>..^>..~>..,~-...~>.._>..^>..~<.=!'

>>> oceanscript.encode("Hello world!", mode="stretch")
'*_-.~-.^>..^>..~>..'
'~-...~>.._>..^>..~<.=!'

Decoding from oceanscript

Use the oceanscript.decode method. Both modes from the encode method are compatible with decoding (you don't have to specify a mode here).

>>> oceanscript.decode("_-.~-.^>..^>..~>..")
'hello'

>>> oceanscript.decode("*_-.~-.^>..^>..~>..,~-...~>.._>..^>..~<.=!")
'Hello world!'

>>> text = """
    *_-.~-.^>..^>..~>..
    ~-...~>.._>..^>..~<.=!
    """
>>> oceanscript.decode(text)
'Hello world!'

If the oceanscript doesn't look quite right, the parser won't like it. oceanscript.OceanScriptError is raised, and the traceback details are fairly useful for correcting these mistakes.

OceanScriptError has a position attribute, which is the string index at which the exception was raised (the start of the offending wave).

>>> oceanscript.decode("*>....") # capitalizing int
OceanScriptError: Splash indicator not allowed for integers (position 0)

>>> oceanscript.decode("=a") # rafting ascii value
OceanScriptError: Do not use lowercase ascii letters or digits on a raft ('=a'). Use '^<.' instead. (position 0)

>>> oceanscript.decode("^-.~>..#>..") # invalid row indicator '#'
OceanScriptError: '#' is not a valid row indicator (position 7)

>>> oceanscript.decode("^+...") # invalid column indicator '+'
OceanScriptError: '^' indicator expected '<', '-', or '>', but received '+' instead (position 0)

Other tracebacks can appear, too.

Installation

Install with pip, the recommended package installer.

pip install oceanscript

License

Licensed under MIT.

Comments

Add support for BOX-4 truncation
I am making a proposal to allow truncation of BOX-4 characters (1-9) to nullify the distorted/stretched waves produced when chaining integers together in a tide.

There are various ideas within this proposal, but I'm fairly settled with idea 1.

Idea 1

This idea proposes using the lower-case letter o in place of the four dots (....) of a box 4 wave.

Previously 123456789 -> ^<....^-....^>....~<....~-....~>...._<...._-...._>....

With proposal 123456789 -> ^<o^-o^>o~<o~-o~>o_<o_-o_>o
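The rewrite described by this idea can be sketched as a single regular-expression substitution. `truncate_box4` is a hypothetical name, and the sketch assumes the input is valid oceanscript, so that a row indicator followed by a column indicator and four dots is always a box 4 wave.

```python
import re

# A box 4 wave: row indicator, column indicator, then exactly four dots.
BOX4_WAVE = re.compile(r"([\^~_][<\->])\.{4}")


def truncate_box4(script: str) -> str:
    """Replace the four dots of every box 4 wave with "o" (hypothetical helper)."""
    return BOX4_WAVE.sub(r"\1o", script)


before = "^<....^-....^>....~<....~-....~>...._<...._-...._>...."
after = truncate_box4(before)
print(after)                     # ^<o^-o^>o~<o~-o~>o_<o_-o_>o
print(len(before), len(after))   # 54 27
```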

Proposal Advantages

The shape of the letter "o" resembles bubbles, or even fish mouths open. This adheres to the theme created through oceanscript.

It truncates the encoded string by almost 50%.

It makes large numbers, such as phone numbers, look more presentable.

It really suits the characters around it well, so that the "o"'s do not look out of place.

Proposal Disadvantages

Migration.

Idea 2

This idea proposes using the degree symbol (°) in place of the four dots (....) of a box 4 wave.

Previously 123456789 -> ^<....^-....^>....~<....~-....~>...._<...._-...._>....

With proposal 123456789 -> ^<°^-°^>°~<°~-°~>°_<°_-°_>°

Proposal Advantages

The shape of the "°" resembles bubbles, floating upwards. This adheres to the theme created through oceanscript.

The degree symbol restores variety to the heights of the identifiers used.

It truncates the encoded string by almost 50%.

It makes large numbers, such as phone numbers, look more presentable.

Proposal Disadvantages

Migration.

"°" not available across the majority of smartphone and PC keyboards worldwide. This is a major drawback especially due to the common use of numbers in our day to day lives.

Idea 3

This idea proposes using the colon character (:) to replace every occurrence of two dots (..).

Previously 123456789 -> ^<....^-....^>....~<....~-....~>...._<...._-...._>....

With proposal 123456789 -> ^<::^-::^>::~<::~-::~>::_<::_-::_>::

Proposal Advantages

The shape of the ":" partially resembles bubbles. This partially adheres to the theme created through oceanscript.

It truncates the encoded string in the example above by about 33% (each six-character wave becomes four characters).

It makes large numbers, such as phone numbers, look more presentable.

Proposal Disadvantages

Migration.

Only a partial adherence to oceanscript's theme. It starts to look "blocky" when multiple integers are lined up against each other.

Summary

Best appearance?

[ ] Idea 1

[x] Idea 2

[ ] Idea 3

Best accessibility?

[x] Idea 1

[ ] Idea 2

[ ] Idea 3

Easiest migration?

[x] Idea 1

[x] Idea 2

[ ] Idea 3

Best adherence to theme?

[ ] Idea 1

[x] Idea 2

[ ] Idea 3

Best truncation?

[x] Idea 1

[x] Idea 2

[ ] Idea 3

Seeing as idea 3 does not qualify for any of these checks, it should be expunged from the list of ideas. Whilst idea 2 has 4 checks, idea 1 only has 3, falling 1 short. However, accessibility is a massive drawback for idea 2, meaning that idea 1 should probably pass through for only this reason.
Labels: enhancement, question
Opened by Kreusada

Releases(v2.3.0)

v2.3.0(Feb 26, 2022)
Ahhh, so much time went into this release.

Documentation for oceanscript is now available! https://oceanscript.readthedocs.io/en/latest/

The four-dot form (....) is nearing deprecation. Use o instead. The encoder now emits o in preference to the four dots.

v2.2.1(Feb 21, 2022)

This release fixes the decoder's traceback suggestion when decoding a space character. The decoder previously recommended =; it now recommends ,.
v2.2.0(Feb 21, 2022)

BREAKING CHANGE

Text is no longer whitespace stripped by the encoder or decoder. Previously, encoding "\n" would return "", instead of "%". Other issues were also prominent.
v2.1.3(Feb 21, 2022)

This release fixes sections of the documentation and attempts to fix splitwaves().
v2.1.2(Feb 19, 2022)

This release stops the splitwaves() function from omitting special one-length identifiers from the returned tuple when the include_invalid kwarg is set to False. It also adds unit tests.
v2.1.1(Feb 18, 2022)

This release fixes documentation that incorrectly showcased the new and improved functionality added in 2.1.0.
v2.1.0(Feb 18, 2022)
This release exposes splitwaves as a public method used to split waves in an oceanscript string. Additionally, tracebacks are now massively more precise, consistent, and more specific on the error caused when decoding.

This release's hot points:

Enhanced traceback detail

New splitwaves() functionality

Splash indicator severity majorly increased

OceanScriptError.without_position_reference() functionality

The following rules have been added to the splash indicator which may cause strings before v2.1.0 to break:

Splash indicators are now strictly not allowed to be prefixing non-alphabetic waves, and will raise an error post 2.1.0.

Splash redundancy is now evaluated and will raise an error when using the splash indicator for already capitalized alphabetic characters.

Updates for error handling:

without_position_reference() method has been added for OceanScriptError to return OceanScriptError.__str__() without the position referenced prefixed at the beginning of the string.

The position kwarg is now Optional.

v2.0.3(Feb 13, 2022)

This release fixes minor typos in the documentation and removes redundancy in the code such as useless continue statements.
v2.0.2(Feb 13, 2022)

This release includes improvements to traceback wording, such as using "indicator" instead of "marker", as the latter term later fell out of favor.
