Repository containing the code for the An-Gocair text normaliser

Last update: Jun 28, 2022

Scottish Gaelic Text Normaliser

This repository contains the code and resources for the Scottish Gaelic text normalisation project. Once the repo is cloned, its top-level functions let you normalise individual phrases or whole documents.

Installation

To use the program, clone the repo and install its dependencies in a Python 3 virtual environment.

Instructions

from GaelicTextNormaliser import TextNormaliser

normaliser = TextNormaliser(from_config="config.yaml")

normaliser.normalise_doc(doc="Bha rìgh òg Easaidh Ruagh an dèigh dha'n oighreachd fhaotainn da fèin ri mòran àbhachd, ag amharc a mach dè a chordadh ris,'s dè thigeadh r'a nadur.")

"Bha rìgh òg Easaidh Ruadh an dèidh dhan oighreachd fhaotainn da fhèin ri mòran àbhachd, ag amharc a-mach dè a chòrdadh ris,'s dè thigeadh ra nàdar."

Alternatively, a web app is available at https://www.garg.ed.ac.uk/an_gocair.
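The call shown above normalises a single string. For a whole text file, one possible wrapper is sketched below. It assumes only that the normaliser object exposes `normalise_doc(doc=...)` and returns a string, as in the example above. The function name `normalise_file` and the choice to split on blank lines are illustrative assumptions, not part of this repository.

```python
from pathlib import Path


def normalise_file(normaliser, in_path, out_path):
    """Normalise a UTF-8 text file paragraph by paragraph.

    `normaliser` is any object with a normalise_doc(doc=...) method that
    returns a string (assumed here to match TextNormaliser's behaviour).
    Returns the number of paragraphs processed.
    """
    text = Path(in_path).read_text(encoding="utf-8")
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    normalised = [normaliser.normalise_doc(doc=p) for p in paragraphs]
    Path(out_path).write_text("\n\n".join(normalised) + "\n", encoding="utf-8")
    return len(normalised)
```

With the package installed, this could be called as `normalise_file(TextNormaliser(from_config="config.yaml"), "in.txt", "out.txt")`, where the two file names are placeholders.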

Acknowledgements

Scottish Gaelic Lexicon

The lexicon file is provided by Michael Bauer, Scottish Gaelic linguist, author and lead collaborator on the Am Faclair Beag Scottish Gaelic dictionary. The lexicon is a reformatted version of the dictionary that makes use of Michael's extensive labelling of traditional Gaelic spellings and common misspellings. This resource is vital to the success of the memory-based approach.
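To illustrate what a memory-based approach can look like, here is a minimal sketch of a lookup that maps known non-standard spellings to standard forms. The word pairs are taken from the input/output example earlier in this README. The dictionary format, the names `LEXICON` and `lookup_normalise`, and the capitalisation handling are assumptions for illustration; the real lexicon's structure is not described here.

```python
import re
import unicodedata

# Hypothetical lexicon: non-standard form -> standard form.
# Pairs are drawn from the worked example in this README only.
_RAW_LEXICON = {
    "ruagh": "ruadh",
    "dèigh": "dèidh",
    "fèin": "fhèin",
    "chordadh": "chòrdadh",
    "nadur": "nàdar",
}
# Store keys and values in NFC so accented letters compare reliably.
LEXICON = {
    unicodedata.normalize("NFC", k): unicodedata.normalize("NFC", v)
    for k, v in _RAW_LEXICON.items()
}

WORD = re.compile(r"\w+")


def lookup_normalise(text, lexicon=LEXICON):
    """Replace each word found in the lexicon with its standard form."""
    text = unicodedata.normalize("NFC", text)

    def replace(match):
        word = match.group(0)
        standard = lexicon.get(word.lower())
        if standard is None:
            return word  # unknown words pass through unchanged
        if word[0].isupper():
            # Preserve an initial capital, e.g. in proper names.
            return standard[0].upper() + standard[1:]
        return standard

    return WORD.sub(replace, text)
```

A plain lookup like this cannot handle changes that depend on context or span several words, which is where rules come in.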

Rules for Normalisation

The lexical and grammatical rules for normalisation were the result of collaboration between the project leader, Dr Will Lamb, and Bauer. Both Lamb and Bauer, as fluent Gaelic speakers and experienced proofreaders, were able to provide the linguistic rules to be translated into Python code.
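As a rough sketch of how a linguistic rule might be expressed in Python, the snippet below encodes three rewrites as regular expressions. The rewrites are inferred from the input/output example earlier in this README; the actual rules in the repository are not shown here and are likely more context-sensitive. The names `RULES` and `apply_rules` are hypothetical.

```python
import re

# Illustrative rules only, inferred from the README's worked example.
# Each entry is (compiled pattern, replacement).
RULES = [
    (re.compile(r"\ba mach\b"), "a-mach"),  # hyphenate the two-word form
    (re.compile(r"\bdha'n\b"), "dhan"),     # drop the apostrophe
    (re.compile(r"\br'a\b"), "ra"),         # drop the apostrophe
]


def apply_rules(text, rules=RULES):
    """Apply each rewrite rule in order and return the result."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

These patterns match only lowercase forms written with a straight apostrophe; real text would also need capitalised forms and typographic apostrophes to be considered.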

Scottish Gaelic Part of Speech Tagger

For further conditioning in the rule-based approach, part-of-speech tags were necessary. The code and models for POS tagging were very kindly provided by Loïc Boizou. The scripts were altered slightly to work within the Python object.

Further Acknowledgements

This program was funded by the Data-Driven Innovation initiative (DDI), delivered by the University of Edinburgh and Heriot-Watt University for the Edinburgh and South East Scotland City Region Deal. DDI is an innovation network helping organisations tackle challenges for industry and society by doing data right to support Edinburgh in its ambition to become the data capital of Europe. The project was delivered by the Edinburgh Futures Institute (EFI), one of five DDI innovation hubs which collaborates with industry, government and communities to build a challenge-led and data-rich portfolio of activity that has an enduring impact.
