Snowball compiler and stemming algorithms

Last update: Jan 07, 2023

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language. Currently ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

This repository contains the source code for the Snowball compiler and the stemming algorithms. The Snowball compiler is written in ISO C, so you'll need a C compiler that supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem". For example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
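
The idea of searching on stems can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_stem`, its `SUFFIXES` list, `build_index` and `search` are hypothetical names invented for this example, and the suffix stripping is a deliberately crude stand-in, not the Snowball English algorithm or its generated code. It is just enough to show how indexing and querying by stem lets a search for connected match documents that contain other forms.

```python
# Toy suffix stripper for illustration only. This is NOT the Snowball
# English stemmer; the suffix list is a hypothetical, hand-picked example.
SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")  # longer suffixes first


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a form of it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Look the query up by its stem rather than by its surface form."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "the connection dropped",
    2: "connecting two networks",
    3: "an unrelated document",
}
index = build_index(docs)
print(search(index, "connected"))  # prints [1, 2]
```

Neither matching document contains the word connected itself; both match because the query and the indexed words reduce to the same stem. A real stemmer handles far more cases than this fixed suffix list does.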

This stem form is often a word itself, but not always, since text search systems (the intended field of use) do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.
