Get list of common stop words in various languages in Python

Last update: Dec 21, 2022

Python Stop Words

Table of contents

Overview
Available languages
Installation
Basic usage
Python compatibility

Overview

Get list of common stop words in various languages in Python.

Available languages

Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian

Installation

stop-words is available on PyPI:

http://pypi.python.org/pypi/stop-words

Install it with pip:

$ pip install stop-words

Alternatively, clone the stop-words git repository. The --recursive flag also fetches the submodule that contains the stop word lists:

$ git clone --recursive git://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install

Basic usage

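from stop_words import get_stop_words

# A language can be given by its code or by its full name.
stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

# For an unsupported language this returns an empty list instead of raising.
stop_words = safe_get_stop_words('unsupported language')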
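
A common next step is removing stop words from a piece of text. The following is a minimal sketch, not part of the package: remove_stop_words is a hypothetical helper, the regex tokenizer is a simplification, and the short SAMPLE_STOP_WORDS list is a stand-in so the snippet runs without the package installed. In practice the list would come from get_stop_words('en').

```python
import re

# Stand-in list for illustration only; in practice use:
#     from stop_words import get_stop_words
#     stop_words = get_stop_words('en')
SAMPLE_STOP_WORDS = ["the", "of", "is", "in", "a", "it", "at"]


def remove_stop_words(text, stop_words):
    """Return the words of `text` that are not in `stop_words` (case-insensitive)."""
    stop_set = {word.lower() for word in stop_words}  # set for fast membership checks
    tokens = re.findall(r"\w+(?:'\w+)?", text)  # words, optionally with an apostrophe part
    return [token for token in tokens if token.lower() not in stop_set]


text = "The University of Waterloo Stratford Campus is located in Stratford"
filtered = remove_stop_words(text, SAMPLE_STOP_WORDS)
print(" ".join(filtered))
```

Converting the list to a set once keeps the per-token lookup cheap, which matters for longer texts.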

Python compatibility

Python Stop Words is compatible with:

Python 2.7
Python 3.4
Python 3.5
Python 3.6
Python 3.7

Comments

Enforces packaging of eggs into folders.

We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

This leads to the following error where the initializer tries to open a directory when it is actually a zip archive.

Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

opened by hfjn 10
add indonesian stop word list

Adds a stop word list for the Indonesian language and the corresponding mapping in the JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

opened by frankdevans 4
Can you handle a text?

Hello, there is no description of how to use the package. I have the following text:

The University of Waterloo Stratford Campus is located in Stratford Ontario Canada. It is one of the three satellite campuses of the University of Waterloo a member of the U15 Group of Canadian Research Universities. Established in June 2009 the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo.

How can I use python-stop-words to filter out the stop words and get a text without them?

thank you very much!!
question

opened by PapaMadeleine2022 2
Python 3 support
List of improvements:

Tests

Python 3 support

Dev installation via zc.buildout

Continuous integration via Travis

Can you make a new release once the branch is merged?

Regards
enhancement
opened by Fantomas42 2
languages.json is missing, if you don't git clone with `--recursive`

languages.json is still missing, if you don't clone with --recursive

$ git clone git://github.com/Alir3z4/python-stop-words.git
$ cd python-stop-words
$ python3 setup.py install
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    version=__import__("stop_words").get_version(),
  File "./stop_words/__init__.py", line 9, in <module>
    with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

opened by marcindulak 1
Update submodule to the latest

Include the stop words for newly added languages:

https://github.com/Alir3z4/stop-words/pull/4
https://github.com/Alir3z4/stop-words/pull/5
https://github.com/Alir3z4/stop-words/pull/6
https://github.com/Alir3z4/stop-words/pull/7
enhancement

opened by norkans7 1
Decode error AND Add catalan language to LANGUAGE_MAPPING
1. Add catalan language to LANGUAGE_MAPPING. I previously added the stop word file to the "stop-words" project.

2. Decode error

stop_words = [line.strip().decode('utf-8') for line in language_file.readlines()]

strip() returns a copy of the string with leading and trailing whitespace characters removed. But if the string contains non-ASCII characters, calling strip() before decoding causes a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can not decode byte 0xc3 in position 34: unexpected end of data).

The workaround is to reorder the call:

stop_words = [line.decode('utf-8').strip() for line in language_file.readlines()]
opened by dmiro 1
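
Why the order matters can be sketched as follows. This is an illustration under an assumption, not a trace of the original failure: it assumes the byte-level strip() treated the byte 0xA0 as whitespace, which is simulated here by passing that byte to strip() explicitly. The sample word is arbitrary.

```python
# "perà" encoded as UTF-8 ends with the two bytes 0xC3 0xA0.
raw_line = "perà\n".encode("utf-8")

# Assumption: the byte-level strip also removes 0xA0 (simulated explicitly here).
stripped_first = raw_line.strip(b" \t\r\n\xa0")  # leaves a dangling 0xC3 byte

try:
    stripped_first.decode("utf-8")
    strip_then_decode_ok = True
except UnicodeDecodeError as error:
    strip_then_decode_ok = False
    print("strip, then decode failed:", error.reason)

# Decoding first keeps multi-byte characters intact; strip() then works on text.
decoded_first = raw_line.decode("utf-8").strip()
print("decode, then strip:", decoded_first)
```

Stripping after decoding operates on whole characters, so it cannot cut a multi-byte sequence in half.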
Defining custom stop words in NLTK

Hi, I want to know how to define my own custom stop words. I'm currently developing a sentiment analysis for my local language, using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

Hope you can help me thanks.

opened by AllikDaniel 0
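
One straightforward approach, sketched here with plain Python rather than any package-specific API, is to merge a base list with your own words. build_stop_words and the sample words are hypothetical; the base list stands in for whatever get_stop_words returns for your language.

```python
def build_stop_words(base_words, custom_words=(), keep_words=()):
    """Combine a base stop word list with custom additions and exclusions."""
    combined = {word.lower() for word in base_words}
    combined |= {word.lower() for word in custom_words}  # domain-specific additions
    combined -= {word.lower() for word in keep_words}    # words that should not be filtered
    return sorted(combined)


# Stand-in for a list obtained from get_stop_words(...)
base = ["the", "a", "not", "is"]
stop_words = build_stop_words(base, custom_words=["rt", "via"], keep_words=["not"])
print(stop_words)
```

The keep_words argument is there because, for tasks such as sentiment analysis, it may be worth keeping negations like "not" in the text.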

Example does not work on Python 3.7.0

It returns an empty list []

from stop_words import get_stop_words

stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

stop_words = safe_get_stop_words('unsupported language')
print(stop_words)

opened by nadavvin 2

Releases (2018.7.23)

2018.7.23 (Jul 23, 2018)

Fixed #14: languages.json is missing, if you don't git clone with --recursive.

Feature: Support latest version of Python (3.7+).

Feature #22: Enforces packaging of eggs into folders.

Update the stop-words repository to get the latest languages.

Fixed failing Travis builds and tests caused by bootstrap.

PyPI: https://pypi.org/project/stop-words/2018.7.23/

To install:

$ pip install stop-words==2018.7.23

2015.2.23.1 (Feb 23, 2015)

Fix #9: Missing languages.json file that breaks the installation.

PyPI: https://pypi.python.org/pypi/stop-words/2015.2.23

2015.2.23 (Feb 23, 2015)

Feature: Using the cache is optional

Feature: Filtering stopwords

Special thanks to Taras Labiak @kissarat

PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21

2015.2.21 (Feb 21, 2015)

Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json

Fix: Made paths OS-independent

PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21

Special thanks to Taras Labiak @kissarat

2015.1.31 (Feb 1, 2015)

Feature #5: Decode error AND Add catalan language to LANGUAGE_MAPPING.

Feature: Update stop-words dictionary.

2015.1.22 (Jan 22, 2015)

Feature: Tests

Feature: Python 3 support

Feature: Dev installation via zc.buildout

Feature: Continuous integration via Travis

PyPI: https://pypi.python.org/pypi/stop-words/2015.1.22

2015.1.19 (Jan 19, 2015)

Feature #3: Handle language code, cache and custom errors


Owner

Alireza Savand

I am Alireza Savand, a Software Architect.

PyPI project: https://pypi.org/project/stop-words/
