Maha is a text processing library specially developed to deal with Arabic text.

Last update: Nov 27, 2022

Overview

An Arabic text processing library intended for use in NLP applications

Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to reach out on our Discord server. If you would like to submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

Documentation is hosted at ReadTheDocs.

Contributing

Maha welcomes and encourages everyone to contribute. Contributions are always appreciated. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments

Time: Add the ability to parse Hijri dates
What does this pull request change?

Closes #27.

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 6
Added distance to dimension parsing
What does this pull request change?

Resolves #15.

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

parsing highlight
opened by TRoboto 5
Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names
What does this pull request change?

This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 4
Add pyupgrade to pre-commit and upgrade to future-style type annotations
What does this pull request change?

Upgrades to new type annotations style.

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

maintenance
opened by TRoboto 3
Deprecate and remove `datasets` module and host datasets on Hugging Face instead
What does this pull request change?

Removes datasets module.

Datasets are now hosted here

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

breaking changes deprecation
opened by TRoboto 3
Add the ability to parse names from text
What does this pull request change?

Adds #24. Depends on #40

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 3
Add a deprecation system
What does this pull request change?

Closes #23

Adds three deprecation decorators: one for functions, one for parameters, and one for default parameters.
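As a rough illustration of the idea, the sketch below shows how decorators for deprecated functions and deprecated parameters could be written with the standard library. The names deprecated_fn, deprecated_param, old_clean and new_clean are hypothetical and are not taken from Maha; the decorator for default parameters is not covered here.

```python
import functools
import warnings


def deprecated_fn(message):
    """Mark a whole function as deprecated (hypothetical helper, not Maha's API)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Warn on every call, pointing at the caller's line.
            warnings.warn(
                f"{func.__name__} is deprecated: {message}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def deprecated_param(name, message):
    """Warn only when the deprecated keyword argument `name` is passed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if name in kwargs:
                warnings.warn(
                    f"parameter {name!r} of {func.__name__} is deprecated: {message}",
                    DeprecationWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_fn("use new_clean instead")
def old_clean(text):
    return text.strip()


@deprecated_param("lower", "lowercasing will be removed")
def new_clean(text, lower=False):
    text = text.strip()
    return text.lower() if lower else text
```

Calling old_clean(...) or passing lower=... to new_clean(...) emits a DeprecationWarning while still returning the usual result, so callers get time to migrate.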

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

development
opened by saedx1 3
Prepare for the next release of Maha (v0.3.0)
This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

Generated changelogs for release v0.3.0.

Bumped pypi version to v0.3.0.

Updated the citation information.
opened by github-actions[bot] 2
Ordinal: Add support to `بعد` in ordinal parsing
What does this pull request change?

Closes #48.

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature
opened by TRoboto 2
Numeral: Add support for hierarchical parsing
What does this pull request change?

Closes #25

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature
opened by TRoboto 2
Prepare for the next release of Maha (v0.2.0)
This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

Generated changelogs for release v0.2.0.

Bumped pypi version to v0.2.0.

Updated the citation information.
opened by github-actions[bot] 2
Update ci.yml
Check support for Python 3.10

What does this pull request change? It checks whether the library supports Python 3.10.

...

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[ ] tox passes
opened by PAIN-BARHAM 1
[pre-commit.ci] pre-commit autoupdate
updates:

github.com/pre-commit/pre-commit-hooks: v4.3.0 → v4.4.0

github.com/psf/black: 22.6.0 → 22.12.0

github.com/pycqa/isort: 5.10.1 → 5.11.4

github.com/asottile/pyupgrade: v2.37.3 → v3.3.1
opened by pre-commit-ci[bot] 1
Add the option to ignore Harakat when removing or replacing
What problem are you trying to solve?

Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

Examples (if relevant)

Current:

>>> from maha.cleaners.functions import remove
>>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
>>> output
يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى

Suggested:

>>> from maha.cleaners.functions import remove
>>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
>>> output
يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
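One possible way to get this behavior, sketched here with only the standard re module, is to build a pattern that tolerates any run of Harakat after each letter of the expression. This is an assumption about how the feature could work, not Maha's implementation; HARAKAT, strip_harakat and remove_ignoring_harakat are hypothetical names, and the character range covers only the marks U+064B to U+0652.

```python
import re

# Arabic marks from fathatan (U+064B) through sukun (U+0652).
HARAKAT = "[\u064B-\u0652]"


def strip_harakat(text: str) -> str:
    """Return text with all Harakat in the range above removed."""
    return re.sub(HARAKAT, "", text)


def remove_ignoring_harakat(text: str, expression: str) -> str:
    """Remove a literal expression from text, matching regardless of Harakat."""
    bare = strip_harakat(expression)
    # Allow any run of Harakat after each base character of the expression.
    pattern = "".join(re.escape(ch) + HARAKAT + "*" for ch in bare)
    cleaned = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by the removed word.
    return " ".join(cleaned.split())


text = "يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى"
result = remove_ignoring_harakat(text, "اللغة")
```

The rest of the text keeps its diacritics, because only the matched span is removed rather than the whole string being normalized first.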

Definition of Done

It must adhere to the coding style used in the defined cleaner functions.

The implementation should cover most use cases.

Adding tests

feature request
opened by xaleel 1
Wrong parsed name using name dimension
What happened?

The name parser extracted wrong names, such as بي and شكرا.

Example:

text: أريد البحث في سجل الإنفاق الخاص بي
output: [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

I expected only names that appear in the names dataset to be extracted.

Python version

3.8

What operating system are you using?

Linux

Code to reproduce the issue

>>> from maha.parsers.functions import parse_dimension
>>> text = "أريد البحث في سجل الإنفاق الخاص بي"
>>> extracted = parse_dimension(text, names=True)
>>> extracted
[Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

Relevant log output

No response
bug parsing
opened by PAIN-BARHAM 0
Add feature to parse duration period
What problem are you trying to solve?

Parsing a duration from text that gives it as the difference between two dates.

Examples (if relevant)

>>> from maha.parsers.functions import parse_dimension
>>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
>>> output
DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
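To make the requested behavior concrete, here is a minimal standalone sketch that finds a "بين <year> و <year>" range and returns the number of years between the two. It uses only the standard re module and handles just this one phrasing with four-digit years; YEAR_RANGE and year_span are hypothetical names, not part of Maha.

```python
import re

# "between <year> and <year>", with four-digit years only.
YEAR_RANGE = re.compile(r"بين\s+(\d{4})\s+و\s*(\d{4})")


def year_span(text):
    """Return the number of years between the two years of a range, or None."""
    match = YEAR_RANGE.search(text)
    if match is None:
        return None
    start, end = (int(group) for group in match.groups())
    return abs(end - start)


span = year_span("عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي")
print(span)  # 200
```

A real implementation would also need to cover full dates, other connectors, and units other than years.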

Definition of Done

It must adhere to the coding style used in the defined dimensions, in particular the duration dimension.

The implementation should cover most use cases.

Adding tests

feature request
opened by PAIN-BARHAM 1

Adding the parser functionality to Processors

What problem are you trying to solve?

Adding the parser functionality to Processors to parse different dimensions.

Examples (if relevant)

>>> from pathlib import Path
>>> import maha
>>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
>>> data = resource_path.read_text()
>>> print(data)

الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
لما حد يسالني بتختفي كتير لية =..
زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
#Windows11 is on the horizon. What feature are you looking forward to
Get vaccinate #savethesaviour
Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit

>>> from maha.processors import FileProcessor
>>> proc = FileProcessor(resource_path)
>>> parsed = proc.parse_dimension(time=True)
>>> parsed
[Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
 Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
 Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]

Definition of Done

It must adhere to the coding style.
The implementation should cover most use cases.
Adding tests.

good first issue feature request parsing

opened by PAIN-BARHAM 0

Releases(v0.3.0)

v0.3.0(Apr 4, 2022)

Check out the changelog for this release.
v0.2.0(Nov 16, 2021)

Check out the changelog for this release.
v0.1.2(Sep 23, 2021)
Quick fix:

Added readme badges

Fixed missing regex dependency


Owner

Mohammad Al-Fetyani

Machine Learning Engineer

