Japanese NLP Library

Overview

Japanese NLP Library


Back to Home

1   Requirements

1.1   Links

  • All code at jProcessing Repo GitHub
  • PyPi Python Package
clone [email protected]:kevincobain2000/jProcessing.git

1.2   Install

In Terminal

bash$ python setup.py install

1.3   History

  • 0.2

    • Sentiment Analysis of Japanese Text
  • 0.1
    • Morphologically Tokenize Japanese Sentence
    • Kanji / Hiragana / Katakana to Romaji Converter
    • Edict Dictionary Search - borrowed
    • Edict Examples Search - incomplete
    • Sentence Similarity between two JP Sentences
    • Run Cabocha(ISO--8859-1 configured) in Python.
    • Longest Common String between Sentences
    • Kanji to Katakana Pronunciation
    • Hiragana, Katakana Chart Parser

2   Libraries and Modules

2.1   Tokenize jTokenize.py

In Python

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--5--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2   Cabocha jCabocha.py

Run Cabocha with original EUCJP or IS0-8859-1 configured encoding, with utf8 python

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

">
<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">はtok>
 chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">をtok>
 chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="" base="" pos="名詞-数" ctype="" cform="" ne="B-DATE">5tok>
  <tok id="5" read="ニチ" base="" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日tok>
  <tok id="6" read="マエ" base="" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前tok>
  <tok id="7" read="" base="" pos="記号-読点" ctype="" cform="" ne="O">、tok>
 chunk>

2.3   Kanji / Katakana /Hiragana to Tokenized Romaji jConvert.py

Uses data/katakanaChart.txt and parses the chart. See katakanaChart.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)
...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]

katakanaChart.txt

2.4   Longest Common String Japanese jProcessing.py

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'Thre was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず

2.5   Similarity between two sentences jProcessing.py

Uses MinHash by checking the overlap http://en.wikipedia.org/wiki/MinHash

English Strings:
>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a,b)
...0.444444444444
Japanese Strings:
>>> from jNlp.jProcessing import *
>>> a = u'これは何ですか?'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789

3   Edict Japanese Dictionary Search with Example sentences

3.1   Sample Ouput Demo

3.2   Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group , and are used in conformance with the Group's licence .

Edict Parser By Paul Goins, see edict_search.py Edict Example sentences Parse by query, Pulkit Kathuria, see edict_examples.py Edict examples pickle files are provided but latest example files can be downloaded from the links provided.

3.3   Charset

Two files

  • utf8 Charset example file if not using src/jNlp/data/edict_examples

    To convert EUCJP/ISO-8859-1 to utf8

    iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8
    
  • ISO-8859-1 edict_dictionary file

Outputs example sentences for a query in Japanese only for ambiguous words.

3.4   Links

Latest Dictionary files can be downloaded here

3.5   edict_search.py

author: Paul Goins License included linkToOriginal:

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')

3.6   edict_examples.py

Note: Only outputs the examples sentences for ambiguous words (if word has one or more senses)
author: Pulkit Kathuria
>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 x線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4   Sentiment Analysis Japanese Text

This section covers (1) Sentiment Analysis on Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese Word Net file name wnjpn-all.tab), SentiWordnet (English SentiWordNet file name SentiWordNet_3.*.txt).

4.1   Wordnet files download links

  1. http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html
  2. http://sentiwordnet.isti.cnr.it/

4.2   How to Use

The following classifier is baseline, which works as simple mapping of Eng to Japanese using Wordnet and classify on polarity score using SentiWordnet.

  • (Adnouns, nouns, verbs, .. all included)
  • No WSD module on Japanese Sentence
  • Uses word as its common sense for polarity score
>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive

4.3   Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0

5   Contacts

Author: pulkit[at]jaist.ac.jp [change at with @]
An extensive UI tool built using new data scraped from BBC News

BBC-News-Analyzer An extensive UI tool built using new data scraped from BBC New

Antoreep Jana 1 Dec 31, 2021
Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

Microsoft 1.8k Jan 07, 2023
A fast and lightweight python-based CTC beam search decoder for speech recognition.

pyctcdecode A fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support

Kensho 315 Dec 21, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
kochat

Kochat 챗봇 빌더는 성에 안차고, 자신만의 딥러닝 챗봇 애플리케이션을 만드시고 싶으신가요? Kochat을 이용하면 손쉽게 자신만의 딥러닝 챗봇 애플리케이션을 빌드할 수 있습니다. # 1. 데이터셋 객체 생성 dataset = Dataset(ood=True) #

1 Oct 25, 2021
Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022
Weird Sort-and-Compress Thing

Weird Sort-and-Compress Thing A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by

Douglas 1 Jan 03, 2022
Contains descriptions and code of the mini-projects developed in various programming languages

TexttoSpeechAndLanguageTranslator-project introduction A pleasant application where the client will be given buttons like play,reset and exit. The cli

Adarsh Reddy 1 Dec 22, 2021
A NLP program: tokenize method, PoS Tagging with deep learning

IRIS NLP SYSTEM A NLP program: tokenize method, PoS Tagging with deep learning Report Bug · Request Feature Table of Contents About The Project Built

Zakaria 7 Dec 13, 2022
NLP tool to extract emotional phrase from tweets 🤩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features Train python main.py --dataset brazil-flights C

wang zhang 0 Jun 28, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
Augmenty is an augmentation library based on spaCy for augmenting texts.

Augmenty: The cherry on top of your NLP pipeline Augmenty is an augmentation library based on spaCy for augmenting texts. Besides a wide array of high

Kenneth Enevoldsen 124 Dec 29, 2022
Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. C

Raphael Sourty 224 Nov 29, 2022
In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

Transformers are all you need In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a

Aymen Berriche 8 Apr 13, 2022
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

Megagon Labs 160 Dec 23, 2022
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation In this repo you can find the code of the Supervised Hybrid Audio Segmentatio

Machine Translation @ UPC 21 Dec 20, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

3.2k Dec 30, 2022
Yuqing Xie 2 Feb 17, 2022