Japanese NLP Library

Overview

Japanese NLP Library


Back to Home

1   Requirements

1.1   Links

  • All code at jProcessing Repo GitHub
  • PyPi Python Package
clone [email protected]:kevincobain2000/jProcessing.git

1.2   Install

In Terminal

bash$ python setup.py install

1.3   History

  • 0.2

    • Sentiment Analysis of Japanese Text
  • 0.1
    • Morphologically Tokenize Japanese Sentence
    • Kanji / Hiragana / Katakana to Romaji Converter
    • Edict Dictionary Search - borrowed
    • Edict Examples Search - incomplete
    • Sentence Similarity between two JP Sentences
    • Run Cabocha(ISO--8859-1 configured) in Python.
    • Longest Common String between Sentences
    • Kanji to Katakana Pronunciation
    • Hiragana, Katakana Chart Parser

2   Libraries and Modules

2.1   Tokenize jTokenize.py

In Python

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--5--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2   Cabocha jCabocha.py

Run Cabocha with original EUCJP or IS0-8859-1 configured encoding, with utf8 python

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

">
<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">はtok>
 chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">をtok>
 chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="" base="" pos="名詞-数" ctype="" cform="" ne="B-DATE">5tok>
  <tok id="5" read="ニチ" base="" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日tok>
  <tok id="6" read="マエ" base="" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前tok>
  <tok id="7" read="" base="" pos="記号-読点" ctype="" cform="" ne="O">、tok>
 chunk>

2.3   Kanji / Katakana /Hiragana to Tokenized Romaji jConvert.py

Uses data/katakanaChart.txt and parses the chart. See katakanaChart.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)
...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]

katakanaChart.txt

2.4   Longest Common String Japanese jProcessing.py

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'Thre was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず

2.5   Similarity between two sentences jProcessing.py

Uses MinHash by checking the overlap http://en.wikipedia.org/wiki/MinHash

English Strings:
>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a,b)
...0.444444444444
Japanese Strings:
>>> from jNlp.jProcessing import *
>>> a = u'これは何ですか?'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789

3   Edict Japanese Dictionary Search with Example sentences

3.1   Sample Ouput Demo

3.2   Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group , and are used in conformance with the Group's licence .

Edict Parser By Paul Goins, see edict_search.py Edict Example sentences Parse by query, Pulkit Kathuria, see edict_examples.py Edict examples pickle files are provided but latest example files can be downloaded from the links provided.

3.3   Charset

Two files

  • utf8 Charset example file if not using src/jNlp/data/edict_examples

    To convert EUCJP/ISO-8859-1 to utf8

    iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8
    
  • ISO-8859-1 edict_dictionary file

Outputs example sentences for a query in Japanese only for ambiguous words.

3.4   Links

Latest Dictionary files can be downloaded here

3.5   edict_search.py

author: Paul Goins License included linkToOriginal:

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')

3.6   edict_examples.py

Note: Only outputs the examples sentences for ambiguous words (if word has one or more senses)
author: Pulkit Kathuria
>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 x線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4   Sentiment Analysis Japanese Text

This section covers (1) Sentiment Analysis on Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese Word Net file name wnjpn-all.tab), SentiWordnet (English SentiWordNet file name SentiWordNet_3.*.txt).

4.1   Wordnet files download links

  1. http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html
  2. http://sentiwordnet.isti.cnr.it/

4.2   How to Use

The following classifier is baseline, which works as simple mapping of Eng to Japanese using Wordnet and classify on polarity score using SentiWordnet.

  • (Adnouns, nouns, verbs, .. all included)
  • No WSD module on Japanese Sentence
  • Uses word as its common sense for polarity score
>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive

4.3   Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0

5   Contacts

Author: pulkit[at]jaist.ac.jp [change at with @]
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

13.2k Jul 07, 2021
🌐 Translation microservice powered by AI

Dot Translate 🌐 A microservice for quick and local translation using A.I. This service starts a local webserver used for neural machine translation.

Dot HQ 48 Nov 22, 2022
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
Multilingual Emotion classification using BERT (fine-tuning). Published at the WASSA workshop (ACL2022).

XLM-EMO: Multilingual Emotion Prediction in Social Media Text Abstract Detecting emotion in text allows social and computational scientists to study h

MilaNLP 35 Sep 17, 2022
Twitter-Sentiment-Analysis - Analysis of twitter posts' positive and negative score.

Twitter-Sentiment-Analysis The hands-on project is in Python 3 Programming class offered by University of Michigan via Coursera. The task is to build

Eszter Pai 1 Jan 03, 2022
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 37 Jan 04, 2023
Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any language

Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any

Little Endian 1 Apr 28, 2022
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022
This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"

Pattern-Exploiting Training (PET) This repository contains the code for Exploiting Cloze Questions for Few-Shot Text Classification and Natural Langua

Timo Schick 1.4k Dec 30, 2022
The RWKV Language Model

RWKV-LM We propose the RWKV language model, with alternating time-mix and channel-mix layers: The R, K, V are generated by linear transforms of input,

PENG Bo 877 Jan 05, 2023
Neural network sequence labeling model

Sequence labeler This is a neural network sequence labeling system. Given a sequence of tokens, it will learn to assign labels to each token. Can be u

Marek Rei 250 Nov 03, 2022
An example project using OpenPrompt under pytorch-lightning for prompt-based SST2 sentiment analysis model

pl_prompt_sst An example project using OpenPrompt under the framework of pytorch-lightning for a training prompt-based text classification model on SS

Zhiling Zhang 5 Oct 21, 2022
Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

ITTR - Pytorch Implementation of the Hybrid Perception Block (HPB) and Dual-Pruned Self-Attention (DPSA) block from the ITTR paper for Image to Image

Phil Wang 17 Dec 23, 2022
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

Ryuichi Yamamoto 1.8k Dec 30, 2022
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 6.2k Dec 31, 2022
Sequence modeling benchmarks and temporal convolutional networks

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN) This repository contains the experiments done in the work An Empirical Evaluati

CMU Locus Lab 3.5k Jan 03, 2023
PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

Ryuichi Yamamoto 279 Dec 09, 2022
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023