UniDic2UD

Tokenizer, POS-tagger, lemmatizer, and dependency-parser for modern and contemporary Japanese, working on Universal Dependencies.

Basic usage

>>> import unidic2ud
>>> nlp=unidic2ud.load("kindai")
>>> s=nlp("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s)
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=s[7]
>>> print(t.id,t.form,t.lemma,t.upos,t.xpos,t.feats,t.head.id,t.deprel,t.deps,t.misc)
7 欲する 欲する VERB 動詞-一般 _ 8 acl _ SpaceAfter=No|Translit=ホッスル

>>> print(s.to_tree())
    其 <══╗         det(決定詞)
    國 ═╗═╝<╗       obj(目的語)
    を <╝   ║       case(格表示)
  治め ═╗═╗═╝<╗     advcl(連用修飾節)
    ん <╝ ║   ║     aux(動詞補助成分)
    と <══╝   ║     case(格表示)
欲する ═══════╝<╗   acl(連体修飾節)
    者 ═╗═══════╝<╗ nsubj(主語)
    は <╝         ║ case(格表示)
  先づ <══════╗   ║ advmod(連用修飾語)
    其 <══╗   ║   ║ det(決定詞)
    家 ═╗═╝<╗ ║   ║ obj(目的語)
    を <╝   ║ ║   ║ case(格表示)
  齊ふ ═════╝═╝═══╝ root(親)

>>> f=open("trial.svg","w")
>>> f.write(s.to_svg())
>>> f.close()
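
The text printed by print(s) is CoNLL-U, with ten tab-separated columns per token. As a minimal sketch that uses only the standard library, the columns can be read back into dictionaries as follows. The helper name parse_conllu and the two-line sample are illustrative assumptions, not part of unidic2ud:

```python
# Column names follow the CoNLL-U format of Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Hypothetical helper: turn CoNLL-U text into a list of token dicts."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and comment lines such as "# text = ..."
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError("expected 10 tab-separated columns: " + line)
        token = dict(zip(FIELDS, columns))
        # MISC holds key=value pairs separated by "|"
        token["misc"] = dict(
            pair.split("=", 1) for pair in token["misc"].split("|") if "=" in pair
        )
        tokens.append(token)
    return tokens

# Two token lines copied from the output above (tabs written as \t)
sample = (
    "# text = 其國を治めんと欲する者は先づ其家を齊ふ\n"
    "1\t其\t其の\tDET\t連体詞\t_\t2\tdet\t_\tSpaceAfter=No|Translit=ソノ\n"
    "2\t國\t国\tNOUN\t名詞-普通名詞-一般\t_\t4\tobj\t_\tSpaceAfter=No|Translit=クニ\n"
)
for t in parse_conllu(sample):
    print(t["id"], t["form"], t["lemma"], t["upos"], t["misc"].get("Translit"))
```

The sketch keeps every column as a string and does not handle multiword-token ranges, which do not occur in the output shown here.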

unidic2ud.load(UniDic,UDPipe) loads a natural language processing pipeline that uses UniDic for tokenization, POS-tagging, and lemmatization, then UDPipe for dependency parsing. The default is UDPipe="japanese-modern". Available UniDic options are:

UniDic="gendai": Use 現代書き言葉UniDic.
UniDic="spoken": Use 現代話し言葉UniDic.
UniDic="novel": Use 近現代口語小説UniDic.
UniDic="qkana": Use 旧仮名口語UniDic.
UniDic="kindai": Use 近代文語UniDic.
UniDic="kinsei": Use 近世江戸口語UniDic.
UniDic="kyogen": Use 中世口語UniDic.
UniDic="wakan": Use 中世文語UniDic.
UniDic="wabun": Use 中古和文UniDic.
UniDic="manyo": Use 上代語UniDic.
UniDic=None: Use UDPipe for tokenizer, POS-tagger, lemmatizer, and dependency-parser.
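
Because the UniDic option is a plain string, a typo only surfaces when the pipeline is loaded. The following is a sketch of one way to check the name first; check_unidic and load_checked are hypothetical helpers, and the only unidic2ud call used is load(UniDic,UDPipe) as described above:

```python
# UniDic names copied from the list above; None means "UDPipe only".
UNIDIC_NAMES = ("gendai", "spoken", "novel", "qkana", "kindai",
                "kinsei", "kyogen", "wakan", "wabun", "manyo")

def check_unidic(name):
    """Hypothetical helper: reject names that are not in the list above."""
    if name is not None and name not in UNIDIC_NAMES:
        raise ValueError("unknown UniDic option: %r" % (name,))
    return name

def load_checked(name, udpipe="japanese-modern"):
    """Validate the name, then call unidic2ud.load as documented above."""
    checked = check_unidic(name)
    import unidic2ud  # imported lazily so check_unidic works without the package
    return unidic2ud.load(checked, udpipe)

print(check_unidic("kindai"))
```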

unidic2ud.UniDic2UDEntry.to_tree() accepts the option BoxDrawingWidth=2, as in to_tree(BoxDrawingWidth=2), for old terminals that render Box Drawing characters as fullwidth.
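
Box Drawing characters belong to the "Ambiguous" East Asian Width class in Unicode, which is why some terminals draw them two columns wide. As a rough illustration using only the standard library (ambiguous_width_chars is a hypothetical helper, not part of unidic2ud), the affected characters in a tree line can be listed like this:

```python
import unicodedata

def ambiguous_width_chars(text):
    """Return the characters in text whose East Asian Width class is 'A'."""
    return sorted({ch for ch in text if unicodedata.east_asian_width(ch) == "A"})

# One line of the to_tree() output shown earlier
line = "    國 ═╗═╝<╗       obj(目的語)"
print(ambiguous_width_chars(line))
```

If a terminal renders these characters as fullwidth and the tree looks misaligned, BoxDrawingWidth=2 is the option to try.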

You can simply use unidic2ud on the command line:

echo 其國を治めんと欲する者は先づ其家を齊ふ | unidic2ud -U kindai

CaboCha emulator usage

>>> import unidic2ud.cabocha as CaboCha
>>> c=CaboCha.Parser("kindai")
>>> s=c.parse("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s.toString(CaboCha.FORMAT_TREE_LATTICE))
  其-D
  國を-D
治めんと-D
    欲する-D
        者は-------D
          先づ-----D
              其-D |
              家を-D
                齊ふ
EOS
* 0 1D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	O	1<-det-2
* 1 2D 0/1 0.000000
國	名詞,普通名詞,一般,*,*,*,国,クニ,*,NOUN	O	2<-obj-4
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	O	3<-case-2
* 2 3D 0/1 0.000000
治め	動詞,一般,*,*,*,*,収める,オサメ,*,VERB	O	4<-advcl-7
ん	助動詞,*,*,*,*,*,む,ン,*,AUX	O	5<-aux-4
と	助詞,格助詞,*,*,*,*,と,ト,*,ADP	O	6<-case-4
* 3 4D 0/0 0.000000
欲する	動詞,一般,*,*,*,*,欲する,ホッスル,*,VERB	O	7<-acl-8
* 4 8D 0/1 0.000000
者	名詞,普通名詞,一般,*,*,*,者,モノ,*,NOUN	O	8<-nsubj-14
は	助詞,係助詞,*,*,*,*,は,ハ,*,ADP	O	9<-case-8
* 5 8D 0/0 0.000000
先づ	副詞,*,*,*,*,*,先ず,マヅ,*,ADV	O	10<-advmod-14
* 6 7D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	O	11<-det-12
* 7 8D 0/1 0.000000
家	名詞,普通名詞,一般,*,*,*,家,ウチ,*,NOUN	O	12<-obj-14
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	O	13<-case-12
* 8 -1D 0/0 0.000000
齊ふ	動詞,一般,*,*,*,*,整える,トトノフ,*,VERB	O	14<-root
EOS
>>> for c in [s.chunk(i) for i in range(s.chunk_size())]:
...   if c.link>=0:
...     print(c,"->",s.chunk(c.link))
...
其 -> 國を
國を -> 治めんと
治めんと -> 欲する
欲する -> 者は
者は -> 齊ふ
先づ -> 齊ふ
其 -> 家を
家を -> 齊ふ
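
In the lattice output, each line starting with "*" gives a chunk number followed by the number of the chunk it links to (with a trailing "D"), and -1 marks the root. As a sketch under that reading, the same links printed by the loop above can be recovered from the text alone. chunk_links is a hypothetical helper that uses only the standard library:

```python
import re

# Chunk header lines copied from the lattice output above.
HEADERS = """\
* 0 1D 0/0 0.000000
* 1 2D 0/1 0.000000
* 2 3D 0/1 0.000000
* 3 4D 0/0 0.000000
* 4 8D 0/1 0.000000
* 5 8D 0/0 0.000000
* 6 7D 0/0 0.000000
* 7 8D 0/1 0.000000
* 8 -1D 0/0 0.000000
"""

# "* <chunk number> <link target>D ..."
HEADER_RE = re.compile(r"^\* (\d+) (-?\d+)D ")

def chunk_links(lattice):
    """Hypothetical helper: map each chunk number to its link target (-1 = root)."""
    links = {}
    for line in lattice.splitlines():
        m = HEADER_RE.match(line)
        if m:
            links[int(m.group(1))] = int(m.group(2))
    return links

links = chunk_links(HEADERS)
print(links)
# Chunks with a negative target are roots
print([number for number, target in links.items() if target < 0])
```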

CaboCha.Parser(UniDic) is an alias for unidic2ud.load(UniDic,UDPipe="japanese-modern"), and its default is UniDic=None. CaboCha.Tree.toString(format) has five available formats:

CaboCha.FORMAT_TREE: tree (numbered as 0)
CaboCha.FORMAT_LATTICE: lattice (numbered as 1)
CaboCha.FORMAT_TREE_LATTICE: tree + lattice (numbered as 2)
CaboCha.FORMAT_XML: XML (numbered as 3)
CaboCha.FORMAT_CONLL: Universal Dependencies CoNLL-U (numbered as 4)

You can simply use udcabocha on the command line:

echo 其國を治めんと欲する者は先づ其家を齊ふ | udcabocha -U kindai -f 2

-U UniDic specifies the UniDic to use. -f format specifies the output format: 0 to 4 as listed above (the default is -f 0), or 5 to 8 as listed below:

-f 5: to_tree()
-f 6: to_tree(BoxDrawingWidth=2)
-f 7: to_svg()
-f 8: raw DOT graph through Immediate Catena Analysis
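
When udcabocha is driven from a script, the format number is easy to get wrong. The sketch below builds the argument list and rejects numbers outside 0 to 8. udcabocha_args is a hypothetical helper, and leaving out -U when no UniDic is given is an assumption about the command's default:

```python
# Output formats described above: 0 to 4 are the CaboCha formats, 5 to 8 the extras.
FORMATS = {
    0: "tree",
    1: "lattice",
    2: "tree + lattice",
    3: "XML",
    4: "Universal Dependencies CoNLL-U",
    5: "to_tree()",
    6: "to_tree(BoxDrawingWidth=2)",
    7: "to_svg()",
    8: "raw DOT graph through Immediate Catena Analysis",
}

def udcabocha_args(unidic=None, fmt=0):
    """Hypothetical helper: build an argument list for the udcabocha command."""
    if fmt not in FORMATS:
        raise ValueError("format must be an integer from 0 to 8")
    args = ["udcabocha"]
    if unidic is not None:
        # Assumption: omitting -U leaves the choice to the command's default
        args += ["-U", unidic]
    args += ["-f", str(fmt)]
    return args

print(udcabocha_args("kindai", 2))  # prints ['udcabocha', '-U', 'kindai', '-f', '2']
```

The resulting list can be passed to subprocess.run(args, input=text, text=True), provided udcabocha is on the PATH.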

Try notebook for Google Colaboratory.

Usage via spaCy

If you have already installed spaCy 2.1.0 or later, you can use UniDic via the spaCy Language pipeline.

>>> import unidic2ud.spacy
>>> nlp=unidic2ud.spacy.load("kindai")
>>> d=nlp("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(unidic2ud.spacy.to_conllu(d))
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=d[6]
>>> print(t.i+1,t.orth_,t.lemma_,t.pos_,t.tag_,t.head.i+1,t.dep_,t.whitespace_,t.norm_)
7 欲する 欲する VERB 動詞-一般 8 acl  ホッスル

>>> from deplacy.deprelja import deprelja
>>> for b in unidic2ud.spacy.bunsetu_spans(d):
...   for t in b.lefts:
...     print(unidic2ud.spacy.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
其 -> 國を (決定詞)
國を -> 治めんと (目的語)
治めんと -> 欲する (連用修飾節)
欲する -> 者は (連体修飾節)
其 -> 家を (決定詞)
者は -> 齊ふ (主語)
先づ -> 齊ふ (連用修飾語)
家を -> 齊ふ (目的語)

unidic2ud.spacy.load(UniDic,parser) loads a spaCy pipeline that uses UniDic for tokenization, POS-tagging, and lemmatization (as shown above), then uses parser for dependency parsing. The default is parser="japanese-modern", and the other available options are:

parser="ja_core_news_sm": Use spaCy Japanese model (small).
parser="ja_core_news_md": Use spaCy Japanese model (medium).
parser="ja_core_news_lg": Use spaCy Japanese model (large).
parser="ja_ginza": Use GiNZA.
parser="japanese-gsd": Use UDPipe Japanese model.
parser="stanza_ja": Use Stanza Japanese model.

Installation for Linux

A tarball is available for Linux and is installed by default when you use pip:

pip install unidic2ud

With the default installation, UniDic is invoked through Web APIs. To invoke it locally, which is faster, download the UniDic dictionaries you use as follows:

python -m unidic2ud download kindai
python -m unidic2ud dictlist

Licenses of dictionaries and models are: GPL/LGPL/BSD for gendai and spoken; CC BY-NC-SA 4.0 for others.

Installation for Cygwin

Make sure the gcc-g++, python37-pip, and python37-devel packages are installed, and then:

pip3.7 install unidic2ud

Use the python3.7 command in Cygwin instead of python.

Installation for Jupyter Notebook (Google Colaboratory)

!pip install unidic2ud

Benchmarks

Results of 舞姬/雪國/荒野より-Benchmarks

舞姬	LAS	MLAS	BLEX
UniDic="kindai"	81.13	70.37	77.78
UniDic="qkana"	79.25	70.37	77.78
UniDic="kinsei"	72.22	60.71	64.29

雪國	LAS	MLAS	BLEX
UniDic="qkana"	89.29	85.71	81.63
UniDic="kinsei"	89.29	85.71	77.55
UniDic="kindai"	84.96	81.63	77.55

荒野より	LAS	MLAS	BLEX
UniDic="kindai"	76.44	61.54	53.85
UniDic="qkana"	75.39	61.54	53.85
UniDic="kinsei"	71.88	58.97	51.28
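
To compare the tables at a glance, the sketch below picks the highest-scoring UniDic option for each text. The scores are copied from the tables above. Ranking by LAS first and breaking ties by MLAS, then BLEX, is an assumption made for this illustration, and best_unidic is a hypothetical helper:

```python
# Scores copied from the three tables above: (LAS, MLAS, BLEX) per UniDic option.
RESULTS = {
    "舞姬": {
        "kindai": (81.13, 70.37, 77.78),
        "qkana": (79.25, 70.37, 77.78),
        "kinsei": (72.22, 60.71, 64.29),
    },
    "雪國": {
        "qkana": (89.29, 85.71, 81.63),
        "kinsei": (89.29, 85.71, 77.55),
        "kindai": (84.96, 81.63, 77.55),
    },
    "荒野より": {
        "kindai": (76.44, 61.54, 53.85),
        "qkana": (75.39, 61.54, 53.85),
        "kinsei": (71.88, 58.97, 51.28),
    },
}

def best_unidic(scores):
    """Pick the option with the highest LAS, breaking ties by MLAS, then BLEX."""
    # Tuples compare element by element, which gives exactly this ordering
    return max(scores, key=lambda name: scores[name])

for text, scores in RESULTS.items():
    print(text, best_unidic(scores))
```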

Author

Koichi Yasuoka (安岡孝一)

References

安岡孝一: 形態素解析部の付け替えによる近代日本語(旧字旧仮名)の係り受け解析, 情報処理学会研究報告, Vol.2020-CH-124「人文科学とコンピュータ」, No.3 (2020年9月5日), pp.1-8.
安岡孝一: 漢日英Universal Dependencies平行コーパスとその差異, 人文科学とコンピュータシンポジウム「じんもんこん2019」論文集 (2019年12月), pp.43-50.
