UniDic2UD

Tokenizer, POS-tagger, lemmatizer, and dependency-parser for modern and contemporary Japanese, working on Universal Dependencies.

Basic usage

>>> import unidic2ud
>>> qkana=unidic2ud.load("qkana")
>>> s=qkana("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s)
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=s[7]
>>> print(t.id,t.form,t.lemma,t.upos,t.xpos,t.feats,t.head.id,t.deprel,t.deps,t.misc)
7 欲する 欲する VERB 動詞-一般 _ 8 acl _ SpaceAfter=No|Translit=ホッスル
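
The printed result is tab-separated CoNLL-U text, so it can also be processed without the token objects. The following is a minimal sketch using only the standard library; `parse_conllu` and `parse_misc` are hypothetical helpers (not part of unidic2ud), and the sketch assumes plain integer IDs as in the output above. The two sample rows are tokens 7 and 14 from that output.

```python
# Column names in the order defined by the CoNLL-U format.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

def parse_misc(misc):
    """Split a MISC value such as 'SpaceAfter=No|Translit=...' into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))

def parse_conllu(text):
    """Return one dict per token line, skipping comments and blank lines.

    Assumption: IDs and heads are plain integers (no multiword ranges).
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(CONLLU_FIELDS, line.split("\t")))
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])
        row["misc"] = parse_misc(row["misc"])
        rows.append(row)
    return rows

sample = (
    "# two rows copied from the output above\n"
    "7\t欲する\t欲する\tVERB\t動詞-一般\t_\t8\tacl\t_\tSpaceAfter=No|Translit=ホッスル\n"
    "14\t齊ふ\t整える\tVERB\t動詞-一般\t_\t0\troot\t_\tSpaceAfter=No|Translit=トトノフ\n"
)
rows = parse_conllu(sample)
print(rows[0]["form"], rows[0]["head"], rows[0]["misc"]["Translit"])
```

In practice the input would be `str(s)` rather than a hand-written sample.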

>>> f=open("trial.svg","w")
>>> f.write(s.to_svg())
>>> f.close()

trial.svg

unidic2ud.load(UniDic,UDPipe) loads a natural language processing pipeline that uses UniDic for tokenization, POS-tagging, and lemmatization, and then UDPipe for dependency parsing. Available UniDic options are:

The default UDPipe is UDPipe="japanese-gsd" from Universal Dependencies 2.4 Models.

CaboCha emulator usage

>>> import unidic2ud.cabocha as CaboCha
>>> qkana=CaboCha.Parser("qkana")
>>> s=qkana.parse("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s.toString(CaboCha.FORMAT_TREE_LATTICE)) 
  其-D
  國を-D
治めんと-D
    欲する-D
        者は-------D
          先づ-----D
              其-D |
              家を-D
                齊ふ
EOS
* 0 1D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	1<-det-2
* 1 2D 0/1 0.000000
國	名詞,普通名詞,一般,*,*,*,国,クニ,*,NOUN	2<-obj-4
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	3<-case-2
* 2 3D 0/1 0.000000
治め	動詞,一般,*,*,*,*,収める,オサメ,*,VERB	4<-advcl-7
ん	助動詞,*,*,*,*,*,む,ン,*,AUX	5<-aux-4
と	助詞,格助詞,*,*,*,*,と,ト,*,ADP	6<-case-4
* 3 4D 0/0 0.000000
欲する	動詞,一般,*,*,*,*,欲する,ホッスル,*,VERB	7<-acl-8
* 4 8D 0/1 0.000000
者	名詞,普通名詞,一般,*,*,*,者,モノ,*,NOUN	8<-nsubj-14
は	助詞,係助詞,*,*,*,*,は,ハ,*,ADP	9<-case-8
* 5 8D 0/0 0.000000
先づ	副詞,*,*,*,*,*,先ず,マヅ,*,ADV	10<-advmod-14
* 6 7D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	11<-det-12
* 7 8D 0/1 0.000000
家	名詞,普通名詞,一般,*,*,*,家,ウチ,*,NOUN	12<-obj-14
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	13<-case-12
* 8 -1D 0/0 0.000000
齊ふ	動詞,一般,*,*,*,*,整える,トトノフ,*,VERB	14<-root
EOS
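
In the lattice part of this output, each line starting with `*` opens a chunk and gives its index followed by the index of the chunk it depends on (suffixed with `D`, with `-1` for the root). As a rough sketch, those links can be pulled out with a regular expression; `chunk_links` is a hypothetical helper written for this illustration, and the header lines are copied from the output above.

```python
import re

# Matches a chunk header such as "* 0 1D 0/0 0.000000":
# group 1 is the chunk index, group 2 the head chunk index.
CHUNK_HEADER = re.compile(r"^\* (\d+) (-?\d+)D ")

def chunk_links(lattice):
    """Map each chunk index to the chunk it depends on (-1 marks the root)."""
    links = {}
    for line in lattice.splitlines():
        m = CHUNK_HEADER.match(line)
        if m:
            links[int(m.group(1))] = int(m.group(2))
    return links

headers = "\n".join([
    "* 0 1D 0/0 0.000000",
    "* 1 2D 0/1 0.000000",
    "* 2 3D 0/1 0.000000",
    "* 3 4D 0/0 0.000000",
    "* 4 8D 0/1 0.000000",
    "* 5 8D 0/0 0.000000",
    "* 6 7D 0/0 0.000000",
    "* 7 8D 0/1 0.000000",
    "* 8 -1D 0/0 0.000000",
])
links = chunk_links(headers)
root_chunks = [chunk for chunk, head in links.items() if head == -1]
print(links[4], root_chunks)
```

Token lines are ignored because they do not match the header pattern, so the full string returned by `toString` could be passed in as well.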

CaboCha.Parser(UniDic) is an alias for unidic2ud.load(UniDic,UDPipe="japanese-gsd"), and its default is "ipadic". CaboCha.Tree.toString(format) has five available formats:

  • CaboCha.FORMAT_TREE: tree (numbered as 0)
  • CaboCha.FORMAT_LATTICE: lattice (numbered as 1)
  • CaboCha.FORMAT_TREE_LATTICE: tree + lattice (numbered as 2)
  • CaboCha.FORMAT_XML: XML (numbered as 3)
  • CaboCha.FORMAT_CONLL: Universal Dependencies CoNLL-U (numbered as 4)

You can simply use udcabocha on the command line:

echo 其國を治めんと欲する者は先づ其家を齊ふ | udcabocha -U qkana -f 2

-U UniDic specifies the UniDic dictionary (default is -U ipadic). -f format specifies the output format as a number from 0 to 4 (default is -f 0).
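
When calling udcabocha from a script, it can help to refer to the formats by name rather than by number. The sketch below only builds the argument list and does not run the command; `FORMATS` and `udcabocha_command` are hypothetical names introduced here, with the numbers taken from the format list above.

```python
import shlex

# Format names and numbers as listed for CaboCha.Tree.toString(format).
FORMATS = {
    "FORMAT_TREE": 0,
    "FORMAT_LATTICE": 1,
    "FORMAT_TREE_LATTICE": 2,
    "FORMAT_XML": 3,
    "FORMAT_CONLL": 4,
}

def udcabocha_command(unidic="ipadic", fmt="FORMAT_TREE"):
    """Build the argument list for udcabocha (defaults mirror -U ipadic -f 0)."""
    if fmt not in FORMATS:
        raise ValueError("unknown format: " + fmt)
    return ["udcabocha", "-U", unidic, "-f", str(FORMATS[fmt])]

cmd = udcabocha_command("qkana", "FORMAT_TREE_LATTICE")
print(shlex.join(cmd))
```

The resulting list could then be passed to `subprocess.run(cmd, input=text, capture_output=True, text=True)`, assuming udcabocha is on the PATH.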

Usage via spaCy

If you have already installed spaCy 2.1.0 or later, you can use UniDic via the spaCy Language pipeline.

>>> import unidic2ud.spacy
>>> qkana=unidic2ud.spacy.load("qkana")
>>> d=qkana("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(type(d))
<class 'spacy.tokens.doc.Doc'>
>>> print(unidic2ud.spacy.to_conllu(d))
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=d[6]
>>> print(t.i+1,t.orth_,t.lemma_,t.pos_,t.tag_,t.head.i+1,t.dep_,t.whitespace_,t.norm_)
7 欲する 欲する VERB 動詞-一般 8 acl  ホッスル
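
The attributes printed above line up with CoNLL-U columns: `t.i+1` is the ID, `t.head.i+1` the HEAD, and `t.norm_` carries the value shown as Translit. The sketch below shows one plausible way to rebuild a CoNLL-U row from them. It is not the implementation of `unidic2ud.spacy.to_conllu`; `token_to_conllu_row` is a hypothetical helper, the root handling relies on the spaCy convention that a root token is its own head, and a `SimpleNamespace` stands in for a real spaCy token so that the example runs without spaCy.

```python
from types import SimpleNamespace

def token_to_conllu_row(t):
    """Build one tab-separated CoNLL-U row from spaCy-style token attributes."""
    # Assumption: a root token is its own head, which maps to HEAD 0.
    is_root = t.head.i == t.i
    head = 0 if is_root else t.head.i + 1
    deprel = "root" if is_root else t.dep_
    misc = []
    if not t.whitespace_:
        misc.append("SpaceAfter=No")
    if t.norm_:
        misc.append("Translit=" + t.norm_)
    return "\t".join([
        str(t.i + 1), t.orth_, t.lemma_, t.pos_, t.tag_,
        "_", str(head), deprel, "_", "|".join(misc) or "_",
    ])

# Stand-in for d[6]; the values are those printed above.
token = SimpleNamespace(
    i=6, orth_="欲する", lemma_="欲する", pos_="VERB", tag_="動詞-一般",
    dep_="acl", whitespace_="", norm_="ホッスル", head=SimpleNamespace(i=7),
)
print(token_to_conllu_row(token))
```

With a real `Doc`, the same function could be applied to each token in `d`.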

Installation for Linux

A binary wheel is available for Linux, and is installed by default when you use pip:

pip install unidic2ud

In the default installation, UniDic and UDPipe are invoked through Web APIs. To run them locally, which is faster, download the UniDic dictionary and UDPipe model you need as follows:

python -m unidic2ud download.unidic qkana
python -m unidic2ud download.udpipe japanese-gsd
python -m unidic2ud dictlist

Licenses of dictionaries and models are: GPL/LGPL/BSD for gendai and spoken; CC BY-SA 4.0 for japanese-gsd; CC BY-NC-SA 4.0 for others.

Installation for Cygwin64

To install in Cygwin64, first install the gcc-g++, git, python37-pip, python37-devel, and swig packages, and then run:

pip3.7 install git+https://github.com/KoichiYasuoka/mecab-cygwin64
pip3.7 install unidic2ud

Use the python3.7 command in Cygwin64 instead of python (even for downloading dictionaries). To install in old Cygwin (32-bit), use mecab-cygwin32 instead of mecab-cygwin64.

Author

Koichi Yasuoka (安岡孝一)

References

  • 安岡孝一: 漢日英Universal Dependencies平行コーパスとその差異, 人文科学とコンピュータシンポジウム「じんもんこん2019」論文集 (2019年12月).
  • Koichi Yasuoka: Universal Dependencies Parallel Corpora on Classical Chinese, Modern Japanese, and Modern English. Jinmoncom 2019: IPSJ Symposium Series, Vol.2019 (December 2019).
