UD-Kanbun
Tokenizer, POS-Tagger, and Dependency-Parser for Classical Chinese Texts (漢文/文言文), working on Universal Dependencies.
Basic usage
>>> import udkanbun
>>> lzh=udkanbun.load()
>>> s=lzh("不入虎穴不得虎子")
>>> print(s)
# text = 不入虎穴不得虎子
1 不 不 ADV v,副詞,否定,無界 Polarity=Neg 2 advmod _ Gloss=not|SpaceAfter=No
2 入 入 VERB v,動詞,行為,移動 _ 6 advcl _ Gloss=enter|SpaceAfter=No
3 虎 虎 NOUN n,名詞,主体,動物 _ 4 nmod _ Gloss=tiger|SpaceAfter=No
4 穴 穴 NOUN n,名詞,固定物,地形 Case=Loc 2 obj _ Gloss=cave|SpaceAfter=No
5 不 不 ADV v,副詞,否定,無界 Polarity=Neg 6 advmod _ Gloss=not|SpaceAfter=No
6 得 得 VERB v,動詞,行為,得失 _ 0 root _ Gloss=get|SpaceAfter=No
7 虎 虎 NOUN n,名詞,主体,動物 _ 8 nmod _ Gloss=tiger|SpaceAfter=No
8 子 子 NOUN n,名詞,人,関係 _ 6 obj _ Gloss=child|SpaceAfter=No
>>> t=s[1]
>>> print(t.id,t.form,t.lemma,t.upos,t.xpos,t.feats,t.head.id,t.deprel,t.deps,t.misc)
1 不 不 ADV v,副詞,否定,無界 Polarity=Neg 2 advmod _ Gloss=not|SpaceAfter=No
>>> print(s.to_tree())
不 <┐     advmod
入 ─┴─┐<┐ advcl
虎 <┐ │ │ nmod
穴 ─┘<┘ │ obj
不 <┐   │ advmod
得 ─┴─┬─┘ root
虎 <┐ │   nmod
子 ─┘<┘   obj
>>> f=open("trial.svg","w")
>>> f.write(s.to_svg())
>>> f.close()
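The text printed by print(s) follows the ten-column CoNLL-U layout, so it can be post-processed with plain Python. The following is a minimal sketch and not part of UD-Kanbun: parse_conllu, FIELDS, ROWS, and SAMPLE are hypothetical names, and the sample rows are copied from the listing above with their separators converted to tabs, on the assumption that the real output is tab-separated as CoNLL-U prescribes.

```python
# Column names of a CoNLL-U token line, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

def parse_conllu(text):
    """Turn CoNLL-U text into a list of dicts, one per token line."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            continue  # not a well-formed token line
        tokens.append(dict(zip(FIELDS, columns)))
    return tokens

# Rows copied from the listing above; single spaces stand in for tabs here
# (assumption: the actual output uses tabs between columns).
ROWS = [
    "1 不 不 ADV v,副詞,否定,無界 Polarity=Neg 2 advmod _ Gloss=not|SpaceAfter=No",
    "2 入 入 VERB v,動詞,行為,移動 _ 6 advcl _ Gloss=enter|SpaceAfter=No",
    "3 虎 虎 NOUN n,名詞,主体,動物 _ 4 nmod _ Gloss=tiger|SpaceAfter=No",
    "4 穴 穴 NOUN n,名詞,固定物,地形 Case=Loc 2 obj _ Gloss=cave|SpaceAfter=No",
    "5 不 不 ADV v,副詞,否定,無界 Polarity=Neg 6 advmod _ Gloss=not|SpaceAfter=No",
    "6 得 得 VERB v,動詞,行為,得失 _ 0 root _ Gloss=get|SpaceAfter=No",
    "7 虎 虎 NOUN n,名詞,主体,動物 _ 8 nmod _ Gloss=tiger|SpaceAfter=No",
    "8 子 子 NOUN n,名詞,人,関係 _ 6 obj _ Gloss=child|SpaceAfter=No",
]
SAMPLE = "# text = 不入虎穴不得虎子\n" + "\n".join(
    "\t".join(row.split(" ")) for row in ROWS
)

tokens = parse_conllu(SAMPLE)
# The root is the token whose head is "0".
root = next(t for t in tokens if t["head"] == "0")
# Direct dependents of the root, as (form, deprel) pairs.
dependents = [(t["form"], t["deprel"]) for t in tokens if t["head"] == root["id"]]
print(root["form"], dependents)
```

With the package installed, parse_conllu(str(s)) should yield the same kind of list for any parsed sentence s, provided the tab assumption holds.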
udkanbun.load() has only one option, udkanbun.load(MeCab=False). By default, the UD-Kanbun pipeline uses MeCab as the tokenizer and POS-tagger, then UDPipe as the dependency parser. With the option MeCab=False, the pipeline uses UDPipe for every stage of processing.
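To see whether the two configurations analyse a sentence differently, one could parse the same text with both. This is a minimal sketch: parse_both_ways is a hypothetical helper, and it relies only on udkanbun.load(), the MeCab=False option, and the fact that a parsed sentence can be printed (so str() on it is assumed to return the same text).

```python
def parse_both_ways(text):
    """Parse `text` with the default pipeline and with MeCab=False.

    Returns a pair of CoNLL-U strings: (MeCab + UDPipe, UDPipe only).
    """
    # Imported lazily so this sketch can be loaded even where the
    # package is not installed; calling the function still requires it.
    import udkanbun

    default_pipeline = udkanbun.load()             # MeCab, then UDPipe
    udpipe_pipeline = udkanbun.load(MeCab=False)   # UDPipe throughout
    return str(default_pipeline(text)), str(udpipe_pipeline(text))
```

For example, with_mecab, udpipe_only = parse_both_ways("不入虎穴不得虎子") followed by with_mecab == udpipe_only reports whether the two analyses agree for that sentence.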
to_tree() has an option, to_tree(BoxDrawingWidth=2), for old terminals that render Box Drawing characters as "fullwidth".
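The reason such terminals exist can be checked with the standard library: Unicode classifies the Box Drawing characters used in the tree as East Asian "Ambiguous" width, so a terminal may legitimately draw them one or two cells wide. The sketch below only inspects the characters; width_classes is a hypothetical helper and does not call UD-Kanbun.

```python
import unicodedata

def width_classes(chars):
    """Map each character to its East Asian Width class (e.g. 'A', 'W', 'Na')."""
    return {c: unicodedata.east_asian_width(c) for c in chars}

# Box Drawing characters from the tree output, plus one CJK ideograph.
classes = width_classes("─│┐┘┴┬虎")
for char, cls in classes.items():
    print(char, cls)
```

If the tree looks misaligned because the terminal draws the "A" class two cells wide, print(s.to_tree(BoxDrawingWidth=2)) is the option described above.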
You can also run udkanbun directly on the command line:
echo 不入虎穴不得虎子 | udkanbun
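The same command can be driven from a Python script by piping text to its standard input. This is a sketch under assumptions: run_udkanbun_cli is a hypothetical helper, the udkanbun executable is assumed to be on PATH, and UTF-8 is assumed for both input and output, which may not hold on every platform.

```python
import shutil
import subprocess

def run_udkanbun_cli(text):
    """Pipe `text` to the udkanbun command and return its standard output.

    Returns None when the udkanbun executable cannot be found on PATH.
    """
    executable = shutil.which("udkanbun")
    if executable is None:
        return None
    result = subprocess.run(
        [executable],
        input=text + "\n",      # same effect as `echo text | udkanbun`
        capture_output=True,
        encoding="utf-8",       # assumption: the command reads and writes UTF-8
    )
    return result.stdout

output = run_udkanbun_cli("不入虎穴不得虎子")
if output is not None:
    print(output)
```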
Installation for Linux
A binary wheel is available for Linux and is installed by default when you use pip:
pip install udkanbun
Installation for Cygwin64
To install on Cygwin64, first make sure the gcc-g++, git, python37-pip, python37-devel, and swig packages are installed, and then:
pip3.7 install git+https://github.com/KoichiYasuoka/mecab-cygwin64
pip3.7 install udkanbun
Use the python3.7 command in Cygwin64 instead of python. To install on old (32-bit) Cygwin, try mecab-cygwin32 instead of mecab-cygwin64.
Installation for Jupyter Notebook (Google Colaboratory)
!pip install udkanbun
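After any of the installation routes above, a quick check that the package is visible to the running interpreter can save a confusing ImportError later. The sketch below uses only the standard library; is_installed is a hypothetical helper name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("udkanbun"):
    print("udkanbun is available")
else:
    print("udkanbun was not found; check which pip and python were used")
```

On Cygwin64 this is worth running with python3.7, since a package installed with pip3.7 is not necessarily visible to the plain python command.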
Author
Koichi Yasuoka (安岡孝一)
References
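- 安岡孝一 (Koichi Yasuoka): 四書を学んだMeCab+UDPipeはセンター試験の漢文を読めるのか [Can MeCab+UDPipe Trained on the Four Books Read the Classical Chinese of the National Center Test?], 東洋学へのコンピュータ利用 [Computer Applications in Oriental Studies], 第30回研究セミナー [30th Research Seminar] (March 8, 2019), pp.3-110.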
- Koichi Yasuoka: Universal Dependencies Treebank of the Four Books in Classical Chinese, DADH2019: 10th International Conference of Digital Archives and Digital Humanities (December 2019).