UniDic2UD + COMBO-pytorch wrapper for spaCy
UniDic-COMBO
Basic Usage
>>> import unidic_combo
>>> nlp=unidic_combo.load("kindai")
>>> doc=nlp("澤山居つた兄弟が一疋も見えぬ")
>>> print(unidic_combo.to_conllu(doc))
# text = 澤山居つた兄弟が一疋も見えぬ
1 澤山 沢山 ADV 副詞 _ 2 advmod _ SpaceAfter=No|Translit=タクサン
2 居つ 居る VERB 動詞-非自立可能 _ 4 acl _ SpaceAfter=No|Translit=オッ
3 た た AUX 助動詞 _ 2 aux _ SpaceAfter=No|Translit=タ
4 兄弟 兄弟 NOUN 名詞-普通名詞-一般 _ 9 nsubj _ SpaceAfter=No|Translit=キョウダイ
5 が が ADP 助詞-格助詞 _ 4 case _ SpaceAfter=No|Translit=ガ
6 一 一 NUM 名詞-数詞 _ 7 nummod _ SpaceAfter=No|Translit=イチ
7 疋 匹 NOUN 接尾辞-名詞的-助数詞 _ 9 obl _ SpaceAfter=No|Translit=ピキ
8 も も ADP 助詞-係助詞 _ 7 case _ SpaceAfter=No|Translit=モ
9 見え 見える VERB 動詞-一般 _ 0 root _ SpaceAfter=No|Translit=ミエ
10 ぬ ず AUX 助動詞 _ 9 aux _ SpaceAfter=No|Translit=ヌ
>>> import deplacy
>>> deplacy.render(doc,Japanese=True)
澤山 ADV  <══╗     advmod(連用修飾語)
居つ VERB ═╗═╝<╗   acl(連体修飾節)
た   AUX  <╝   ║   aux(動詞補助成分)
兄弟 NOUN ═╗═══╝<╗ nsubj(主語)
が   ADP  <╝     ║ case(格表示)
一   NUM  <╗     ║ nummod(数量による修飾語)
疋   NOUN ═╝═╗<╗ ║ obl(斜格補語)
も   ADP  <══╝ ║ ║ case(格表示)
見え VERB ═╗═══╝═╝ ROOT(親)
ぬ   AUX  <╝       aux(動詞補助成分)
>>> from deplacy.deprelja import deprelja
>>> for b in unidic_combo.bunsetu_spans(doc):
...   for t in b.lefts:
...     print(unidic_combo.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
澤山 -> 居つた (連用修飾語)
居つた -> 兄弟が (連体修飾節)
兄弟が -> 見えぬ (主語)
一疋も -> 見えぬ (斜格補語)
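The CoNLL-U text printed above can also be post-processed without spaCy. The following is a minimal sketch using only the standard library; `parse_conllu` and `dependents` are hypothetical helper names, not part of unidic_combo, and the sketch assumes the columns are tab-separated as the CoNLL-U format specifies. The rows are copied from the output above.

```python
# Minimal CoNLL-U reader (standard library only). A sketch, not part of unidic_combo.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

def parse_conllu(text):
    """Return a list of token dicts for one CoNLL-U sentence."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        token = dict(zip(FIELDS, columns))
        if not token["id"].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens

def dependents(tokens, head_id):
    """Return the surface forms of all tokens attached to head_id."""
    return [t["form"] for t in tokens if t["head"] == head_id]

# Rows copied from the Basic Usage output, rejoined with tabs.
ROWS = [
    ("1", "澤山", "沢山", "ADV", "副詞", "_", "2", "advmod", "_", "SpaceAfter=No|Translit=タクサン"),
    ("2", "居つ", "居る", "VERB", "動詞-非自立可能", "_", "4", "acl", "_", "SpaceAfter=No|Translit=オッ"),
    ("3", "た", "た", "AUX", "助動詞", "_", "2", "aux", "_", "SpaceAfter=No|Translit=タ"),
    ("4", "兄弟", "兄弟", "NOUN", "名詞-普通名詞-一般", "_", "9", "nsubj", "_", "SpaceAfter=No|Translit=キョウダイ"),
    ("5", "が", "が", "ADP", "助詞-格助詞", "_", "4", "case", "_", "SpaceAfter=No|Translit=ガ"),
    ("6", "一", "一", "NUM", "名詞-数詞", "_", "7", "nummod", "_", "SpaceAfter=No|Translit=イチ"),
    ("7", "疋", "匹", "NOUN", "接尾辞-名詞的-助数詞", "_", "9", "obl", "_", "SpaceAfter=No|Translit=ピキ"),
    ("8", "も", "も", "ADP", "助詞-係助詞", "_", "7", "case", "_", "SpaceAfter=No|Translit=モ"),
    ("9", "見え", "見える", "VERB", "動詞-一般", "_", "0", "root", "_", "SpaceAfter=No|Translit=ミエ"),
    ("10", "ぬ", "ず", "AUX", "助動詞", "_", "9", "aux", "_", "SpaceAfter=No|Translit=ヌ"),
]
sample = "# text = 澤山居つた兄弟が一疋も見えぬ\n" + "\n".join("\t".join(r) for r in ROWS)

tokens = parse_conllu(sample)
root = next(t for t in tokens if t["head"] == 0)
print(root["form"], dependents(tokens, root["id"]))
```

With the package installed, the same helper could presumably be fed `unidic_combo.to_conllu(doc)` in place of `sample`.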
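unidic_combo.load(UniDic,BERT=True)
loads a spaCy Language pipeline for UniDic2UD + COMBO-pytorch. Available UniDic options are:
- UniDic="gendai": Use 現代書き言葉UniDic (contemporary written Japanese).
- UniDic="spoken": Use 現代話し言葉UniDic (contemporary spoken Japanese).
- UniDic="novel": Use 近現代口語小説UniDic (modern and contemporary colloquial novels).
- UniDic="qkana": Use 旧仮名口語UniDic (colloquial Japanese in historical kana orthography).
- UniDic="kindai": Use 近代文語UniDic (modern literary Japanese).
- UniDic="kinsei": Use 近世江戸口語UniDic (early modern Edo colloquial Japanese).
- UniDic="kyogen": Use 中世口語UniDic (medieval colloquial Japanese).
- UniDic="wakan": Use 中世文語UniDic (medieval literary Japanese).
- UniDic="wabun": Use 中古和文UniDic (Early Middle Japanese).
- UniDic="manyo": Use 上代語UniDic (Old Japanese).
- UniDic=None: Use unidic-lite (default).
The BERT=True/BERT=False option enables or disables the use of bert-base-japanese-whole-word-masking.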
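Because the dictionary is selected by a plain string, a typo in the option name may only surface once loading starts. The sketch below validates the name first; `check_unidic_option` and `load_pipeline` are hypothetical wrappers introduced here for illustration, and only `unidic_combo.load(UniDic, BERT=...)` comes from the description above.

```python
# Option names copied from the list above. None selects unidic-lite.
UNIDIC_OPTIONS = ("gendai", "spoken", "novel", "qkana", "kindai",
                  "kinsei", "kyogen", "wakan", "wabun", "manyo")

def check_unidic_option(name):
    """Hypothetical helper: return name if it is a documented UniDic option."""
    if name is None or name in UNIDIC_OPTIONS:
        return name
    raise ValueError(
        f"unknown UniDic option {name!r}; expected None or one of {UNIDIC_OPTIONS}"
    )

def load_pipeline(name=None, bert=True):
    """Hypothetical wrapper: validate the option, then call unidic_combo.load."""
    checked = check_unidic_option(name)
    # Imported lazily so the validation above works without the package installed.
    import unidic_combo
    return unidic_combo.load(checked, BERT=bert)
```

For example, `load_pipeline("kindai", bert=False)` would request 近代文語UniDic without the BERT model, while `load_pipeline("kindia")` raises `ValueError` before anything is loaded.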
Installation for Linux
pip3 install unidic_combo
Installation for Cygwin64
Make sure to get the python37-devel, python37-pip, python37-cython, python37-numpy, python37-cffi, gcc-g++, mingw64-x86_64-gcc-g++, gcc-fortran, git, curl, make, cmake, libopenblas, liblapack-devel, libhdf5-devel, libfreetype-devel, and libuv-devel packages, and then:
curl -L https://raw.githubusercontent.com/KoichiYasuoka/UniDic-COMBO/master/cygwin64.sh | sh
Installation for macOS
g++ --version
pip3 install unidic_combo --user
python3 -m spacy download en_core_web_sm --user
If Jsonnet fails to install, try the following before installing UniDic-COMBO:
( echo '#! /bin/sh' ; echo 'exec gcc `echo $* | sed "s/-arch [^ ]*//g"`' ) > /tmp/clang
chmod 755 /tmp/clang
env PATH="/tmp:$PATH" pip3 install jsonnet --user
If fugashi fails to install, try installing MeCab before installing UniDic-COMBO:
cd /tmp
git clone --depth=1 https://github.com/taku910/mecab
cd mecab/mecab
./configure --with-charset=UTF8
make && sudo make install
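After any of the installation routes above, a quick check that the modules are importable can save a failed first run. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module names are those that appear on this page (`deplacy` is listed only because the Basic Usage session imports it).

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Module names taken from this page, not from the package metadata.
REQUIRED = ("unidic_combo", "spacy", "deplacy")

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("not installed:", ", ".join(missing))
    else:
        print("all modules found")
```

The check only confirms that the modules can be located. It does not load any dictionary or model.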
Benchmarks
Results of 舞姬/雪國/荒野より-Benchmarks
舞姬 | LAS | MLAS | BLEX |
---|---|---|---|
UniDic="kindai" | 84.91 | 77.78 | 85.19 |
UniDic="qkana" | 83.02 | 77.78 | 85.19 |
UniDic="kinsei" | 75.93 | 67.86 | 71.43 |
雪國 | LAS | MLAS | BLEX |
---|---|---|---|
UniDic="qkana" | 87.50 | 82.35 | 78.43 |
UniDic="kindai" | 83.19 | 78.43 | 74.51 |
UniDic="kinsei" | 78.57 | 73.08 | 69.23 |
荒野より | LAS | MLAS | BLEX |
---|---|---|---|
UniDic="kindai" | 78.53 | 59.46 | 59.46 |
UniDic="qkana" | 77.49 | 59.46 | 59.46 |
UniDic="kinsei" | 76.04 | 59.46 | 59.46 |
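As a reading aid for the LAS column, the sketch below computes a simplified labeled attachment score: the percentage of tokens whose head and dependency label both match the gold annotation. It assumes gold and predicted analyses share the same tokenization. The CoNLL 2018 shared task evaluation instead computes F1 over aligned words, and MLAS and BLEX additionally involve tags, features, and lemmas, so this is not the procedure behind the tables above. The function name `las` and the "predicted" analysis are made up for illustration.

```python
def las(gold, predicted):
    """Simplified LAS: percentage of tokens with matching (head, deprel).

    gold and predicted are lists of (head, deprel) pairs, one per token,
    and must describe the same tokenization.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

# (head, deprel) pairs of the Basic Usage sentence, treated here as gold.
gold = [(2, "advmod"), (4, "acl"), (2, "aux"), (9, "nsubj"), (4, "case"),
        (7, "nummod"), (9, "obl"), (7, "case"), (0, "root"), (9, "aux")]

# Made-up prediction with one wrong label on token 7 (obl -> nmod).
predicted = list(gold)
predicted[6] = (9, "nmod")

score = las(gold, predicted)
print(f"{score:.2f}")
```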
Reference
- 安岡孝一: TransformersのBERTは共通テスト『国語』を係り受け解析する夢を見るか, 東洋学へのコンピュータ利用, 第33回研究セミナー (2021年3月5日), pp.3-34.