jagger-python
Python binding for Jagger (a C++ implementation of a pattern-based Japanese morphological analyzer): https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/index.en.html
Install
$ python -m pip install jagger
This does not install model files.
You can download a precompiled KWDLC model from https://github.com/lighttransport/jagger-python/releases/download/v0.1.0/model_kwdlc.tar.gz (Note that KWDLC has an unclear license and terms of use. Use it at your own risk.)
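Fetching and unpacking the archive can also be scripted. The following is a minimal sketch using only the standard library. The helper names `extract_model` and `download_model` are hypothetical and not part of jagger. The layout inside the archive is an assumption, so list the extracted names to find the `patterns` file.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL of the precompiled KWDLC model, as given in the text above.
MODEL_URL = (
    "https://github.com/lighttransport/jagger-python/"
    "releases/download/v0.1.0/model_kwdlc.tar.gz"
)


def extract_model(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def download_model(url=MODEL_URL, dest_dir="."):
    """Download the archive into dest_dir, then extract it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / "model_kwdlc.tar.gz"
    urllib.request.urlretrieve(url, str(archive_path))
    return extract_model(archive_path, dest)
```

Calling `download_model()` once should leave the model files under the current directory. Pass the path of the extracted `patterns` file to `load_model`.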
Example
import jagger
model_path = "model/kwdlc/patterns"
tokenizer = jagger.Jagger()
tokenizer.load_model(model_path)
text = "吾輩は猫である。名前はまだない。"
toks = tokenizer.tokenize(text)
for tok in toks:
    print(tok.surface(), tok.feature())
print("EOL")
"""
吾輩 名詞,普通名詞,*,*,吾輩,わがはい,代表表記:我が輩/わがはい カテゴリ:人
は 助詞,副助詞,*,*,は,は,*
猫 名詞,普通名詞,*,*,猫,ねこ,*
である 判定詞,*,判定詞,デアル列基本形,だ,である,*
。 特殊,句点,*,*,。,。,*
名前 名詞,普通名詞,*,*,名前,なまえ,*
は 助詞,副助詞,*,*,は,は,*
まだ 副詞,*,*,*,まだ,まだ,*
ない 形容詞,*,イ形容詞アウオ段,基本形,ない,ない,*
。 特殊,句点,*,*,。,。,*
"""
# print tags
for tok in toks:
    # print each tag (feature() split by comma)
    print(tok.surface())
    for i in range(tok.n_tags()):
        print("  tag[{}] = {}".format(i, tok.tag(i)))
print("EOL")
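If you prefer named fields over positional tags, the feature string can be turned into a dict. This is a small sketch that works on plain strings. The field labels are illustrative names chosen here for the seven-column KWDLC output shown above. They are not part of the jagger API, and other models may use a different column layout.

```python
# Illustrative labels for the seven comma-separated columns in the sample output.
# These names are assumptions made for this sketch, not jagger identifiers.
FIELD_NAMES = (
    "pos", "pos_detail", "conj_type", "conj_form",
    "base_form", "reading", "extra",
)


def parse_feature(feature):
    """Split a feature string on commas and label each column.

    Columns beyond the known labels are ignored. Quoted commas are not
    handled; this is a plain str.split.
    """
    values = feature.split(",")
    return dict(zip(FIELD_NAMES, values))


info = parse_feature("猫 名詞,普通名詞,*,*,猫,ねこ,*".split(" ", 1)[1])
print(info["pos"], info["reading"])
```

With a loaded tokenizer, the same helper can be applied as `parse_feature(tok.feature())`.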
Batch processing (experimental)
tokenize_batch tokenizes multiple lines (delimited by a newline: '\n', '\r', or '\r\n') at once.
Line splitting is done on the C++ side.
import jagger
model_path = "model/kwdlc/patterns"
tokenizer = jagger.Jagger()
tokenizer.load_model(model_path)
text = """
吾輩は猫である。
名前はまだない。
明日の天気は晴れです。
"""
# optional: set the number of C++ threads (CPU cores) to use
# default: use all CPU cores
# tokenizer.set_threads(4)
toks_list = tokenizer.tokenize_batch(text)
for toks in toks_list:
    for tok in toks:
        print(tok.surface(), tok.feature())
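To see which lines a batch input contains, the delimiter rule can be approximated in pure Python. This sketch is not the C++ implementation. The helper name `split_lines` is hypothetical, and dropping empty lines is an assumption made here, since the batch API's handling of blank lines is not described above.

```python
import re

# Match '\r\n' first so a Windows line ending is not counted as two breaks.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    # Split on '\r\n', '\r', or '\n' and drop empty lines (an assumption).
    return [line for line in _NEWLINE.split(text) if line]


text = """
吾輩は猫である。
名前はまだない。
明日の天気は晴れです。
"""
print(split_lines(text))
```

The same helper gives a single-threaded fallback, `[tokenizer.tokenize(line) for line in split_lines(text)]`, which can be handy for comparing results against `tokenize_batch`.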
Train a model
A Python interface for training a model is not provided yet.
For now, you can build the C++ trainer CLI using CMake (Windows is supported).
See train/ for details.
Limitation
A single-line string must be less than 262,144 bytes (about 87,000 Japanese characters in UTF-8).
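The limit is counted in bytes, not characters, so it is worth checking the encoded length before tokenizing very long lines. The sketch below is one way to do that in plain Python. The helper names are hypothetical, and the greedy splitter cuts at arbitrary character positions, which can split a word. Cutting at sentence boundaries such as "。" would usually be a better choice.

```python
MAX_LINE_BYTES = 262_144  # limit quoted in the text above


def utf8_len(line):
    """Return the size of the string in bytes when encoded as UTF-8."""
    return len(line.encode("utf-8"))


def fits_limit(line, max_bytes=MAX_LINE_BYTES):
    """True if the line is strictly below the byte limit."""
    return utf8_len(line) < max_bytes


def split_long_line(line, max_bytes=MAX_LINE_BYTES):
    """Greedily cut a line into chunks that each stay below max_bytes."""
    chunks, current, size = [], [], 0
    for ch in line:
        n = len(ch.encode("utf-8"))
        # Start a new chunk before this character would reach the limit.
        if current and size + n >= max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Most Japanese characters take 3 bytes in UTF-8, which is where the figure of roughly 87,000 characters comes from (262,144 / 3 ≈ 87,381).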
Jagger version
The Jagger version used in this Python binding is 2023-02-18.
For developers
Set dev_mode=True to enable an ASan + debug build.
Run the Python script with
$ LD_PRELOAD=$(gcc -print-file-name=libasan.so) python FILE.py
or
$ LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) python FILE.py
Releasing
The version is created automatically using setuptools_scm.
- Tag it: git tag vX.Y.Z
- Push the tag: git push --tags
TODO
- Provide a model file trained from Wikipedia, UniDic, etc. (with a clearer and more permissive license and terms of use).
- Use GiNZA for morphological analysis.
- Split the feature vector (CSV) with quote characters taken into account when extracting tags.
  - e.g. 'a,b,"c,d",e' => ["a", "b", "c,d", "e"]
- Optimize the C++ <-> Python interface.
  - string_view (or a read-only string literal) for the tag str.
- pickle support (for exchanging Python objects when using multiprocessing).
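Until quote-aware splitting lands in the binding, the behaviour described in the CSV item can be reproduced on the Python side. This is a sketch using the standard csv module, not the binding's C++ code. The helper name `split_feature` is hypothetical.

```python
import csv


def split_feature(feature):
    """Split one CSV record, keeping commas that sit inside double quotes."""
    return next(csv.reader([feature]))


print(split_feature('a,b,"c,d",e'))
```

With a loaded tokenizer this could be applied as `split_feature(tok.feature())` in place of the positional `tok.tag(i)` calls.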
License
The Python binding is available under the 2-clause BSD license.
Jagger and ccedar_core.h are licensed under a GPLv2/LGPLv2.1/BSD triple license.
Third-party licenses
- stack_container.h: BSD-like license.
- nanocsv.h: MIT license.