jagger-python

Python binding for Jagger, a C++ implementation of a pattern-based Japanese morphological analyzer: https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/index.en.html

Install

$ python -m pip install jagger

This does not install model files.

You can download a precompiled KWDLC model from https://github.com/lighttransport/jagger-python/releases/download/v0.1.0/model_kwdlc.tar.gz. Note that the license and terms of use of KWDLC are unclear; use it at your own risk.
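The download and extraction steps could be scripted roughly as follows. This is a minimal sketch using only the Python standard library; the helper names `download_model` and `extract_model` are hypothetical, and the layout of the archive (for example, whether it unpacks to `model/kwdlc/`) is an assumption that should be checked after extraction.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the paragraph above.
MODEL_URL = (
    "https://github.com/lighttransport/jagger-python/"
    "releases/download/v0.1.0/model_kwdlc.tar.gz"
)


def download_model(url=MODEL_URL, archive="model_kwdlc.tar.gz"):
    """Download the model archive unless it already exists locally."""
    path = Path(archive)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


def extract_model(archive, dest="."):
    """Extract a .tar.gz archive into dest, rejecting paths that escape dest."""
    dest = Path(dest).resolve()
    with tarfile.open(str(archive), "r:gz") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError("unsafe path in archive: " + member.name)
        tar.extractall(str(dest))
    return dest
```

Calling `extract_model(download_model())` would fetch the archive into the current directory and unpack it there; point `model_path` at wherever the `patterns` file ends up.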

Example

import jagger

model_path = "model/kwdlc/patterns"

tokenizer = jagger.Jagger()
tokenizer.load_model(model_path)

text = "吾輩は猫である。名前はまだない。"
toks = tokenizer.tokenize(text)

for tok in toks:
    print(tok.surface(), tok.feature())
print("EOL")

"""
吾輩    名詞,普通名詞,*,*,吾輩,わがはい,代表表記:我が輩/わがはい カテゴリ:人
は      助詞,副助詞,*,*,は,は,*
猫      名詞,普通名詞,*,*,猫,ねこ,*
である  判定詞,*,判定詞,デアル列基本形,だ,である,*
。      特殊,句点,*,*,。,。,*
名前    名詞,普通名詞,*,*,名前,なまえ,*
は      助詞,副助詞,*,*,は,は,*
まだ    副詞,*,*,*,まだ,まだ,*
ない    形容詞,*,イ形容詞アウオ段,基本形,ない,ない,*
。      特殊,句点,*,*,。,。,*
"""

# Print the tags of each token.
for tok in toks:
    # Tags are the comma-separated fields of feature().
    print(tok.surface())
    for i in range(tok.n_tags()):
        print("  tag[{}] = {}".format(i, tok.tag(i)))
print("EOL")
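Building on the example above, token lists can be turned into more convenient shapes with a few small helpers. This is only a sketch: `pos_of`, `to_wakati`, and `surface_pos_pairs` are hypothetical names, they rely only on the `surface()` and `feature()` methods shown above, and they assume the first comma-separated field of `feature()` is the part of speech, as in the sample output.

```python
def pos_of(tok):
    """Return the first comma-separated field of feature() (assumed to be the POS)."""
    return tok.feature().split(",")[0]


def to_wakati(toks):
    """Join token surfaces with spaces (space-separated segmentation)."""
    return " ".join(tok.surface() for tok in toks)


def surface_pos_pairs(toks):
    """Return a list of (surface, part-of-speech) tuples."""
    return [(tok.surface(), pos_of(tok)) for tok in toks]
```

With the `toks` from the example, `to_wakati(toks)` would give the sentence as space-separated surfaces, and `surface_pos_pairs(toks)` pairs each surface with its first feature field.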

Batch processing (experimental)

tokenize_batch tokenizes multiple lines at once. Lines are delimited by a newline ('\n', '\r', or '\r\n'), and the line splitting is done on the C++ side.

import jagger

model_path = "model/kwdlc/patterns"

tokenizer = jagger.Jagger()
tokenizer.load_model(model_path)

text = """
吾輩は猫である。
名前はまだない。
明日の天気は晴れです。
"""

# Optional: set the number of C++ threads (CPU cores) to use.
# Default: use all CPU cores.
# tokenizer.set_threads(4)

toks_list = tokenizer.tokenize_batch(text)

for toks in toks_list:
    for tok in toks:
        print(tok.surface(), tok.feature())

Train a model

A Python interface for training a model is not provided yet. For now, you can build the C++ trainer CLI with CMake (Windows is supported). See train/ for details.

Limitation

A single line must be shorter than 262,144 bytes (about 87,000 Japanese characters in UTF-8, at 3 bytes per character).
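Input can be checked against this limit before tokenizing. The following is a minimal sketch, not part of the binding: `split_lines` and `oversized_lines` are hypothetical helpers, and the line splitting only mirrors the delimiters named in the batch section ('\n', '\r', '\r\n'), which may differ in detail from the C++ implementation.

```python
import re

MAX_LINE_BYTES = 262144  # limit stated in this section


def split_lines(text):
    """Split on '\\r\\n', '\\r', or '\\n' only (unlike str.splitlines)."""
    return re.split(r"\r\n|\r|\n", text)


def oversized_lines(text, limit=MAX_LINE_BYTES):
    """Return (line_index, byte_length) for every line at or above the limit."""
    result = []
    for i, line in enumerate(split_lines(text)):
        n = len(line.encode("utf-8"))
        if n >= limit:
            result.append((i, n))
    return result
```

An empty result from `oversized_lines(text)` means every line fits; otherwise the returned indices identify the lines to shorten or split.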

Jagger version

The Jagger version used in this Python binding is 2023-02-18.

For developers

Set dev_mode=True to enable an ASan (AddressSanitizer) debug build.

Then run your Python script with the ASan runtime preloaded:

$ LD_PRELOAD=$(gcc -print-file-name=libasan.so) python FILE.py

or

$ LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) python FILE.py

TODO

  • Provide a model file trained from Wikipedia, UniDic, etc. (with a clearer, permissive license and terms of use).
    • Use GiNZA for morphological analysis.
  • Split the feature string (CSV) with quote characters taken into account when extracting tags.
    • e.g. 'a,b,"c,d",e' => ["a", "b", "c,d", "e"]
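Until quote-aware splitting is available in the binding, the behavior described in the last item can be approximated in Python with the standard `csv` module. This is a sketch of one possible workaround, not the planned implementation; `split_feature` is a hypothetical helper.

```python
import csv


def split_feature(feature):
    """Split a comma-separated feature string, keeping quoted commas intact."""
    return next(csv.reader([feature]))


print(split_feature('a,b,"c,d",e'))  # ['a', 'b', 'c,d', 'e']
```

It could be applied to the string returned by `tok.feature()` in place of a plain `split(",")`.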

License

The Python binding is available under the 2-clause BSD license.

Jagger and ccedar_core.h are triple-licensed under GPLv2/LGPLv2.1/BSD.

Third-party licenses

  • stack_container.h: BSD-like license.
  • nanocsv.h: MIT license.
