Python wrapper of Vaporetto tokenizer

🐍 python-vaporetto 🛥

Vaporetto is a fast and lightweight tokenizer based on pointwise prediction. This package is a Python wrapper for Vaporetto.

Installation

Install pre-built package from PyPI

Run the following command:

$ pip install vaporetto

Build from source

Before building, install the Rust compiler by following its documentation. python-vaporetto uses pyproject.toml, so you also need pip version 19 or later.

$ pip install --upgrade pip

After setting up the environment, you can install python-vaporetto as follows:

$ pip install git+https://github.com/daac-tools/python-vaporetto

Example Usage

python-vaporetto does not contain model files. To perform tokenization, first follow the Vaporetto documentation to download the distributed models or train your own.

Check the version number as shown below to use compatible models:

>>> import vaporetto
>>> vaporetto.VAPORETTO_VERSION
'0.6.3'
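
If a model is known to target a specific Vaporetto version, the check above can be turned into a guard that fails early. The following is a minimal sketch: the helper name `ensure_model_version` is hypothetical and not part of python-vaporetto, and requiring an exact match is an assumption that may be stricter than Vaporetto's actual compatibility rules.

```python
def ensure_model_version(library_version: str, model_version: str) -> None:
    """Raise RuntimeError unless the two version strings are identical.

    Hypothetical helper: exact equality is an assumed (conservative) rule,
    not a compatibility policy documented by Vaporetto.
    """
    if library_version != model_version:
        raise RuntimeError(
            f"Model targets Vaporetto {model_version}, "
            f"but the installed wrapper bundles {library_version}"
        )


# Matching versions pass silently.
ensure_model_version("0.6.3", "0.6.3")
```

In practice, the first argument would be `vaporetto.VAPORETTO_VERSION` and the second the version your model was trained or distributed for.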

Examples:

# Import vaporetto module
>>> import vaporetto

# Load the model file
>>> with open('tests/data/vaporetto.model', 'rb') as fp:
...     model = fp.read()

# Create a Vaporetto instance
>>> tokenizer = vaporetto.Vaporetto(model, predict_tags = True)

# Tokenize
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'

>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')

>>> len(tokens)
6

>>> tokens[0].surface()
'まぁ'

>>> tokens[0].tag(0)
'名詞'

>>> tokens[0].tag(1)
'マー'

>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
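
When only the string returned by `tokenize_to_string` is available (for example, in a log file), it can be split back into surfaces and tags. The sketch below assumes the layout seen in the sample output above: tokens separated by single spaces and fields separated by `/`. The function name `parse_tagged_string` is hypothetical, and the parsing would break on surfaces that themselves contain a space or a slash.

```python
from typing import List, Tuple


def parse_tagged_string(line: str) -> List[Tuple[str, List[str]]]:
    """Split 'surface/tag1/tag2 ...' into (surface, [tags]) pairs.

    Assumes space-separated tokens and slash-separated fields, as in the
    sample output; this is not an official Vaporetto format parser.
    """
    parsed = []
    for chunk in line.split(" "):
        if not chunk:
            continue  # skip empty pieces caused by repeated spaces
        surface, *tags = chunk.split("/")
        parsed.append((surface, tags))
    return parsed


sample = "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ"
pairs = parse_tagged_string(sample)
surfaces = [surface for surface, _ in pairs]
```

When the tokenizer itself is at hand, calling `tokenize` and reading `surface()` and `tag(i)` from each token avoids these parsing caveats.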

Note for distributed models

The distributed models are compressed in zstd format. If you want to load these compressed models, you must decompress them outside the API.

>>> import vaporetto
>>> import zstandard  # zstandard package in PyPI

>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags = True)

Note for KyTea's models

You can also use KyTea's models as follows:

>>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:  # doctest: +SKIP
...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())

Note: Vaporetto does not support tag prediction with KyTea's models.

License

Licensed under either of

at your option.

Contribution

See the guidelines.
