c++ mosestokenizer
Project description
opus-fast-mosestokenizer is a fork of fast-mosestokenizer, created to keep the package compatible with current Python environments.
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is a favourite among NLP researchers.
The main reason to use this package over the original Perl implementation is portability: with the C++ source code, you can use this library from practically any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
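To check speed ratios like these on your own data, a small timing harness is usually enough. The sketch below is not the project's benchmark (see ./bench/README.md for that). The helper name `lines_per_second` and its parameters are hypothetical, and it accepts any callable that maps a string to a list of tokens.

```python
import time
from typing import Callable, Iterable, List


def lines_per_second(tokenize: Callable[[str], List[str]],
                     lines: Iterable[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs."""
    data = list(lines)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in data:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(data) / max(best, 1e-9)
```

To compare two tokenizers, pass each one's tokenize callable (for example `MosesTokenizer('en').tokenize`) with the same list of lines and divide the two results.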
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install opus-fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Use shell redirection (or pipes) to set the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
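The same line-by-line behaviour as the command-line tool can be approximated from Python. This is a minimal sketch, assuming the tool writes the tokens of each line separated by single spaces, as tokenizer.perl does. The names `tokenize_stream` and `main` are hypothetical helpers; only `MosesTokenizer` and `tokenize` come from the usage shown above.

```python
import sys
from typing import Callable, Iterable, Iterator, List


def tokenize_stream(lines: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> Iterator[str]:
    """Yield one space-joined line of tokens per input line."""
    for line in lines:
        # Drop the trailing newline so it is not passed to the tokenizer.
        yield " ".join(tokenize(line.rstrip("\n")))


def main() -> None:
    # Imported here so tokenize_stream stays usable without the package installed.
    from mosestokenizer import MosesTokenizer

    tokenizer = MosesTokenizer("en")
    for out in tokenize_stream(sys.stdin, tokenizer.tokenize):
        print(out)
```

Calling `main()` from a script and running it as `python script.py < infile > outfile` would mirror the command-line usage. Passing the tokenizer as a callable keeps the helper easy to test with a stand-in such as `str.split`.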
Download files
Hashes for opus-fast-mosestokenizer-0.0.8.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | e4b653fb22334af30649a7a3c76792fafad97c8f316ba9a6a53f6c2f0707e720
MD5 | c1f41dcd297a19f3781b95fd5ea23343
BLAKE2b-256 | d70b55cff028446f61813fc2e5d3d50c7ba7bf402c25a16899e547e741198ece
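A downloaded file can be checked against the SHA256 digests listed here using only Python's standard library. The following is a minimal sketch. The function names `sha256_of` and `matches_sha256` are hypothetical, and the file is assumed to already be on local disk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the check would be `matches_sha256("opus-fast-mosestokenizer-0.0.8.5.tar.gz", "e4b653fb22334af30649a7a3c76792fafad97c8f316ba9a6a53f6c2f0707e720")`.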
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp311-cp311-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | fa693333e98cb6277eb702be36b1e93a74001e2a032985445a320c502773811d
MD5 | 92a5b8f93f78d6d14631d9de5715ac19
BLAKE2b-256 | dc45ba44fd4284d2f464ba1eab43850b18e070f912f13943ef1d88514d972738
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp311-cp311-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 6e9e67dbc63a1c38af981f2cdb24b85e18cb1880b61e3ec57a084c8b5ab6b8d9
MD5 | eac3ad4fda0fd4dc0f5a9799d753a4bb
BLAKE2b-256 | 42b2770073898e2ffe2d2be4656f756e3116278b76ee731fbbc40a4111313baa
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp310-cp310-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 8eeac5d174abe6e32c6f3b3830369aafa22c27b9008aa10d5611cbe865386025
MD5 | 7e6b57f68b3f5a93ab67e2c692a1a166
BLAKE2b-256 | e84ab348e3036ae5d4d4b8d43308b929993b741563d0c226bee172c77d2a9324
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp310-cp310-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | d1e20514411d70abab57e30102fccac1df4e1b26016c21d6d2adcf5a14f9d97d
MD5 | 9c818deb5a89149e39a848b284f43d39
BLAKE2b-256 | 0821821370214c4d05119b1083d2d31678dc3edb61f4b9a8002926908f81e66d
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp39-cp39-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 0b19bb8e01c31c62814d9a9aef2bea41ad6de8b23cfde7279bc0bc6bc6fab5c0
MD5 | 43ec50d8a4760d7d66d316b9d1b9b629
BLAKE2b-256 | 2135a87a7c4e7ac24a64c95eaa64cd85a93e309e11471399bd08470f7ae2bbc5
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp39-cp39-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | c563a02c5f753d2f7f44ebd6ddfc3a6c368b538568567629e5b8e876c2967a9d
MD5 | c32b31ff7d920e93a108b8171f43ad5a
BLAKE2b-256 | 7ed075ab81e766d352329a0b6380e59cbfa70c992c3cfcb8fe363cd60d591cc6
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 7d3c7df00c09a362fdfacfb15cbff83e8d039b6aa6b129824610a75046430e86
MD5 | 581a756a41eba756916625d5a2942577
BLAKE2b-256 | c201c34a8ce2c70f88d399a0d96147ea4b6b639c32d8b590fd16f40efa7906d4
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp38-cp38-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | d22e5a684a4f7b50ea3c5f88d80bf0a628a446df67d307be74d72f8c668fc37e
MD5 | 510947e2a03c6fcf336ed0611aeeca7f
BLAKE2b-256 | 5fd1d0d34b3895ee39e88f3a8cce5ce72b18cc9e37bec7bca17a2eec7a7e0ffa
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | a4f02e53a508a95d82ac868c03647c4472208074c2b6a308b08266caa2d085d3
MD5 | 110e75cca318896eb88fe68461837c8a
BLAKE2b-256 | 1f0e52e241860f14593360130bbfcc78f27c5065d3d42ed126b4b8fb6d019e6e
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp37-cp37m-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | c82fc1cabe4ebe6fa59dae115ec5f55fec9099f460eb302cb9c22e3a88ccdd01
MD5 | 52a7acc626270e7e15dabbb96e7202f9
BLAKE2b-256 | 9009e40aa8515aec471841f4324403a6dae46da47611c3220b593df8be7b5a58
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2fc28ddfecf00f80022186f6f703ccb76c0c794c5e39248758d0b1bcffd30c84
MD5 | 4de2bb39da88b949e2a1ae0d9de97623
BLAKE2b-256 | b34077c8aa9872c9e253e2d565c37ed76f26fa81be58160ee11cc3436393c7ae
Hashes for opus_fast_mosestokenizer-0.0.8.5-cp36-cp36m-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | db1a7ba69c65f08a30e3e4c8e5bad59a7a56b6023c29a5600be0e056cfcd25a6
MD5 | c3e40796afdca6f4458b2ed2d0ffa075
BLAKE2b-256 | de942f5cac07e2cfc422f326683ff1107b2c39af128c4a23fca0ebcbfe6be6ba