c++ mosestokenizer
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. Because the source is C++, the library can be used from almost any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
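To check the speed-up on your own data, a small timing harness is enough. The sketch below is not the project's benchmark script (that lives under ./bench); `lines_per_second` is a hypothetical helper that accepts any callable taking one string, so the same function can time each tokenizer in turn. `str.split` is used only as a stand-in so the example runs without extra packages.

```python
import time


def lines_per_second(tokenize, lines, repeats=3):
    """Return the best observed throughput of `tokenize` in lines per second.

    `tokenize` is any callable that accepts a single string, for example
    the bound `tokenize` method of a tokenizer object.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a timer too coarse to measure a very short run.
    return len(lines) / best if best > 0 else float("inf")


if __name__ == "__main__":
    sample = ["Hello, world! This is a test sentence."] * 10_000
    # str.split is a placeholder; pass your tokenizer's tokenize method instead.
    print(f"{lines_per_second(str.split, sample):,.0f} lines/s")
```

Taking the best of several repeats reduces noise from other processes. Results depend on the machine and the text, so they may differ from the figures quoted above.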
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Use shell redirection or pipes to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
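Because the command-line tool reads standard input and writes standard output, it can also be driven from a script. The following is a minimal sketch using Python's standard `subprocess` module. `tokenize_via_cli` is a hypothetical helper, not part of this package, and it assumes the `mosestokenizer` executable is on your PATH.

```python
import subprocess


def tokenize_via_cli(text, command=("mosestokenizer", "en")):
    """Send `text` to `command` on stdin and return its stdout as a list of lines.

    The default command mirrors `mosestokenizer en < infile > outfile`.
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Requires the mosestokenizer executable to be installed.
    for line in tokenize_via_cli("Hello, world!\nHow are you?\n"):
        print(line)
```

The `command` parameter is left configurable so that any options listed by `mosestokenizer -h` can be added without changing the helper.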
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
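For a whole corpus, the same `tokenize` call can be applied one line at a time. The sketch below assumes only what the example above shows, namely an object with a `tokenize(str)` method that returns a list of strings. `tokenize_stream` and `WhitespaceTokenizer` are hypothetical names introduced for illustration, and joining tokens with single spaces is an assumption about the output format you want.

```python
import io


def tokenize_stream(tokenizer, infile, outfile):
    """Tokenize `infile` line by line and write space-joined tokens to `outfile`.

    Returns the number of lines processed.
    """
    count = 0
    for line in infile:
        tokens = tokenizer.tokenize(line.rstrip("\n"))
        outfile.write(" ".join(tokens) + "\n")
        count += 1
    return count


class WhitespaceTokenizer:
    """Stand-in with the same tokenize(str) -> list shape; it is not Moses."""

    def tokenize(self, text):
        return text.split()


if __name__ == "__main__":
    # With the package installed, replace the stand-in with:
    #     from mosestokenizer import MosesTokenizer
    #     tokenizer = MosesTokenizer('en')
    tokenizer = WhitespaceTokenizer()
    src = io.StringIO("Hello world\nSecond line here\n")
    dst = io.StringIO()
    tokenize_stream(tokenizer, src, dst)
    print(dst.getvalue(), end="")
```

Passing file objects rather than paths lets the same function work with open files, `sys.stdin` and `sys.stdout`, or in-memory buffers.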