c++ mosestokenizer
opus-fast-mosestokenizer is a fork of fast-mosestokenizer, created to keep the package compatible with current Python environments.
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. Because the tokenizer is written in C++, the library can be bound to from most other languages.
The C++ script was adapted from the mosesdecoder repository (contrib/c++tokenizer).
Benchmark
fast-mosestokenizer is also fast.
On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
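To check figures like these on your own data, a rough timing harness can help. The sketch below is not the project's benchmark script (that lives under ./bench); `measure_throughput` is a hypothetical helper that times any callable mapping a string to a list of tokens, so the same function can wrap this package, sacremoses, or another tokenizer.

```python
import time
from typing import Callable, Iterable, List


def measure_throughput(tokenize: Callable[[str], List[str]],
                       lines: Iterable[str],
                       repeats: int = 3) -> float:
    """Return the best observed throughput in lines per second.

    `tokenize` is any callable that maps one line to a list of tokens.
    The fastest of `repeats` passes is reported to reduce timing noise.
    """
    data = list(lines)
    if not data:
        raise ValueError("need at least one line to time")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in data:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very small inputs.
    return len(data) / max(best, 1e-9)


if __name__ == "__main__":
    # str.split stands in for a real tokenizer so the sketch runs anywhere.
    sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
    print(f"{measure_throughput(str.split, sample):.0f} lines/s")
```

Results depend heavily on the corpus, the machine, and whether start-up cost is included, so ratios measured this way may differ from the numbers quoted above.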
Installation
Python users on Linux and macOS (>= 10.15) can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
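The same command can also be driven from a script. The sketch below pipes text through an external command with Python's `subprocess` module. `run_line_filter` is a hypothetical helper, and its default command simply mirrors the `mosestokenizer en` invocation above, assuming the binary is on PATH and reads stdin and writes stdout as shown.

```python
import subprocess
from typing import List, Sequence


def run_line_filter(lines: Sequence[str],
                    cmd: Sequence[str] = ("mosestokenizer", "en")) -> List[str]:
    """Send `lines` to `cmd` on stdin and return its stdout split into lines.

    Each input string is assumed to contain no embedded newline.
    """
    payload = "\n".join(lines) + "\n"
    result = subprocess.run(
        list(cmd),
        input=payload,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()
```

Starting one process per call is expensive, so for large inputs it is usually better to pass many lines in a single call than to invoke the helper once per line.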
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
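For corpus preprocessing, a tokenizer is usually applied line by line and the tokens are joined with single spaces, which is the conventional Moses output format. The sketch below assumes only the `tokenize` method shown above, taking a string and returning a list of strings. `tokenize_lines` and `SupportsTokenize` are hypothetical names, and the package import is left in a comment so the block runs without the package installed.

```python
from typing import Iterable, Iterator, List, Protocol


class SupportsTokenize(Protocol):
    """Anything with a tokenize(str) -> list[str] method."""

    def tokenize(self, text: str) -> List[str]: ...


def tokenize_lines(tokenizer: SupportsTokenize,
                   lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined token string per input line."""
    for line in lines:
        # Drop the trailing newline so it is not passed to the tokenizer.
        yield " ".join(tokenizer.tokenize(line.rstrip("\n")))


# With the package installed, usage would look roughly like this
# (an assumption based on the example above, not run here):
#   from mosestokenizer import MosesTokenizer
#   tokenizer = MosesTokenizer('en')
#   with open("infile", encoding="utf-8") as src:
#       for out in tokenize_lines(tokenizer, src):
#           print(out)
```

Writing the helper as a generator keeps memory use flat on large files, since only one line is held at a time.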