c++ mosestokenizer

Project description

opus-fast-mosestokenizer is a fork of fast-mosestokenizer, created to keep the package compatible with current Python environments.

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. Because the tokenizer is written in C++, the library can be used from most languages that can bind to C++ code.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
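
The project's actual benchmark lives in ./bench and is not reproduced here. As a rough illustration of how a throughput comparison like the one above could be measured, here is a minimal sketch. The helper names `lines_per_second` and `speedup` are hypothetical, and `str.split` is only a stand-in so the snippet runs without any tokenizer installed.

```python
import time


def lines_per_second(tokenize, lines, repeats=3):
    """Return the best throughput (lines per second) over `repeats` runs.

    `tokenize` is any callable that takes one string; `lines` is a list.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(lines) / best if best > 0 else float("inf")


def speedup(candidate, baseline, lines):
    """How many times faster `candidate` is than `baseline` on `lines`."""
    return lines_per_second(candidate, lines) / lines_per_second(baseline, lines)


if __name__ == "__main__":
    sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
    # str.split is a placeholder; pass a real tokenizer's method instead.
    print(f"{lines_per_second(str.split, sample):.0f} lines/s")
```

To compare two tokenizers, pass their tokenize callables to `speedup` with the same list of lines. Results depend heavily on the corpus and machine, so numbers from a sketch like this are not comparable to the figures quoted above.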

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Piping is the standard way to set the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
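
The same piping pattern can be driven from a script. The following is a minimal sketch, assuming the `mosestokenizer` executable is on PATH and accepts the language code as its only argument, as in the command above. The function name `tokenize_with_cli` is hypothetical, and the exact output format (assumed here to be one tokenized line per input line) should be checked against `mosestokenizer -h`.

```python
import shutil
import subprocess


def tokenize_with_cli(lines, lang="en", executable="mosestokenizer"):
    """Pipe `lines` through the command-line tool and return its output lines.

    Assumption: the tool reads stdin and writes one output line per input line.
    """
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    result = subprocess.run(
        [executable, lang],
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()
```

Calling `tokenize_with_cli(["Hello, world!"])` would then return whatever the tool prints for that line. Starting one process per call is expensive, so batching many lines into a single call is preferable.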

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
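
To mirror the command-line behaviour (one tokenized line per input line) from Python, the `tokenize` method shown above can be applied to each line of a stream. This is a minimal sketch: `tokenize_lines` is a hypothetical helper, and the whitespace fallback exists only so the snippet runs when the package is not installed.

```python
import sys


def tokenize_lines(lines, tokenize):
    """Yield each input line as space-joined tokens.

    `tokenize` is any callable mapping a string to a list of tokens,
    for example MosesTokenizer('en').tokenize.
    """
    for line in lines:
        yield " ".join(tokenize(line.rstrip("\n")))


if __name__ == "__main__":
    try:
        from mosestokenizer import MosesTokenizer
        tokenize = MosesTokenizer("en").tokenize
    except ImportError:
        # Fallback for illustration only; this is not Moses tokenization.
        tokenize = str.split
    for out in tokenize_lines(sys.stdin, tokenize):
        print(out)
```

Constructing the tokenizer once and reusing it for every line avoids repeated setup cost.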

Download files

Source Distribution

opus-fast-mosestokenizer-0.0.8.2.tar.gz (86.9 kB)
Uploaded for: Source

Built Distributions

opus_fast_mosestokenizer-0.0.8.2-cp310-cp310-macosx_10_15_x86_64.whl (737.2 kB)
Uploaded for: CPython 3.10, macOS 10.15+ x86-64

opus_fast_mosestokenizer-0.0.8.2-cp39-cp39-macosx_10_15_x86_64.whl (737.3 kB)
Uploaded for: CPython 3.9, macOS 10.15+ x86-64

opus_fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl (737.2 kB)
Uploaded for: CPython 3.8, macOS 10.15+ x86-64

opus_fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl (736.2 kB)
Uploaded for: CPython 3.7m, macOS 10.15+ x86-64

opus_fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl (736.2 kB)
Uploaded for: CPython 3.6m, macOS 10.15+ x86-64

File details

Details for the file opus-fast-mosestokenizer-0.0.8.2.tar.gz.

File hashes

Hashes for opus-fast-mosestokenizer-0.0.8.2.tar.gz
Algorithm Hash digest
SHA256 5c63ff5e83c126f881c746058506929fe618ba5546d1f50f690c233d7c5bd4ab
MD5 548903f186e4bdd2ec5695f1d3a37ea9
BLAKE2b-256 a75293ae6b3fa18f8d0e1b36bffaef0bee20b8ff3dc6042e00805de8fd1c49d5

See more details on using hashes here.

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp310-cp310-manylinux1_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp310-cp310-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 6805309d461a2da8bdae870ea4b065589426823c3d3cf8c5e6d9a17f742cee35
MD5 11a9f791a4bf824de9ced5ec1f54f442
BLAKE2b-256 dd568326a08c9186ba51a5d469f7fdde9d31c39cfacd256b3e4b079b15a5a6ec

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp310-cp310-macosx_10_15_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 49741eee7929c5021deebedd44e86ef1a598762bec9527e6b562f7f5e372c556
MD5 4b82809ec8516c3d4e8fd04fb9b43d3c
BLAKE2b-256 3265ab28f7b8bfad572cea26933829e75ed5c4c32c1f5e1669ed414a63defd8f

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp39-cp39-manylinux1_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp39-cp39-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 27b4cd24127e6b96782f8818ce8df39255473e69a65916ee1075c0b2dcfacf17
MD5 202735927326db6edae1a16a6bdcd53e
BLAKE2b-256 f877caca2118d277ecdce334046c2a436c590714e9cf6c8993504b4fd48489a3

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp39-cp39-macosx_10_15_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 8c4ed9bfdb7e27e55acd553044d2d78d0163141fa965e6b42d84025692cd7da6
MD5 c475cb944c946883cd0ab984a1f0cb74
BLAKE2b-256 a339898c559cab9b198ea405f777e26f68db0838c616a85437862b41683d1b18

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp38-cp38-manylinux1_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 4d754c8b88a82eba4099fa84d37835ca2165f5d7d4db02739c8e55df6d80f602
MD5 1b0b783d2db650053be32b115e4c3f63
BLAKE2b-256 f8f1d30a324fe1d02c054afd7647331d0c97479858328cdfa727cb6bd8ee3167

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 1d3c73063741f25c4063b78a76171537d2f9f5f4f4bf6a79d91cb4be165f37f0
MD5 247a43d226d2c7cd3f2b8cb9b6a497b3
BLAKE2b-256 3c17ed950b92a0fe607f82e640aeb79651072a619a0aa708b38aa8f0c171fce9

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp37-cp37m-manylinux1_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 924485c0014849c8841b650eb96841466ce5081d475c75b65926e320c3474273
MD5 337bd0c37b8e21e2c9ca85b8f5945663
BLAKE2b-256 564ceec3b7088c0b0a1b9759fa3ffb464ea728aa096b9421016a43e17704e2a0

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 c309c1a508ad3fb00e23152c1b2b7cfab4c5ae8df4c61a74ef26ad40f5441c7a
MD5 17523b82b1fa4cf3b5f0600a735b03e9
BLAKE2b-256 d30848c10bfcc20c6b9f2bd3d972fd00ea575c40f3f5e258d5b0b24db18b3425

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp36-cp36m-manylinux1_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 7eb78db84a406b8d9070ed7ccb8abc2ee260290378945c1d2c8da0ecf9862a57
MD5 b8ff380e2b60eb1cce5f0d3a397fcd60
BLAKE2b-256 5be463dd302fa8c12d94ca4cce063049ff1c5e4c28a4f2489ca5ea881ce105c3

File details

Details for the file opus_fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl.

File hashes

Hashes for opus_fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 19fca77e9457c34d5410ee805eb94ee9b05c2f3cc661ec0af50d7a004380492e
MD5 6cdb38fc988d6b014412158f66891117
BLAKE2b-256 0db08b6a01da7e0210dec417a12b83c5f43a48c5027546decfb3ecd809232898
