c++ mosestokenizer

Project description

opus-fast-mosestokenizer is a fork of fast-mosestokenizer, created to keep the package compatible with current Python environments.

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability: because the tokenizer is written in C++, the library can be used from nearly any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
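To check speed ratios like these on your own data, a small timing harness is enough. The sketch below is not the project's benchmark script (that lives under ./bench); the helper names `best_time` and `speedups` are hypothetical, and `str.split` is only a stand-in so the block runs without any tokenizer installed.

```python
import time
from typing import Callable, Dict, List, Sequence


def best_time(tokenize: Callable[[str], List[str]],
              lines: Sequence[str],
              repeats: int = 3) -> float:
    """Return the fastest wall-clock time (seconds) over `repeats` full passes."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedups(times: Dict[str, float], reference: str) -> Dict[str, float]:
    """Express every timing as a multiple of the reference timing.

    A value of 6.0 means that implementation took 6x as long as the
    reference, i.e. the reference is 6x faster.
    """
    ref = times[reference]
    return {name: t / ref for name, t in times.items()}


if __name__ == "__main__":
    # Stand-in tokenizer (assumption): swap in the tokenize functions
    # you actually want to compare.
    sample = ["Hello , world !"] * 1000
    elapsed = best_time(str.split, sample)
    print(f"str.split: {elapsed:.4f}s for {len(sample)} lines")
```

Taking the minimum over several passes reduces noise from other processes; pass each tokenizer's `tokenize` method as the callable and feed the resulting timings to `speedups`.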

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Use shell redirection (or pipes) to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
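Because the command-line tool reads stdin and writes stdout, it can also be driven from a script. The following is a minimal sketch using only the Python standard library; `run_line_filter` is a hypothetical helper, and the demo uses a stand-in filter (a Python one-liner that upper-cases its input) so that it runs even when mosestokenizer is not on the PATH.

```python
import subprocess
import sys
from typing import List, Sequence


def run_line_filter(command: Sequence[str], lines: Sequence[str]) -> List[str]:
    """Feed `lines` to `command` on stdin and return its stdout as a list of lines."""
    payload = "\n".join(lines) + "\n"
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Stand-in filter (assumption): upper-cases stdin, used only for the demo.
    demo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_line_filter(demo, ["hello world"]))
    # With the package installed, the invocation shown above would be:
    # run_line_filter(["mosestokenizer", "en"], ["Hello, world!"])
```

Sending all lines in one process call avoids paying the start-up cost of the tool once per sentence.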

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
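For corpus preprocessing, the same `tokenize` call is usually applied line by line and the tokens are joined back with spaces. The sketch below assumes only what the session above shows, namely an object with a `tokenize(text)` method that returns a list of strings; `tokenize_lines` and `WhitespaceTokenizer` are hypothetical names, and the whitespace tokenizer is a stand-in so the block runs without the package.

```python
from typing import Iterable, Iterator, List


class WhitespaceTokenizer:
    """Hypothetical stand-in with the same tokenize(text) -> list[str] shape."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


def tokenize_lines(tokenizer, lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined token string per non-empty input line."""
    for line in lines:
        stripped = line.strip()
        if stripped:  # skip blank lines
            yield " ".join(tokenizer.tokenize(stripped))


if __name__ == "__main__":
    corpus = ["first  line\n", "\n", "second line\n"]
    for out in tokenize_lines(WhitespaceTokenizer(), corpus):
        print(out)
    # With the package installed, pass MosesTokenizer('en') from the
    # session above instead of the stand-in.
```

Writing the helper as a generator keeps memory use flat, so an open file handle can be passed directly as `lines`.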
