
Fast and Efficient Sentence Tokenization

Project description

Fast Sentence Tokenizer (fast-sentence-tokenize)

A best-in-class tokenizer.

Usage

Import

from fast_sentence_tokenize import fast_sentence_tokenize

Call Tokenizer

results = fast_sentence_tokenize("isn't a test great!!?")

Results

[
   "isn't",
   "a",
   "test",
   "great",
   "!",
   "!",
   "?"
]

Note that whitespace is not preserved in the output by default.

This generally results in a more accurate parse from downstream components, but may make the reassembly of the original sentence more challenging.
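For example, naively re-joining the default tokens with single spaces yields a string that no longer matches the original input:

results = fast_sentence_tokenize("isn't a test great!!?")

# single-space joining re-inserts a space before each punctuation token
assert ' '.join(results) == "isn't a test great ! ! ?"  # != the original text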

Preserve Whitespace

results = fast_sentence_tokenize("isn't a test great!!?", eliminate_whitespace=False)

Results

[
   "isn't ",
   "a ",
   "test ",
   "great",
   "!",
   "!",
   "?"
]

This option preserves whitespace, which is useful if you want to re-assemble the tokens using the pre-existing spacing:

assert ''.join(tokens) == input_text
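A complete round-trip sketch, defining the tokens and input_text names used in the assertion above:

from fast_sentence_tokenize import fast_sentence_tokenize

input_text = "isn't a test great!!?"

# keep the trailing whitespace on each token so concatenation is lossless
tokens = fast_sentence_tokenize(input_text, eliminate_whitespace=False)

assert ''.join(tokens) == input_text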



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
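Alternatively, the package can be installed directly from PyPI with pip:

pip install fast-sentence-tokenize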

Source Distribution

fast_sentence_tokenize-0.1.15.tar.gz (9.3 kB, Source)

Built Distribution

fast_sentence_tokenize-0.1.15-py3-none-any.whl (13.9 kB, Python 3)

File details

Details for the file fast_sentence_tokenize-0.1.15.tar.gz.

File metadata

  • Download URL: fast_sentence_tokenize-0.1.15.tar.gz
  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/22.5.0

File hashes

Hashes for fast_sentence_tokenize-0.1.15.tar.gz

  • SHA256: 0f5d8f5691f8dc41e321eac720ddaf1cb59fd33259e5482f78992e26162ac294
  • MD5: c3ab532f89691946b53b66991e91a87b
  • BLAKE2b-256: 36591c68d48388ab9d7e6e77a2d6029d94317159bd1d6dadb6a533facd99cdf1


File details

Details for the file fast_sentence_tokenize-0.1.15-py3-none-any.whl.


File hashes

Hashes for fast_sentence_tokenize-0.1.15-py3-none-any.whl

  • SHA256: 85eed0ba762a6f919c7628b8c6951c5a09abf8f0544bfcf5add033c0e59e0b8d
  • MD5: 6f255453224b8296ff8dab0677c56b88
  • BLAKE2b-256: 0ebc4f5de44e36700aff3303c1def32e7d155146cd9070d7f92ca8904e9983c2

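These digests can be used with pip's hash-checking mode: pin the release and its published SHA256 digests in a requirements file, and pip will reject any downloaded file that does not match. A minimal sketch using the digests above:

# requirements.txt
fast-sentence-tokenize==0.1.15 \
    --hash=sha256:0f5d8f5691f8dc41e321eac720ddaf1cb59fd33259e5482f78992e26162ac294 \
    --hash=sha256:85eed0ba762a6f919c7628b8c6951c5a09abf8f0544bfcf5add033c0e59e0b8d

pip install --require-hashes -r requirements.txt

Note that in hash-checking mode, pip requires every dependency to be pinned with hashes as well.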
