text-sentence is a text tokenizer and sentence splitter

Text tokenizer and sentence splitter

The “text-sentence” library is a text tokenizer and sentence splitter.

The input to the main function is a text and a list of known names and abbreviations. The result is a list of tokens. Each token has a type and other attributes, e.g.:

  • is word,
  • is number,
  • is roman number,
  • is sentence end,
  • is abbreviation,
  • is name,
  • is end of chapter,
  • etc.

Determining the end of a sentence needs special logic and care, which is the main reason the package is named “text-sentence”.
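To illustrate why sentence ends need special care, here is a minimal sketch of a naive splitter that skips periods belonging to known abbreviations. It is not the algorithm used by text-sentence; the function name `split_sentences` and its `abbreviations` parameter are hypothetical and exist only for this illustration.

```python
import re


def split_sentences(text, abbreviations=()):
    """Naive splitter (illustration only, not the text-sentence algorithm).

    Breaks after '.', '!' or '?', except when a period directly follows
    a word listed in `abbreviations`.
    """
    # Normalize abbreviations: lower case, without the trailing period.
    known = {a.lower().rstrip(".") for a in abbreviations}
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        if match.group() == ".":
            # Word immediately before the period, if any.
            words = re.findall(r"\w+$", text[start:match.start()])
            if words and words[0].lower() in known:
                continue  # the period belongs to an abbreviation
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Without the abbreviation list, "Dr. Smith arrived." would be cut after "Dr."; with `abbreviations=["Dr."]` it stays one sentence. The sketch still breaks on cases such as decimal numbers or initials, which is the kind of detail that makes real sentence splitting harder than it looks.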


tokenization, sentence splitter, sentencer, chapter, names, abbreviation


Robert Lujo, Zagreb, Croatia, find mail address in LICENCE


To name the most important:
  • TODO: …

The system is based on Unicode strings.

Check Getting started.


Installation instructions: if you have pip installed, run:

pip install text-sentence
If not, then install it the old-fashioned way:

The development version can be found at

or cloned with Mercurial:

hg clone


Usage example, run in a Python shell:

>>> from text_sentence import Tokenizer
>>> t = Tokenizer()
>>> list(t.tokenize("This is first sentence. This is second one!And this is third, is it?"))
[T('this'/sent_start), T('is'), T('first'), T('sentence'), T('.'/sent_end),
 T('this'/sent_start), T('is'), T('second'), T('one'), T('!'/sent_end),
 T('and'/sent_start), T('this'), T('is'), T('third'), T(','/inner_sep),
 T('is'), T('it'), T('?'/sent_end)]

More samples can be found in the tests:


Since there is currently no good documentation, the best source of further information is the tests inside the module and the tests in test_sentence.txt. See Running tests for more information. You can always read the source.


Currently there is no documentation. In progress …


Since this project is limited by my free time, support is limited.


If you encounter a bug, it is best to report it on the Bitbucket web page

The best way to contact me is by mail (find in LICENCE).

The TODO list is in readme.txt (dev version).


Since this project does not yet have a stable API, contributions should wait for a while.


All tests are doctests (not unittests). There are two types of tests in the package:

  1. doctests in the module, i.e. in
  2. doctests in test_sentence.txt

Running module directly will run 1. and 2.
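As a rough sketch of how one module can run both kinds of doctests, the standard library's `doctest.testmod` and `doctest.testfile` can be combined as below. This is not the package's actual test runner; the helper names `run_module_doctests` and `run_file_doctests` and the sample `add` function are hypothetical, and only the file name test_sentence.txt comes from the list above.

```python
import doctest
import os
import sys


def add(a, b):
    """Return the sum of a and b (sample function carrying a doctest).

    >>> add(2, 3)
    5
    """
    return a + b


def run_module_doctests(module):
    """Run doctests found in the docstrings of `module`."""
    result = doctest.testmod(module)
    return result.failed, result.attempted


def run_file_doctests(path):
    """Run doctests stored in a plain-text file such as test_sentence.txt."""
    # module_relative=False makes `path` an ordinary file-system path.
    result = doctest.testfile(path, module_relative=False)
    return result.failed, result.attempted


if __name__ == "__main__":
    print("__main__: running doctests")
    print(run_module_doctests(sys.modules[__name__]))
    # Assumed location: a text file next to the current working directory.
    if os.path.exists("test_sentence.txt"):
        print("test_sentence.txt: running doctests")
        print(run_file_doctests("test_sentence.txt"))
```

Each helper returns a `(failed, attempted)` pair, so a caller can check that the failure count is zero for both the module and the text file.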

To run tests:
  • go to the text_sentence directory

  • run tests by running module, e.g.:

    > python
    __main__: running doctests
    test_sentence.txt: running doctests
  • or with:

    > python -m"text_sentence"


Various things; see readme.txt in the dev version for details.



ulr1 100619:
  • sample in getting started


ulr1 100619:
  • test_sentence.txt installation
  • readme fix main title


ulr1 100618:
  • adapted tests
  • and


ulr1 100617:
  • first installable release
