segtok

sentence segmentation and word tokenization tools

The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter splits (Indo-European) text into sentences. The tokenizer splits (Indo-European) sentences into words and symbols (collectively called tokens). Both modules can also be used from the command line. While other Indo-European languages might work, segtok has only been designed with languages such as Spanish, English, German, Italian, and French in mind. Extending it to languages outside that family (e.g., CJK) is likely to be considerably harder.

Install

To install this package, you need a current official release of Python. The easiest way to install segtok is with pip or any other package manager that works with PyPI:

pip install segtok

Then try the command line tools on some plain-text files (e.g., this README) to see if segtok meets your needs:

segmenter README.rst | tokenizer

Usage

For details, please refer to the documentation of the respective module; this README only gives an overview of the provided functionality.

A command-line

After installing the package, two command-line tools are available: segmenter and tokenizer. Each takes UTF-8 encoded plain text and transforms it into newline-separated sentences or tokens, respectively. The tokenizer assumes that each line contains at most one sentence, which is the output format of the segmenter. To learn more about either tool, invoke it with its help option (-h or --help).
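The shell pipeline shown in the Install section can also be driven from Python. The following is a minimal sketch that assumes only what this README states: the segmenter and tokenizer executables are on the PATH after installation, segmenter accepts a file path, and tokenizer reads the segmenter's output from standard input. The helper name run_pipeline is hypothetical and not part of segtok.

```python
import shutil
import subprocess


def run_pipeline(path):
    """Run the equivalent of `segmenter PATH | tokenizer`.

    Returns the tokenizer output as text, or None if either tool is
    not installed. `run_pipeline` is a hypothetical helper, not segtok API.
    """
    if not (shutil.which("segmenter") and shutil.which("tokenizer")):
        return None
    # Step 1: split the file into one sentence per line.
    segmented = subprocess.run(
        ["segmenter", path], capture_output=True, check=True
    )
    # Step 2: feed those lines to the tokenizer on standard input.
    tokenized = subprocess.run(
        ["tokenizer"], input=segmented.stdout, capture_output=True, check=True
    )
    # Both tools work with UTF-8 encoded plain text.
    return tokenized.stdout.decode("utf-8")
```

Calling run_pipeline("README.rst") mirrors the command line example above. The sketch passes bytes between the two processes unchanged and only decodes the final result, so the one-sentence-per-line contract between the tools is preserved.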

B segtok.segmenter

This module provides several split_... functions to segment texts into lists of sentences. In addition, to_unix_linebreaks normalizes linebreaks (including the Unicode linebreak) to newline control characters (\n). The function rewrite_line_separators can be used to move (rewrite) the newline separators in the input text so that they are placed at the sentence segmentation locations.
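To make the linebreak normalization concrete, here is a minimal standard-library sketch of the idea behind to_unix_linebreaks. It is not segtok's implementation: the function name normalize_linebreaks and the exact set of characters treated as linebreaks are assumptions chosen for illustration.

```python
import re

# Assumption: an illustrative set of line terminators; segtok's own list may differ.
# "\r\n" is listed first so a Windows linebreak becomes one newline, not two.
_LINEBREAK = re.compile("\r\n|[\r\x0b\x0c\x85\u2028\u2029]")


def normalize_linebreaks(text):
    """Replace every recognized linebreak with a single newline character."""
    return _LINEBREAK.sub("\n", text)


sample = "First line.\r\nSecond line.\u2028Third line.\rFourth line."
print(normalize_linebreaks(sample).split("\n"))
# ['First line.', 'Second line.', 'Third line.', 'Fourth line.']
```

Normalizing first means that any later step, such as sentence splitting, only has to deal with one kind of line separator.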

C segtok.tokenizer

This module provides several ..._tokenizer functions to tokenize input sentences into words and symbols. In addition, it provides convenience functionality for English texts: two compiled patterns (IS_...) can be used to detect whether a word token contains a possessive-s marker (“Frank’s”) or is an apostrophe-based contraction (“didn’t”). Tokens that match these patterns can then be split using the split_possessive_markers and split_contractions functions, respectively.
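The detect-then-split idea described above can be illustrated with a short standard-library sketch. The names POSSESSIVE, CONTRACTION, and split_matching are hypothetical stand-ins, and the regular expressions are simplified assumptions; segtok's own IS_... patterns and split functions may behave differently.

```python
import re

# Hypothetical stand-ins for the module's IS_... patterns (assumptions, not segtok code).
# Both the ASCII apostrophe (') and the typographic one (U+2019) are accepted.
POSSESSIVE = re.compile(r"^(.+)(['\u2019][sS])$")
CONTRACTION = re.compile(
    r"^(\w+?)(n['\u2019]t|['\u2019](?:ll|re|ve|d|m))$", re.IGNORECASE
)


def split_matching(tokens, pattern):
    """Split each token that matches `pattern` into its two captured parts."""
    result = []
    for token in tokens:
        match = pattern.match(token)
        if match:
            result.extend(match.groups())  # e.g. ("did", "n't")
        else:
            result.append(token)  # leave non-matching tokens untouched
    return result


print(split_matching(["Frank's", "dog"], POSSESSIVE))  # ['Frank', "'s", 'dog']
print(split_matching(["didn't", "bark"], CONTRACTION))  # ['did', "n't", 'bark']
```

Note that this simplified possessive pattern also matches tokens such as "it's", where the suffix is a contraction rather than a possessive; a pattern alone cannot tell the two apart.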

Legal

License: MIT

History

1.1.2 fixed Unicode list of valid sentence terminals (was missing U+2048)
1.1.1 fixed PyPI setup (missing MANIFEST.in for README.rst and “packages” in setup.py)
1.1.0 added possessive-s marker and apostrophe contraction splitting of tokens
1.0.0 initial release

Release history

1.5.11 Dec 15, 2021
1.5.10 May 12, 2020
1.5.9 Apr 9, 2020
1.5.8 Apr 9, 2020
1.5.7 Aug 3, 2018
1.5.6 May 20, 2017
1.5.5 Apr 17, 2017
1.5.4 Mar 2, 2017
1.5.3 Mar 2, 2017
1.5.2 Jan 17, 2017
1.5.1 Sep 14, 2015
1.5.0 Jul 22, 2015
1.4.0 Jun 23, 2015
1.3.1 Jun 22, 2015
1.3.0.0 Mar 27, 2015
1.3.0 Mar 27, 2015
1.2.2 Jan 20, 2015
1.2.1 Jan 14, 2015
1.2.0 Jan 13, 2015
1.1.2 Jan 12, 2015 (this version)
1.1.1 Dec 14, 2014
1.1.0 Nov 27, 2014
1.0.0 Nov 27, 2014

Download files

Source Distribution

segtok-1.1.2.tar.gz (16.2 kB)

Uploaded Jan 12, 2015 Source

File details

Details for the file segtok-1.1.2.tar.gz.

File metadata

Download URL: segtok-1.1.2.tar.gz
Upload date: Jan 12, 2015
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for segtok-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`9df7fe9a2a56a1cc3d734d5dff2550406075cac2bf00e9819eea69619f0a6dc2`
MD5	`9fe80df3f612687ba3896b66fa739746`
BLAKE2b-256	`98fdb5361e4b10a9bf0b551bbd5764de3de76a7823b2fcda718bc6e03f586283`
