XML cleaner
Word and sentence tokenization in Python. Tested in Python 3.4.3 and 2.7.12.
[![PyPI version](https://badge.fury.io/py/xml-cleaner.svg)](https://badge.fury.io/py/xml-cleaner) ![Jonathan Raiman, author](https://img.shields.io/badge/Author-Jonathan%20Raiman%20-blue.svg)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE.md)
Usage
Use this package to split up strings according to sentence and word boundaries. For instance, to simply break up strings into tokens:
```python
tokenize("Joey was a great sailor.")
#=> ["Joey ", "was ", "a ", "great ", "sailor ", "."]
```
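Whitespace-preserving tokenization of this kind can be approximated with a single regular expression. The sketch below is illustrative only — `naive_tokenize` is a hypothetical helper, not xml_cleaner's actual algorithm — but it shows the key property: each token keeps its trailing whitespace, so concatenating the tokens reconstructs the original string.

```python
import re

def naive_tokenize(text):
    # Illustrative sketch, not xml_cleaner's implementation: match either
    # a run of non-space, non-punctuation characters or a single
    # punctuation mark, each with its trailing whitespace attached, so
    # that "".join(tokens) reproduces the input exactly.
    return re.findall(r"[^\s.,;!?]+\s*|[.,;!?]\s*", text)

tokens = naive_tokenize("Joey was a great sailor.")
print(tokens)  # ['Joey ', 'was ', 'a ', 'great ', 'sailor', '.']
```

Note that this toy version attaches no space to `'sailor'` (the next character is the period), whereas the real library's output above does; the round-trip property still holds either way.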
To also detect sentence boundaries:
```python
sent_tokenize("Cat sat mat. Cat's named Cool.", keep_whitespace=True)
#=> [["Cat ", "sat ", "mat", ". "], ["Cat ", "'s ", "named ", "Cool", "."]]
```
`sent_tokenize` can keep the whitespace as-is with the flags `keep_whitespace=True` and `normalize_ascii=False`.
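Sentence boundary detection can be layered on top of such a token stream. The following minimal sketch (again hypothetical, not xml_cleaner's implementation) groups whitespace-preserving tokens into sentences whenever a sentence-final punctuation token appears:

```python
import re

def naive_sent_tokenize(text):
    # Illustrative sketch, not xml_cleaner's implementation: tokenize
    # with trailing whitespace kept, then start a new sentence after
    # each sentence-final punctuation token ('.', '!', '?').
    tokens = re.findall(r"[^\s.,;!?]+\s*|[.,;!?]\s*", text)
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token.strip() in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final period
        sentences.append(current)
    return sentences

print(naive_sent_tokenize("Cat sat mat. Cat's named Cool."))
```

Unlike the real `sent_tokenize`, this sketch does not split the possessive `'s` into its own token, and it has no special handling for abbreviations or ellipses — cases a production tokenizer must address.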
Installation
```
pip3 install xml_cleaner
```
Testing
Run `nose2`.
Hashes for xml_cleaner-2.0.2-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `275bbd9662973a129204cc52393277d0d8e95927d8e4cbc1dc30fb1268ac5a87` |
| MD5 | `cecd56ea21fc91d89fba7f9fa55aee0b` |
| BLAKE2b-256 | `85687e1e588cbc8d0da753ec49f2cd83bf36e188f03390bf99dec7a2fdbf0f89` |