XML cleaner
Word and sentence tokenization in Python. Tested in Python 3.4.3 and 2.7.12.
![Jonathan Raiman, author](https://img.shields.io/badge/Author-Jonathan%20Raiman%20-blue.svg)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE.md)
Usage
Use this package to split up strings according to sentence and word boundaries. For instance, to simply break up strings into tokens:
` tokenize("Joey was a great sailor.") #=> ["Joey ", "was ", "a ", "great ", "sailor ", "."] `
To also detect sentence boundaries:
` sent_tokenize("Cat sat mat. Cat's named Cool.", keep_whitespace=True) #=> [["Cat ", "sat ", "mat", ". "], ["Cat ", "'s ", "named ", "Cool", "."]] `
`sent_tokenize` can keep the whitespace as-is with the flags `keep_whitespace=True` and `normalize_ascii=False`.
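To make the whitespace-preserving behaviour concrete, here is a minimal, self-contained sketch of the idea. It is not xml_cleaner's implementation: `simple_tokenize` and `simple_sent_tokenize` are hypothetical stand-ins built on the standard `re` module, and their rules are much cruder than the package's (for example, they split `Cat's` into `Cat`, `'`, `s` and treat every `.`, `!` or `?` as a sentence end). What the sketch does show is the property that keeping whitespace buys you: each token carries its trailing whitespace, so joining the tokens gives back the input string.

```python
import re

# Hypothetical stand-in, NOT xml_cleaner's code: a token is a run of word
# characters or a single punctuation mark, plus any whitespace that follows it.
_TOKEN = re.compile(r"(?:\w+|[^\w\s])\s*")


def simple_tokenize(text):
    """Split text into tokens that keep their trailing whitespace."""
    return _TOKEN.findall(text)


def simple_sent_tokenize(text):
    """Group tokens into sentences, closing one after '.', '!' or '?'."""
    sentences, current = [], []
    for token in simple_tokenize(text):
        current.append(token)
        if token.strip() in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without closing punctuation
        sentences.append(current)
    return sentences


text = "Cat sat mat. Cat's named Cool."
sentences = simple_sent_tokenize(text)
# Because whitespace is kept on each token, the input can be rebuilt.
rebuilt = "".join(token for sentence in sentences for token in sentence)
print(rebuilt == text)
```

The round trip holds for any input that does not start with whitespace, since this sketch only attaches whitespace to the token before it.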
Installation
` pip3 install xml_cleaner `
Testing
Run nose2.
Download files
Source Distribution
File details
Details for the file xml-cleaner-2.0.0.tar.gz.
File metadata
- Download URL: xml-cleaner-2.0.0.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 3dae3898392778be4715190c43ffe1e9b69896d2b60e2bf6462cfcc1c6da2503 |
MD5 | 9858e5b4c31db157c543f25c5d4ac6ad |
BLAKE2b-256 | 034abb4c5a1def3c2345708c33f8ed211369aab9ae6f324f233acdf8e5f2c65b |
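The digests listed above can be checked against a downloaded copy of the archive. The sketch below uses only the standard `hashlib` module; `file_digests` is a hypothetical helper name, and the example assumes Python 3.6 or later (where `hashlib.blake2b` is available) and that the archive was saved under its listed file name in the current directory.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return the SHA256, MD5 and BLAKE2b-256 hex digests of a file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

For example, `file_digests("xml-cleaner-2.0.0.tar.gz")["SHA256"]` should equal the SHA256 value in the table if the download is intact.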