
# XML cleaner

Word and sentence tokenization in Python. Tested with Python 3.4.3 and 2.7.12.

[![PyPI version](https://badge.fury.io/py/xml-cleaner.svg)](https://badge.fury.io/py/xml-cleaner) ![Jonathan Raiman, author](https://img.shields.io/badge/Author-Jonathan%20Raiman%20-blue.svg)

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE.md)

## Usage

Use this package to split strings on sentence and word boundaries. For instance, to break a string into word tokens:

```python
from xml_cleaner import tokenize

tokenize("Joey was a great sailor.")
#=> ["Joey ", "was ", "a ", "great ", "sailor ", "."]
```
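To make the whitespace-preserving behavior concrete, here is a minimal standalone sketch of the same idea (a hypothetical `naive_tokenize`, not the package's actual implementation): each token keeps its trailing whitespace attached, so joining the tokens reconstructs the input exactly.

```python
import re

def naive_tokenize(text):
    # Hypothetical illustration, not xml_cleaner's algorithm:
    # match runs of word characters (plus apostrophes) or single
    # punctuation marks, each with any trailing whitespace attached.
    return re.findall(r"[A-Za-z0-9']+\s*|[^A-Za-z0-9\s]\s*", text)

tokens = naive_tokenize("Cat sat mat.")
# Joining the tokens round-trips to the original string.
assert "".join(tokens) == "Cat sat mat."
```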

To also detect sentence boundaries:

```python
from xml_cleaner import sent_tokenize

sent_tokenize("Cat sat mat. Cat's named Cool.", keep_whitespace=True)
#=> [["Cat ", "sat ", "mat", ". "], ["Cat ", "'s ", "named ", "Cool", "."]]
```

`sent_tokenize` preserves the original whitespace when called with `keep_whitespace=True`; pass `normalize_ascii=False` to skip ASCII normalization of the text.
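The sentence-grouping step can be sketched in the same naive style (again a hypothetical illustration, not the package's implementation): tokenize with whitespace attached, then close out a sentence whenever a token is a terminal punctuation mark.

```python
import re

def naive_sent_tokenize(text):
    # Hypothetical illustration, not xml_cleaner's algorithm.
    tokens = re.findall(r"[A-Za-z0-9']+\s*|[^A-Za-z0-9\s]\s*", text)
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        # A sentence ends at ".", "!" or "?" (ignoring trailing whitespace).
        if tok.strip() in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final terminator
        sentences.append(current)
    return sentences
```

Because every token keeps its whitespace, flattening and joining the result reproduces the input string.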

## Installation

```shell
pip3 install xml_cleaner
```

## Testing

Run `nose2`.
