Word and sentence tokenization in Python.
[![PyPI version](https://badge.fury.io/py/xml-cleaner.svg)](https://badge.fury.io/py/xml-cleaner) [![Build Status](https://travis-ci.org/JonathanRaiman/xml_cleaner.svg?branch=master)](https://travis-ci.org/JonathanRaiman/xml_cleaner) ![Jonathan Raiman, author](https://img.shields.io/badge/Author-Jonathan%20Raiman%20-blue.svg)
Use this package to split up strings according to sentence and word boundaries. For instance, to simply break up strings into tokens:
```python
tokenize("Joey was a great sailor.")
#=> ["Joey ", "was ", "a ", "great ", "sailor ", "."]
```
To also detect sentence boundaries:
```python
sent_tokenize("Cat sat mat. Cat's named Cool.", keep_whitespace=True)
#=> [["Cat ", "sat ", "mat", ". "], ["Cat ", "'s ", "named ", "Cool", "."]]
```
Two flags control how `sent_tokenize` treats the input: `keep_whitespace=True` preserves the original whitespace on each token, and `normalize_ascii=False` skips normalizing non-ASCII characters in the input.
Install from PyPI:

```
pip3 install xml_cleaner
```
| Filename | Size | File type | Python version |
|---|---|---|---|
| xml-cleaner-2.0.4.tar.gz | 10.8 kB | Source | None |