# XML cleaner

Word and sentence tokenization in Python.
[![PyPI version](https://badge.fury.io/py/xml-cleaner.svg)](https://badge.fury.io/py/xml-cleaner) [![Build Status](https://travis-ci.org/JonathanRaiman/xml_cleaner.svg?branch=master)](https://travis-ci.org/JonathanRaiman/xml_cleaner) ![Jonathan Raiman, author](https://img.shields.io/badge/Author-Jonathan%20Raiman%20-blue.svg)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE.md)
## Usage
Use this package to split strings along sentence and word boundaries. For instance, to break a string into word tokens:

```python
from xml_cleaner import tokenize

tokenize("Joey was a great sailor.")
#=> ["Joey ", "was ", "a ", "great ", "sailor ", "."]
```
To also detect sentence boundaries:

```python
from xml_cleaner import sent_tokenize

sent_tokenize("Cat sat mat. Cat's named Cool.", keep_whitespace=True)
#=> [["Cat ", "sat ", "mat", ". "], ["Cat ", "'s ", "named ", "Cool", "."]]
```
`sent_tokenize` can keep the whitespace as-is with the flags `keep_whitespace=True` and `normalize_ascii=False`.
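The `keep_whitespace` behavior can be illustrated with a small regex-based sketch (this is not the library's actual implementation): each token carries its trailing whitespace, so concatenating the tokens reconstructs the original string exactly.

```python
import re

def tokenize_keep_whitespace(text):
    """Split text into word and punctuation tokens, attaching any
    trailing whitespace to the preceding token (illustrative only)."""
    # \w+ matches a run of word characters, [^\w\s] a single
    # punctuation mark; \s* glues the following whitespace onto the token.
    return re.findall(r"(?:\w+|[^\w\s])\s*", text)

tokens = tokenize_keep_whitespace("Joey was a great sailor.")
# Concatenating the tokens restores the input exactly:
assert "".join(tokens) == "Joey was a great sailor."
```

This round-trip property is what makes whitespace-preserving tokenization useful when you need to map tokens back to character offsets in the original document.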
## Installation

```
pip3 install xml_cleaner
```
## Testing

Run the test suite with `nose2`:

```
nose2
```
## Hashes for xml_cleaner-2.0.3-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ec4fe5a2d98476c7702258920d97891192d813e64fd3dcf0a468f69d7b4abf26 |
| MD5 | aa55a34f3ec31a6d600b36510ad15a93 |
| BLAKE2b-256 | d3b5ec5a6237c5cebb17b2751ab3bbdd7c6fb7ad868530d91100899bd38fd617 |