Skip to main content

purify text for NLP.

Project description

# text-cleaner, simple text preprocessing tool

## Introduction

* Support Python 2.7, 3.3, 3.4, 3.5.
* Simple interfaces.
* Easy to extend.

## Install

pip install text-cleaner

**WARNING FOR PYTHON 2.7 USERS**: Only UCS-4 build is supported(`--enable-unicode=ucs4`), UCS-2 build ([see this]( is **NOT SUPPORTED** in the latest version.

## Usage

from text_cleaner import remove, keep

from text_cleaner.processor.common import ASCII
from text_cleaner.processor.chinese import CHINESE, CHINESE_SYMBOLS_AND_PUNCTUATION
from text_cleaner.processor.misc import RESTRICT_URL

# remove url and ascii characters.
# return: u'点击 查看 '
'点击 查看,123456,test',

# remove only Chinese punctuation.
# return: u'点击 查看,123456,test '
'点击:, 查看,123456,test。!?',

# keep chinese characters and url.
# return: u'点击 查看'
'点击 查看,123456,test',

# use processor directly.
# return: u'点击 查看'
RESTRICT_URL.remove('点击 查看')
# return: u'点击<URL> 查看'
RESTRICT_URL.replace('<URL>').remove('点击 查看')

## Interfaces

*text_cleaner.remove(text, processors)*:

* *text*: `str` or `bytes` (`unicode` or `str` for Python 2).
* *processors*: iterable of processors. *remove* invokes `remove` of each processor to handle *text*.

*text_cleaner.keep(text, processors)*:

* same as *remove*, but invoke `keep` method of processors instead.

## Processors

*DEFAULT\_REPLACE\_TEXT*: `' '`, single space.

*RegexProcessor(regex, replace\_text=DEFAULT\_REPLACE\_TEXT)*

* contruct a regex processor for *regex*, replace unmatched components with *replace\_text*.
* *replace(self, new\_replace\_text)*: create a new processor, with new *replace\_text* is set.
* *remove(self, text)*: remove all occurences of *regex* from *text*.
* *keep(self, text)*: keep only the occurences of *regex*, remove all unmatched components from *text*.
* *verify(self, text)*: return *True* if text match *regex*, otherwise returns *False*.

*UnicodeRange(begin, end)*:

* *begin*: *int*, the begin of unicode range.
* *end*: *int*, the end of unicode range.

*UnicodeRangeProcessor(ranges, replace\_text=DEFAULT\_REPLACE\_TEXT)*

* subclass of *RegexProcessor*.
* *ranges*: iterable of instances of *UnicodeRange*.

## Built-in Processors

Following processors are defined by *UnicodeRange* and regex. Read the source code if you are sure about what's going on.

`text_cleaner.processor.common`, for common usage:


`text_cleaner.processor.misc`, misellanious processors:

* `URL`

`text_cleaner.processor.chinese`, Chinese processing:

* `CHINESE_CHARACTER`: only common characters.
* `CHINESE`: common characters + symbols and puntuations.
* `CHINESE_ALL`: all CJK characters.


How to define URLs is a complex problem.
We provide two choices for our users.

* `URL`: truncate urls till whitespaces.
* `RESTRICT_URL`: truncate urls till non-whitespace ASCII ([!-~] in the ASCII table)

For Chinese users, we recommend using `RESTRICT_URL`.

from text_cleaner.processor.misc import RESTRICT_URL, URL

URL.remove('点击 查看')
# '点击 查看'

# '点击 '

# '点击 查看'

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for text_cleaner, version 0.2.6
Filename, size File type Python version Upload date Hashes
Filename, size text_cleaner-0.2.6-py2.py3-none-any.whl (11.5 kB) File type Wheel Python version 2.7 Upload date Hashes View
Filename, size text_cleaner-0.2.6.tar.gz (14.5 kB) File type Source Python version None Upload date Hashes View

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring DigiCert DigiCert EV certificate Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page