purify text for NLP.

# text-cleaner, simple text preprocessing tool

## Introduction

* Supports Python 2.7, 3.3, 3.4, 3.5.
* Simple interfaces.
* Easy to extend.

## Install

```
pip install text-cleaner
```

**WARNING FOR PYTHON 2.7 USERS**: Only UCS-4 builds (`--enable-unicode=ucs4`) are supported. UCS-2 builds ([see this](http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to-not-get-a-bad-character-range-error)) are **NOT SUPPORTED** in the latest version.
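If you are unsure which kind of build you are running, the interpreter can tell you. The snippet below is a small sketch that relies only on the standard `sys.maxunicode` attribute; the helper name `is_wide_unicode_build` is made up for illustration and is not part of text-cleaner.

```python
import sys


def is_wide_unicode_build():
    """Return True when the interpreter handles code points above U+FFFF natively.

    Narrow (UCS-2) builds of Python 2 report sys.maxunicode == 0xFFFF,
    while wide (UCS-4) builds and Python 3.3+ report 0x10FFFF.
    """
    return sys.maxunicode > 0xFFFF


if is_wide_unicode_build():
    print('wide build: character ranges above U+FFFF are usable')
else:
    print('narrow (UCS-2) build: rebuild Python with --enable-unicode=ucs4')
```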

## Usage

```python
from text_cleaner import remove, keep

from text_cleaner.processor.common import ASCII
from text_cleaner.processor.chinese import CHINESE, CHINESE_SYMBOLS_AND_PUNCTUATION
from text_cleaner.processor.misc import RESTRICT_URL

# remove url and ascii characters.
# return: u'点击 查看 '
remove(
    '点击http://t.cn/RtU0mZ1 查看,123456,test',
    [RESTRICT_URL, ASCII],
)

# remove only Chinese punctuation.
# return: u'点击 http://t.cn/RtU0mZ1 查看,123456,test '
remove(
    '点击:http://t.cn/RtU0mZ1, 查看,123456,test。!?',
    [CHINESE_SYMBOLS_AND_PUNCTUATION],
)

# keep chinese characters and url.
# return: u'点击 http://t.cn/RtU0mZ1 查看'
keep(
    '点击http://t.cn/RtU0mZ1 查看,123456,test',
    [CHINESE, RESTRICT_URL],
)

# use processor directly.
# return: u'点击 查看'
RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1 查看')
# return: u'点击<URL> 查看'
RESTRICT_URL.replace('<URL>').remove('点击http://t.cn/RtU0mZ1 查看')
```

## Interfaces

*text_cleaner.remove(text, processors)*:

* *text*: `str` or `bytes` (`unicode` or `str` for Python 2).
* *processors*: iterable of processors. *remove* calls the `remove` method of each processor on *text*.

*text_cleaner.keep(text, processors)*:

* same as *remove*, but invokes the `keep` method of the processors instead.

## Processors

*DEFAULT\_REPLACE\_TEXT*: `' '`, single space.

*RegexProcessor(regex, replace\_text=DEFAULT\_REPLACE\_TEXT)*

* construct a regex processor for *regex*; text that gets removed is replaced with *replace\_text*.
* *replace(self, new\_replace\_text)*: create a new processor with *replace\_text* set to *new\_replace\_text*.
* *remove(self, text)*: remove all occurrences of *regex* from *text*.
* *keep(self, text)*: keep only the occurrences of *regex*, remove all unmatched components from *text*.
* *verify(self, text)*: return *True* if *text* matches *regex*, otherwise *False*.
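To make the method list above concrete, here is a minimal sketch of a processor with the same four methods, written with the standard `re` module only. `SketchRegexProcessor` is a hypothetical stand-in, not the library's implementation: merging adjacent matches into one run, joining kept runs with *replace\_text*, and treating `verify` as a whole-string match are all assumptions made for this example.

```python
import re

DEFAULT_REPLACE_TEXT = ' '


class SketchRegexProcessor(object):
    """Hypothetical stand-in that mimics the documented method names."""

    def __init__(self, regex, replace_text=DEFAULT_REPLACE_TEXT):
        self.regex = regex
        self.replace_text = replace_text
        # Assumption: adjacent matches are merged into a single run.
        self._runs = re.compile(r'(?:{0})+'.format(regex))
        # Assumption: verify() requires the whole text to match.
        self._whole = re.compile(r'(?:{0})+\Z'.format(regex))

    def replace(self, new_replace_text):
        # Return a new processor; the current one is left unchanged.
        return SketchRegexProcessor(self.regex, new_replace_text)

    def remove(self, text):
        # Substitute every matched run with replace_text.
        return self._runs.sub(self.replace_text, text)

    def keep(self, text):
        # Keep only the matched runs, joined by replace_text.
        return self.replace_text.join(
            m.group(0) for m in self._runs.finditer(text)
        )

    def verify(self, text):
        return self._whole.match(text) is not None


digits = SketchRegexProcessor(r'[0-9]')
print(digits.remove('a1b22c'))               # a b c
print(digits.keep('a1b22c'))                 # 1 22
print(digits.replace('#').remove('a1b22c'))  # a#b#c
print(digits.verify('123'), digits.verify('12a'))  # True False
```

Because `replace` returns a new object, a shared processor such as `digits` can be reused with different replacement texts without side effects.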

*UnicodeRange(begin, end)*:

* *begin*: *int*, the first code point of the Unicode range.
* *end*: *int*, the last code point of the Unicode range.

*UnicodeRangeProcessor(ranges, replace\_text=DEFAULT\_REPLACE\_TEXT)*

* subclass of *RegexProcessor*.
* *ranges*: iterable of instances of *UnicodeRange*.
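One way to picture how a list of code point ranges can become a regex is to turn each range into a character-class entry. The sketch below assumes Python 3 and inclusive range ends; `SketchUnicodeRange` and `ranges_to_char_class` are hypothetical names for illustration, and the library may build its patterns differently.

```python
import re
from collections import namedtuple

# Hypothetical mirror of UnicodeRange(begin, end); both ends are inclusive here.
SketchUnicodeRange = namedtuple('SketchUnicodeRange', ['begin', 'end'])


def ranges_to_char_class(ranges):
    """Build a regex character class, e.g. '[0-9]', from code point ranges."""
    parts = []
    for r in ranges:
        # Escape both endpoints so characters like ']' or '^' stay literal.
        parts.append('{0}-{1}'.format(
            re.escape(chr(r.begin)), re.escape(chr(r.end))
        ))
    return '[' + ''.join(parts) + ']'


char_class = ranges_to_char_class([
    SketchUnicodeRange(0x4E00, 0x9FFF),  # CJK Unified Ideographs block
    SketchUnicodeRange(0x30, 0x39),      # ASCII digits
])
cjk_or_digit = re.compile(char_class + '+')

cleaned = cjk_or_digit.sub(' ', 'abc点击123def')
print(cleaned)  # abc def
```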

## Built-in Processors

The following processors are defined with *UnicodeRange* and regexes. Read the source code if you are unsure about what's going on.

`text_cleaner.processor.common`, for common usage:

* `ALPHA`
* `DIGIT`
* `SYMBOLS_AND_PUNCTUATION`
* `ASCII`
* `ALPHA_EXTENSION`
* `DIGIT_EXTENSION`
* `SYMBOLS_AND_PUNCTUATION_EXTENSION`
* `GENERAL_PUNCTUATION`

`text_cleaner.processor.misc`, miscellaneous processors:

* `URL`
* `RESTRICT_URL`
* `ESCAPED_WHITESPACE`
* `WECHAT_EMOJI_EN`
* `WECHAT_EMOJI_ZHCN`
* `WECHAT_EMOJI`

`text_cleaner.processor.chinese`, Chinese processing:

* `CHINESE_CHARACTER`: only common characters.
* `CHINESE`: common characters + symbols and punctuation.
* `CHINESE_ALL`: all CJK characters.
* `CHINESE_EXTENSION`
* `CHINESE_COMPATIBILITY`
* `CHINESE_SYMBOLS_AND_PUNCTUATION`

### URL vs. RESTRICT_URL

Deciding where a URL ends is a complex problem.
We provide two choices for our users.

* `URL`: a URL extends until the next whitespace character.
* `RESTRICT_URL`: a URL extends only over non-whitespace ASCII characters (`[!-~]` in the ASCII table), so it ends at the first character outside that range.

For Chinese users, we recommend using `RESTRICT_URL`.

```python
from text_cleaner.processor.misc import RESTRICT_URL, URL

URL.remove('点击http://t.cn/RtU0mZ1 查看')
# '点击 查看'

URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 '

RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 查看'
```
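The difference between the two definitions can be reproduced with two plain regexes. The patterns below are hypothetical approximations written for this example, not the regexes shipped in `text_cleaner.processor.misc`; they only illustrate why the stricter character class stops before Chinese text that follows a URL without a space.

```python
import re

# Hypothetical patterns, not the library's actual regexes.
LOOSE_URL = re.compile(r'https?://\S+')       # runs until the next whitespace
STRICT_URL = re.compile(r'https?://[!-~]+')   # runs only over non-whitespace ASCII

text = '点击http://t.cn/RtU0mZ1查看'

# \S also matches Chinese characters, so the trailing word is swallowed.
loose = LOOSE_URL.sub(' ', text)
# [!-~] stops at the first non-ASCII character, so the trailing word survives.
strict = STRICT_URL.sub(' ', text)

print(repr(loose))   # '点击 '
print(repr(strict))  # '点击 查看'
```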
