purify text for NLP.
# text-cleaner, simple text preprocessing tool
## Introduction
* Supports Python 2.7, 3.3, 3.4, and 3.5.
* Simple interfaces.
* Easy to extend.
## Install
```
pip install text-cleaner
```
**WARNING FOR PYTHON 2.7 USERS**: Only UCS-4 builds (`--enable-unicode=ucs4`) are supported. UCS-2 builds ([see this](http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to-not-get-a-bad-character-range-error)) are **NOT SUPPORTED** in the latest version.
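If you are not sure which kind of build you are running, the standard library's `sys.maxunicode` can tell you. The snippet below is a minimal sketch; the helper name `is_wide_unicode_build` is made up for illustration and is not part of text-cleaner.

```python
import sys


def is_wide_unicode_build():
    """Return True if the interpreter stores code points above U+FFFF directly."""
    # Narrow (UCS-2) Python 2 builds report 0xFFFF here; wide (UCS-4) builds
    # and every Python 3.3+ interpreter report 0x10FFFF.
    return sys.maxunicode > 0xFFFF


if is_wide_unicode_build():
    print('wide build: OK')
else:
    print('narrow (UCS-2) build: not supported by the latest text-cleaner')
```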
## Usage
```python
from text_cleaner import remove, keep
from text_cleaner.processor.common import ASCII
from text_cleaner.processor.chinese import CHINESE, CHINESE_SYMBOLS_AND_PUNCTUATION
from text_cleaner.processor.misc import RESTRICT_URL
# remove url and ascii characters.
# return: u'点击 查看 '
remove(
'点击http://t.cn/RtU0mZ1 查看,123456,test',
[RESTRICT_URL, ASCII],
)
# remove only Chinese punctuation.
# return: u'点击 http://t.cn/RtU0mZ1 查看,123456,test '
remove(
'点击:http://t.cn/RtU0mZ1, 查看,123456,test。!?',
[CHINESE_SYMBOLS_AND_PUNCTUATION],
)
# keep chinese characters and url.
# return: u'点击 http://t.cn/RtU0mZ1 查看'
keep(
'点击http://t.cn/RtU0mZ1 查看,123456,test',
[CHINESE, RESTRICT_URL],
)
# use processor directly.
# return: u'点击 查看'
RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1 查看')
# return: u'点击<URL> 查看'
RESTRICT_URL.replace('<URL>').remove('点击http://t.cn/RtU0mZ1 查看')
```
## Interfaces
*text_cleaner.remove(text, processors)*:
* *text*: `str` or `bytes` (`unicode` or `str` for Python 2).
* *processors*: iterable of processors. *remove* invokes the `remove` method of each processor on *text*.
*text_cleaner.keep(text, processors)*:
* same as *remove*, but invokes the `keep` method of each processor instead.
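
To make the `remove` description concrete, here is a rough sketch of passing text through a list of processors in order. It assumes the processors are applied one after another, which the description above suggests but does not state; `_Sub` and `remove_sketch` are hypothetical stand-ins, not text-cleaner code. Only `remove` is sketched, because how `keep` combines several processors is not spelled out here.

```python
import re
from functools import reduce


class _Sub(object):
    """Tiny hypothetical processor: replaces regex matches with one space."""

    def __init__(self, pattern):
        self._regex = re.compile(pattern)

    def remove(self, text):
        return self._regex.sub(' ', text)


def remove_sketch(text, processors):
    # Feed the output of each processor's remove() into the next one.
    return reduce(lambda acc, p: p.remove(acc), processors, text)


url = _Sub(r'https?://\S+')
digits = _Sub(r'[0-9]+')
result = remove_sketch('see http://x.io now 42', [url, digits])
print(repr(result))
```

Each match is replaced by a space rather than deleted, so spaces already present in the input remain next to the inserted ones.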
## Processors
*DEFAULT\_REPLACE\_TEXT*: `' '`, single space.
*RegexProcessor(regex, replace\_text=DEFAULT\_REPLACE\_TEXT)*
* construct a regex processor for *regex*; removed components are replaced with *replace\_text*.
* *replace(self, new\_replace\_text)*: create a new processor with *replace\_text* set to *new\_replace\_text*.
* *remove(self, text)*: remove all occurrences of *regex* from *text*.
* *keep(self, text)*: keep only the occurrences of *regex* and remove all unmatched components from *text*.
* *verify(self, text)*: return *True* if *text* matches *regex*, otherwise return *False*.
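
The four methods can be pictured with a small stand-in built on the standard `re` module. This is only an illustration, not the library's implementation: the class name `SketchRegexProcessor` is invented, `verify` is assumed to require a whole-text match, and `keep` is assumed to replace each unmatched run with a single *replace\_text*. It uses `fullmatch`, so it needs Python 3.4 or later.

```python
import re


class SketchRegexProcessor(object):
    """Hypothetical stand-in, not text-cleaner's RegexProcessor."""

    def __init__(self, regex, replace_text=' '):
        self.regex = re.compile(regex)
        self.replace_text = replace_text

    def replace(self, new_replace_text):
        # Return a new processor instead of mutating this one.
        return SketchRegexProcessor(self.regex.pattern, new_replace_text)

    def remove(self, text):
        # Substitute every match with replace_text.
        return self.regex.sub(self.replace_text, text)

    def keep(self, text):
        # Keep the matches; each unmatched run between them becomes replace_text.
        pieces = []
        last = 0
        for match in self.regex.finditer(text):
            if match.start() > last:
                pieces.append(self.replace_text)
            pieces.append(match.group())
            last = match.end()
        if last < len(text):
            pieces.append(self.replace_text)
        return ''.join(pieces)

    def verify(self, text):
        # Assumption: True only when the whole text matches.
        return self.regex.fullmatch(text) is not None


digits = SketchRegexProcessor(r'[0-9]+')
print(repr(digits.remove('a12b3')))               # 'a b '
print(repr(digits.keep('a12b3')))                 # ' 12 3'
print(repr(digits.replace('#').remove('a12b3')))  # 'a#b#'
```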
*UnicodeRange(begin, end)*:
* *begin*: *int*, the start of the Unicode range.
* *end*: *int*, the end of the Unicode range.
*UnicodeRangeProcessor(ranges, replace\_text=DEFAULT\_REPLACE\_TEXT)*
* subclass of *RegexProcessor*.
* *ranges*: iterable of instances of *UnicodeRange*.
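
As a rough picture of how a list of ranges can become a regex, the sketch below turns `(begin, end)` pairs into a character class. It is not the library's code: `SketchRange` and `ranges_to_pattern` are invented names, both ends are assumed to be inclusive, and the trailing `+` (so that a run of matching characters is treated as one match) is also an assumption. It is written for Python 3, where `chr` accepts any code point.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for UnicodeRange.
SketchRange = namedtuple('SketchRange', ['begin', 'end'])


def ranges_to_pattern(ranges):
    """Build a regex that matches runs of characters in any of the ranges."""
    parts = []
    for r in ranges:
        # re.escape guards against code points that are special inside a
        # character class, such as ']' or '\\'.
        parts.append('{}-{}'.format(re.escape(chr(r.begin)),
                                    re.escape(chr(r.end))))
    return re.compile('[' + ''.join(parts) + ']+')


# U+4E00..U+9FFF is the CJK Unified Ideographs block.
cjk = ranges_to_pattern([SketchRange(0x4E00, 0x9FFF)])
print(repr(cjk.sub(' ', '点击abc查看')))  # ' abc '
```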
## Built-in Processors
The following processors are defined by *UnicodeRange* and regex. Read the source code if you are unsure about what's going on.
`text_cleaner.processor.common`, for common usage:
* `ALPHA`
* `DIGIT`
* `SYMBOLS_AND_PUNCTUATION`
* `ASCII`
* `ALPHA_EXTENSION`
* `DIGIT_EXTENSION`
* `SYMBOLS_AND_PUNCTUATION_EXTENSION`
* `GENERAL_PUNCTUATION`
`text_cleaner.processor.misc`, miscellaneous processors:
* `URL`
* `RESTRICT_URL`
* `ESCAPED_WHITESPACE`
* `WECHAT_EMOJI_EN`
* `WECHAT_EMOJI_ZHCN`
* `WECHAT_EMOJI`
`text_cleaner.processor.chinese`, Chinese processing:
* `CHINESE_CHARACTER`: only common characters.
* `CHINESE`: common characters + symbols and punctuation.
* `CHINESE_ALL`: all CJK characters.
* `CHINESE_EXTENSION`
* `CHINESE_COMPATIBILITY`
* `CHINESE_SYMBOLS_AND_PUNCTUATION`
### URL vs. RESTRICT_URL
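Deciding where a URL ends is a complex problem, so we provide two choices.
* `URL`: a URL runs until the next whitespace character.
* `RESTRICT_URL`: a URL consists only of non-whitespace ASCII characters (`[!-~]` in the ASCII table), so it ends at the first character outside that range.
For Chinese users, we recommend using `RESTRICT_URL`, because Chinese text often follows a URL without any whitespace in between.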
```python
from text_cleaner.processor.misc import RESTRICT_URL, URL
URL.remove('点击http://t.cn/RtU0mZ1 查看')
# '点击 查看'
URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 '
RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 查看'
```
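
The difference can be reproduced with two plain `re` patterns. These are hypothetical approximations written for this illustration; the library's real patterns may well differ (for example in which schemes they accept).

```python
import re

# Hypothetical approximations of the two strategies.
LOOSE_URL = re.compile(r'https?://\S+')       # runs until whitespace
STRICT_URL = re.compile(r'https?://[!-~]+')   # only non-whitespace ASCII

text = '点击http://t.cn/RtU0mZ1查看'

# \S also matches the Chinese characters, so they are swallowed by the URL.
loose = LOOSE_URL.sub(' ', text)    # '点击 '
# [!-~] stops at the first non-ASCII character, so the trailing text survives.
strict = STRICT_URL.sub(' ', text)  # '点击 查看'

print(repr(loose))
print(repr(strict))
```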