A lib for text preprocessing
# Plane
[![Build Status](https://travis-ci.org/kemingy/Plane.svg?branch=master)](https://travis-ci.org/kemingy/Plane)
> **Plane** is a tool for shaping wood using muscle power to force the cutting blade over the wood surface.
> *from [Wikipedia](https://en.wikipedia.org/wiki/Plane_(tool))*
![plane(tool) from wikipedia](https://upload.wikimedia.org/wikipedia/commons/e/e3/Kanna2.gif)
This package extracts or replaces specific parts of text, such as URLs, email addresses, HTML tags, and telephone numbers. It can also remove all Unicode punctuation.
See the full [documentation](https://kemingy.github.io/Plane/).
## Install
Python **3.x** only.
### pip
```sh
pip install plane
```
### Install from source
```sh
python setup.py install
```
## Features
* built-in regex patterns: `plane.pattern.Regex`
* custom regex patterns
* pattern combination
* extract, replace patterns
* segment sentence
* chain function calls: `plane.plane.Plane`
* pipeline: `plane.Pipeline`
## Usage
### Quick start
Use regex to `extract` or `replace`:
```python
from plane import EMAIL, extract, replace

text = 'fake@no.com & fakefake@nothing.com'

emails = extract(text, EMAIL)  # returns a generator object
for e in emails:
    print(e)
# Token(name='Email', value='fake@no.com', start=0, end=11)
# Token(name='Email', value='fakefake@nothing.com', start=14, end=34)

print(EMAIL)
# Regex(name='Email', pattern='([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+)', repl='<Email>')

# replace(text, Regex, repl): if repl is not provided, Regex.repl is used
replace(text, EMAIL)
# '<Email> & <Email>'

replace(text, EMAIL, '')
# ' & '
```
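To show how `extract` and `replace` relate to the standard library, here is a minimal sketch built on `re`. It is not Plane's actual implementation: the helper names `sketch_extract` and `sketch_replace` are hypothetical, and the `Regex` and `Token` tuples are re-declared locally only to mirror the fields printed above.

```python
import re
from collections import namedtuple

# Local stand-ins that mirror the fields shown in the quick start output.
Regex = namedtuple('Regex', ['name', 'pattern', 'repl'])
Token = namedtuple('Token', ['name', 'value', 'start', 'end'])

EMAIL = Regex('Email', r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+)', '<Email>')


def sketch_extract(text, regex):
    """Lazily yield one Token per match (hypothetical helper)."""
    for m in re.finditer(regex.pattern, text):
        yield Token(regex.name, m.group(), m.start(), m.end())


def sketch_replace(text, regex, repl=None):
    """Substitute every match; fall back to regex.repl when repl is None."""
    return re.sub(regex.pattern, regex.repl if repl is None else repl, text)


text = 'fake@no.com & fakefake@nothing.com'
tokens = list(sketch_extract(text, EMAIL))
replaced = sketch_replace(text, EMAIL)
blanked = sketch_replace(text, EMAIL, '')
```

Because the extractor is a generator, matches are produced only as they are consumed, which is why the quick start iterates over the result instead of indexing it.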
### pattern
`Regex` is a namedtuple with 3 items:
* `name`
* `pattern`: Regular Expression
* `repl`: replacement tag that is substituted for each match when calling `replace`
```python
# create new pattern
from plane import build_new_regex
custom_regex = build_new_regex('my_regex', r'(\d{4})', '<my-replacement-tag>')
```
You can also build a new pattern by combining it with the default patterns.
```python
from plane import extract, build_new_regex, CHINESE_WORDS

ASCII = build_new_regex('ascii', r'[a-zA-Z0-9]+', ' ')
WORDS = ASCII + CHINESE_WORDS  # combine two patterns into one

print(WORDS)
# Regex(name='ascii_Chinese_words', pattern='[a-zA-Z0-9]+|[\\U00004E00-\\U00009FFF\\U00003400-\\U00004DBF\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B73F\\U0002B740-\\U0002B81F\\U0002B820-\\U0002CEAF\\U0002CEB0-\\U0002EBEF]+', repl=' ')

text = "自然语言处理太难了!who can help me? (╯▔🔺▔)╯"
print(' '.join(t.value for t in extract(text, WORDS)))
# 自然语言处理太难了 who can help me
```
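The printed `WORDS` value suggests that `+` joins the two names with `_` and the two patterns with `|`. A minimal sketch of that idea follows. It is an assumption about the behaviour, not Plane's source: the `Regex` class is a local stand-in, `HAN` is a simplified hypothetical pattern covering only the basic CJK Unified Ideographs block, and keeping the left operand's `repl` is a guess.

```python
import re
from collections import namedtuple


class Regex(namedtuple('Regex', ['name', 'pattern', 'repl'])):
    """Local stand-in for a pattern tuple; `+` joins two patterns as alternatives."""
    __slots__ = ()

    def __add__(self, other):
        # Assumption: names are joined with '_', patterns with '|',
        # and the left operand's replacement tag is kept.
        return Regex(
            f'{self.name}_{other.name}',
            f'{self.pattern}|{other.pattern}',
            self.repl,
        )


ASCII = Regex('ascii', r'[a-zA-Z0-9]+', ' ')
HAN = Regex('han', r'[\u4e00-\u9fff]+', ' ')  # simplified, hypothetical pattern

combined = ASCII + HAN
words = re.findall(combined.pattern, '自然语言处理太难了!who can help me?')
```

Since the alternatives are tried left to right at each position, the order of the operands only matters when the two patterns can match at the same place.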
Default patterns ([details](https://github.com/Momingcoder/Plane/blob/master/plane/pattern.py)):
* `URL`: only ASCII
* `EMAIL`: local-part@domain
* `TELEPHONE`: like xxx-xxxx-xxxx
* `SPACE`: ` `, `\t`, `\n`, `\r`, `\f`, `\v`
* `HTML`: HTML tags, Script part and CSS part
* `ASCII_WORD`: English words, numbers, `<tag>`, and so on
* `CHINESE`: all Chinese characters (Han characters and punctuation only)
* `CJK`: all Chinese, Japanese, and Korean (CJK) characters and punctuation
Regex name | Default replacement
-----------|---------
URL | `'<URL>'`
EMAIL | `'<Email>'`
TELEPHONE | `'<Telephone>'`
SPACE | `' '`
HTML | `' '`
ASCII_WORD | `' '`
CHINESE | `' '`
CJK | `' '`
### segment
`segment` splits a sentence into tokens. Runs of English letters and digits such as 'PS4' are kept intact, while other characters, such as the Chinese '中文', are split into single characters: `['中', '文']`.
```python
from plane import segment
segment('你看起来guaiguai的。<EOS>')
# ['你', '看', '起', '来', 'guaiguai', '的', '。', '<EOS>']
```
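The splitting rule described above can be approximated with a single regular expression. This is only a sketch under stated assumptions, not the library's code: `sketch_segment` is a hypothetical name, and treating `<...>` tags as one token is inferred from the `<EOS>` example.

```python
import re

# Assumption: three alternatives are tried in order at each position:
#   1. an angle-bracket tag such as <EOS>
#   2. a run of ASCII letters and digits such as PS4
#   3. any other single non-space character
_SEGMENT = re.compile(r'<[^<>\s]+>|[A-Za-z0-9]+|\S')


def sketch_segment(text):
    """Return the tokens of `text` as a list (hypothetical helper)."""
    return _SEGMENT.findall(text)


tokens = sketch_segment('你看起来guaiguai的。<EOS>')
mixed = sketch_segment('PS4 中文')
```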
### punctuation
`remove_punctuation` replaces every Unicode punctuation character with `' '`, or with the string passed as the `repl` parameter.
**Attention**: '+', '^', '$', '~' and some other characters are not punctuation.
```python
from plane import remove_punctuation

text = 'Hello world!'
remove_punctuation(text)
# 'Hello world '

# replace punctuation with a special string
remove_punctuation(text, '<P>')
# 'Hello world<P>'
```
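The note about '+', '^', '$' and '~' is consistent with Unicode general categories, where these characters are classified as symbols (`Sm`, `Sk`, `Sc`) and not as punctuation (`P*`). The sketch below illustrates that distinction with the standard `unicodedata` module. Whether Plane itself relies on `unicodedata` is an assumption, and `sketch_remove_punctuation` is a hypothetical name.

```python
import unicodedata


def is_punctuation(ch):
    """True when the character's Unicode general category starts with 'P'."""
    return unicodedata.category(ch).startswith('P')


def sketch_remove_punctuation(text, repl=' '):
    """Replace each punctuation character with `repl` (hypothetical helper)."""
    return ''.join(repl if is_punctuation(ch) else ch for ch in text)


spaced = sketch_remove_punctuation('Hello world!')
tagged = sketch_remove_punctuation('Hello world!', '<P>')
symbols = sketch_remove_punctuation('1+1=2 ~ $5^2')  # symbols are left untouched
```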
### Chain function
`Plane` provides `extract`, `replace`, `segment` and `remove_punctuation` as methods that can be chained. Since `segment` returns a list, it can only be called at the end of the chain.
`Plane.text` holds the processed text and `Plane.values` holds the extracted tokens.
```python
from plane import Plane
from plane.pattern import EMAIL

p = Plane()

# update() resets Plane.text and Plane.values
p.update('My email is my@email.com.').replace(EMAIL, '').text
# 'My email is .'

p.update('My email is my@email.com.').replace(EMAIL).segment()
# ['My', 'email', 'is', '<Email>', '.']

p.update('My email is my@email.com.').extract(EMAIL).values
# [Token(name='Email', value='my@email.com', start=12, end=24)]
```
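Chaining of this kind usually works by having each method store its result on the object and return `self`. The following is a minimal sketch of that pattern using plain `re`, not the real `Plane` class: `ChainSketch` and `EMAIL_PATTERN` are hypothetical names, and `values` holds plain strings here for brevity.

```python
import re

EMAIL_PATTERN = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+'


class ChainSketch:
    """Hypothetical chainable wrapper: each step mutates state and returns self."""

    def __init__(self):
        self.text = ''
        self.values = []

    def update(self, text):
        # Start a new chain: reset both the text and the extracted values.
        self.text = text
        self.values = []
        return self

    def replace(self, pattern, repl):
        self.text = re.sub(pattern, repl, self.text)
        return self

    def extract(self, pattern):
        self.values.extend(m.group() for m in re.finditer(pattern, self.text))
        return self

    def segment(self):
        # Returns a list instead of self, so it has to be the last call.
        return re.findall(r'<[^<>\s]+>|[A-Za-z0-9]+|\S', self.text)


p = ChainSketch()
removed = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '').text
tokens = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '<Email>').segment()
found = p.update('My email is my@email.com.').extract(EMAIL_PATTERN).values
```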
### Pipeline
Alternatively, use `Pipeline` to compose the processing steps once and apply them to many texts.
`segment` and `extract` can only appear as the last step.
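The ordering restriction follows from how a pipeline passes data along: each step receives the previous step's output, so a step that returns something other than a string leaves nothing for a text-processing step to work on. The sketch below shows the idea with a hypothetical `PipelineSketch` class and helper `substitute`. It is not Plane's implementation, and the URL pattern is a simplified stand-in.

```python
import re


def substitute(text, pattern, repl):
    """Replace every match of `pattern` in `text` with `repl`."""
    return re.sub(pattern, repl, text)


class PipelineSketch:
    """Hypothetical pipeline: stores steps, then applies them in order."""

    def __init__(self):
        self.steps = []

    def add(self, func, *args, **kwargs):
        # The text is always passed as the first positional argument.
        self.steps.append((func, args, kwargs))

    def __call__(self, text):
        for func, args, kwargs in self.steps:
            text = func(text, *args, **kwargs)
        return text


pipe = PipelineSketch()
pipe.add(substitute, r'https?://\S+', '')  # simplified URL pattern (assumption)
pipe.add(str.split)                        # returns a list, so it must come last
result = pipe('http://www.guokr.com is online.')
```

Note that this sketch splits on whitespace only, so the final period stays attached to `online.`, unlike the library's `segment`.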
```python
from plane import Pipeline, replace, segment
from plane.pattern import URL

pipe = Pipeline()
pipe.add(replace, URL, '')  # strip URLs first
pipe.add(segment)           # segment returns a list, so it comes last

pipe('http://www.guokr.com is online.')
# ['is', 'online', '.']
```