A lib for text preprocessing
Plane
=====
|Build Status|
| **Plane** is a tool for shaping wood using muscle power to force the cutting blade over the wood surface.
| -- from `Wikipedia <https://en.wikipedia.org/wiki/Plane_(tool)>`_.
.. figure:: https://upload.wikimedia.org/wikipedia/commons/e/e3/Kanna2.gif
   :alt: plane(tool) from wikipedia
This package extracts or replaces specific parts of text, such as URLs,
email addresses, HTML tags and telephone numbers. It can also remove all
Unicode punctuation.
Install
-------
Python **3.x** only.
pip
~~~
.. code:: sh

    pip install plane
Install from source
~~~~~~~~~~~~~~~~~~~
.. code:: sh

    python setup.py install
Features
---------
* built-in regex patterns: :class:`plane.pattern.Regex`
* custom regex patterns
* extract and replace patterns
* segment sentences
* chain function calls: :class:`plane.plane.Plane`
Why do we need this?
------------------------
In NLP (natural language processing) tasks, cleaning text data can be one of the most boring parts. `Plane` is built for this:

* extract content from web page source
* detect URLs, emails and telephone numbers
* split sentences composed of Chinese and English
* remove all punctuation to get plain text
Usage
---------
Only Python 3 is supported.
`extract` and `replace`
~~~~~~~~~~~~~~~~~~~~~~~~~~
::

    from plane import EMAIL, extract, replace

    text = 'fake@no.com & fakefake@nothing.com'

    emails = extract(text, EMAIL)  # this returns a generator object
    for e in emails:
        print(e)
    # >>> Token(name='Email', value='fake@no.com', start=0, end=11)
    # >>> Token(name='Email', value='fakefake@nothing.com', start=14, end=34)

    print(EMAIL)
    # >>> Regex(name='Email', pattern='([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+)', repl='<Email>')

    # replace(text, Regex, repl): if repl is not provided, Regex.repl is used
    replace(text, EMAIL)
    # >>> '<Email> & <Email>'
    replace(text, EMAIL, '')
    # >>> ' & '
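
The behaviour shown above can be approximated with the standard `re` module. This is only a minimal sketch and not the library's own code: `Regex`, `Token`, `EMAIL_SKETCH`, `extract_sketch` and `replace_sketch` are hypothetical stand-ins that copy the shapes printed in the example, and the pattern is taken from the `print(EMAIL)` output.

```python
import re
from collections import namedtuple

# Hypothetical stand-ins mirroring the shapes printed in the example above.
Regex = namedtuple('Regex', ['name', 'pattern', 'repl'])
Token = namedtuple('Token', ['name', 'value', 'start', 'end'])

EMAIL_SKETCH = Regex(
    name='Email',
    pattern=r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+)',
    repl='<Email>',
)


def extract_sketch(text, regex):
    """Lazily yield one Token per non-overlapping match."""
    for match in re.finditer(regex.pattern, text):
        yield Token(regex.name, match.group(), match.start(), match.end())


def replace_sketch(text, regex, repl=None):
    """Substitute every match with repl, falling back to regex.repl."""
    if repl is None:
        repl = regex.repl
    return re.sub(regex.pattern, repl, text)


text = 'fake@no.com & fakefake@nothing.com'
tokens = list(extract_sketch(text, EMAIL_SKETCH))
masked = replace_sketch(text, EMAIL_SKETCH)
stripped = replace_sketch(text, EMAIL_SKETCH, '')
```

Because `extract_sketch` is a generator, nothing is matched until it is iterated, which is why the sketch wraps it in `list()` before inspecting the tokens.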
`segment`
~~~~~~~~~~~~~~~~
`segment` splits a sentence into tokens. English words and numbers such as 'PS4' are kept whole, while other characters, such as the Chinese '中文', are split into single characters: `['中', '文']`.
::

    from plane import segment

    segment('你看起来guaiguai的。<EOS>')
    # >>> ['你', '看', '起', '来', 'guaiguai', '的', '。', '<EOS>']
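
One way to get this kind of mixed segmentation is a single regex with ordered alternatives. The sketch below is an assumption about how such a splitter could look, not the library's implementation: `TOKEN_SKETCH` and `segment_sketch` are hypothetical names, and the rules (keep angle-bracket tags and ASCII letter/digit runs whole, emit every other non-space character on its own, drop whitespace) are chosen only to reproduce the example above.

```python
import re

# Assumed token rules, tried in order:
#   1. an angle-bracket tag such as <EOS>
#   2. a run of ASCII letters and digits such as PS4
#   3. any other single non-space character (e.g. one Chinese character)
TOKEN_SKETCH = re.compile(r'<[A-Za-z0-9_]+>|[A-Za-z0-9]+|\S')


def segment_sketch(text):
    """Return the tokens found in text; whitespace is dropped."""
    return TOKEN_SKETCH.findall(text)


tokens = segment_sketch('你看起来guaiguai的。<EOS>')
mixed = segment_sketch('我有PS4 和 Switch')
```

The order of the alternatives matters: the tag rule must come before the single-character rule, otherwise `<EOS>` would be broken up at the `<`.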
replace all punctuation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`remove_punctuation` replaces every Unicode punctuation character with `' '`, or with the string you pass as the `repl` parameter.
**Attention**: '+', '^', '$', '~' and some other characters are not punctuation.
::

    from plane import remove_punctuation

    text = 'Hello world!'
    remove_punctuation(text)
    # >>> 'Hello world '

    # replace punctuation with special string
    remove_punctuation(text, '<P>')
    # >>> 'Hello world<P>'
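
The note about '+', '^', '$' and '~' follows from how Unicode classifies characters: those four belong to the symbol categories (`Sm`, `Sk`, `Sc`), not to the punctuation categories, which all start with `P`. The sketch below illustrates that idea with the standard `unicodedata` module. `remove_punctuation_sketch` is a hypothetical helper and is assumed, not confirmed, to resemble what the library does.

```python
import unicodedata


def remove_punctuation_sketch(text, repl=' '):
    """Replace each character whose Unicode category starts with 'P'."""
    return ''.join(
        repl if unicodedata.category(ch).startswith('P') else ch
        for ch in text
    )


plain = remove_punctuation_sketch('Hello world!')
tagged = remove_punctuation_sketch('Hello world!', '<P>')
# '+', '=', '~', '$' and '^' are symbols, so they are left untouched.
symbols = remove_punctuation_sketch('1+1=2 ~ $5^2')
```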
chain function calls
~~~~~~~~~~~~~~~~~~~~~~~~
`Plane` provides `extract`, `replace`, `segment` and `remove_punctuation` as methods that can be chained. Because `segment` returns a list, it can only be called at the end of a chain.
`Plane.text` holds the processed text and `Plane.values` holds the extracted values.
::

    from plane import Plane
    from plane.pattern import EMAIL

    p = Plane()
    # update() initializes Plane.text and Plane.values
    p.update('My email is my@email.com.').replace(EMAIL, '').text
    # >>> 'My email is .'
    p.update('My email is my@email.com.').replace(EMAIL).segment()
    # >>> ['My', 'email', 'is', '<Email>', '.']
    p.update('My email is my@email.com.').extract(EMAIL).values
    # >>> [Token(name='Email', value='my@email.com', start=12, end=24)]
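
Chaining of this kind usually relies on each method returning the object itself. The following is a minimal sketch of that pattern under stated assumptions, not the library's `Plane` class: `ChainSketch` and `EMAIL_PATTERN` are hypothetical, the methods take raw pattern strings instead of `Regex` objects, and `values` stores plain strings instead of `Token` objects.

```python
import re

# Assumed email pattern, copied from the Regex printed earlier in this README.
EMAIL_PATTERN = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+'


class ChainSketch:
    """Hypothetical fluent wrapper around a piece of text."""

    def __init__(self):
        self.text = ''
        self.values = []

    def update(self, text):
        # Reset both pieces of state so one instance can be reused.
        self.text = text
        self.values = []
        return self

    def replace(self, pattern, repl):
        self.text = re.sub(pattern, repl, self.text)
        return self

    def extract(self, pattern):
        self.values.extend(m.group() for m in re.finditer(pattern, self.text))
        return self

    def segment(self):
        # Returns a list rather than self, so it has to end the chain.
        return re.findall(r'<[A-Za-z0-9_]+>|[A-Za-z0-9]+|\S', self.text)


p = ChainSketch()
cleaned = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '').text
words = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '<Email>').segment()
found = p.update('My email is my@email.com.').extract(EMAIL_PATTERN).values
```

Returning `self` from `update`, `replace` and `extract` is what makes the calls composable; `segment` breaks the pattern on purpose because its result is a list.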
.. |Build Status| image:: https://travis-ci.org/Momingcoder/Plane.svg?branch=master
   :target: https://travis-ci.org/Momingcoder/Plane