A lib for text preprocessing

Plane

Plane is a tool for shaping wood using muscle power to force the cutting blade over the wood surface.
from Wikipedia

This package extracts or replaces specific parts of text, such as URLs, email addresses, HTML tags and telephone numbers. It also supports punctuation normalization and removal.

See the full documentation.

Install

Python 3.x only.

pip

pip install plane

Install from source

python setup.py install

Features

no other dependencies
built-in regex patterns: plane.pattern.Regex
custom regex patterns
pattern combination
extract, replace patterns
sentence segmentation
chain function calls: plane.plane.Plane
pipeline: plane.Pipeline

Usage

Quick start

Use regex to extract or replace:

from plane import EMAIL, extract, replace
text = 'fake@no.com & fakefake@nothing.com'

emails = extract(text, EMAIL) # this returns a generator object
for e in emails:
    print(e)

>>> Token(name='Email', value='fake@no.com', start=0, end=11)
>>> Token(name='Email', value='fakefake@nothing.com', start=14, end=34)

print(EMAIL)

>>> Regex(name='Email', pattern='([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+)', repl='<Email>')

replace(text, EMAIL) # replace(text, Regex, repl): if repl is not provided, Regex.repl is used

>>> '<Email> & <Email>'

replace(text, EMAIL, '')

>>> ' & '
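
To make the behaviour above concrete, here is a minimal standard-library sketch of how functions like extract and replace could work. It is not the library's implementation: the Regex and Token namedtuples and both function bodies are assumptions reconstructed from the printed output above.

```python
import re
from collections import namedtuple

# Hypothetical stand-ins for the library's types, for illustration only.
Regex = namedtuple('Regex', ['name', 'pattern', 'repl'])
Token = namedtuple('Token', ['name', 'value', 'start', 'end'])

# Pattern copied from the printed EMAIL regex above.
EMAIL = Regex('Email', r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+)', '<Email>')


def extract(text, regex):
    """Lazily yield a Token for every match of regex.pattern in text."""
    for match in re.finditer(regex.pattern, text):
        yield Token(regex.name, match.group(), match.start(), match.end())


def replace(text, regex, repl=None):
    """Substitute every match; fall back to regex.repl when repl is None."""
    return re.sub(regex.pattern, regex.repl if repl is None else repl, text)


text = 'fake@no.com & fakefake@nothing.com'
tokens = list(extract(text, EMAIL))  # materialize the generator
masked = replace(text, EMAIL)        # uses EMAIL.repl
stripped = replace(text, EMAIL, '')  # explicit empty replacement
```

Because extract is a generator in this sketch, nothing is scanned until the result is iterated, which matches the generator comment in the example above.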

pattern

Regex is a namedtuple with 3 items:

name
pattern: Regular Expression
repl: replacement tag; it is substituted for each match when the replace function is used

# create new pattern
from plane import build_new_regex
custom_regex = build_new_regex('my_regex', r'(\d{4})', '<my-replacement-tag>')

You can also build a new pattern by combining default patterns.

Note: combine patterns this way only for language (character-range) patterns.

from plane import extract, build_new_regex, CHINESE_WORDS
ASCII = build_new_regex('ascii', r'[a-zA-Z0-9]+', ' ')
WORDS = ASCII + CHINESE_WORDS
print(WORDS)

>>> Regex(name='ascii_Chinese_words', pattern='[a-zA-Z0-9]+|[\\U00004E00-\\U00009FFF\\U00003400-\\U00004DBF\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B73F\\U0002B740-\\U0002B81F\\U0002B820-\\U0002CEAF\\U0002CEB0-\\U0002EBEF]+', repl=' ')

text = "自然语言处理太难了！who can help me? (╯▔🔺▔)╯"
print(' '.join([t.value for t in list(extract(text, WORDS))]))

>>> "自然语言处理太难了 who can help me"

from plane import CHINESE, ENGLISH, NUMBER
CN_EN_NUM = sum([CHINESE, ENGLISH, NUMBER])
text = "佛是虚名，道亦妄立。एवं मया श्रुतम्। 1999 is not the end of the world. "
print(' '.join([t.value for t in extract(text, CN_EN_NUM)]))

>>> "佛是虚名，道亦妄立。 1999 is not the end of the world."
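
The combination shown above (names joined with '_', patterns joined with '|') can be sketched with a namedtuple subclass. This is an illustration under assumptions, not the library's code: the class below, the HAN pattern, and the choice to keep the left-hand repl are all hypothetical. The __radd__ method is what lets sum() work, since sum() starts from the integer 0.

```python
import re
from collections import namedtuple


class Regex(namedtuple('Regex', ['name', 'pattern', 'repl'])):
    """Hypothetical combinable pattern; not the library's implementation."""
    __slots__ = ()

    def __add__(self, other):
        # Join names with '_' and patterns with '|'.
        # Keeping the left-hand repl is an assumption of this sketch.
        return Regex(
            f'{self.name}_{other.name}',
            f'{self.pattern}|{other.pattern}',
            self.repl,
        )

    def __radd__(self, other):
        # sum() evaluates 0 + first_item, so return the pattern unchanged.
        if other == 0:
            return self
        return NotImplemented


ASCII = Regex('ascii', r'[a-zA-Z0-9]+', ' ')
HAN = Regex('han', r'[\u4e00-\u9fff]+', ' ')  # simplified Han range, for illustration

WORDS = ASCII + HAN
words = re.findall(WORDS.pattern, '自然语言处理太难了！who can help me?')
```

Alternation with '|' only behaves predictably when each operand is a simple character-range pattern, which is consistent with the note above about language ranges.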

Default Regex: Details

URL: only ASCII
EMAIL: local-part@domain
TELEPHONE: like xxx-xxxx-xxxx
SPACE: ' ', \t, \n, \r, \f, \v
HTML: HTML tags, Script part and CSS part
ASCII_WORD: English word, numbers, <tag> and so on.
CHINESE: all Chinese characters (Han characters and punctuation only)
CJK: all Chinese, Japanese and Korean (CJK) characters and punctuation
THAI: all Thai characters and punctuation
VIETNAMESE: all Vietnamese characters and punctuation
ENGLISH: all English characters and punctuation
NUMBER: 0-9

Regex name	Default replacement
URL	`'<URL>'`
EMAIL	`'<Email>'`
TELEPHONE	`'<Telephone>'`
SPACE	`' '`
HTML	`' '`
ASCII_WORD	`' '`
CHINESE	`' '`
CJK	`' '`

segment

segment splits a sentence into tokens. English words and numbers such as 'PS4' are kept whole, while other text such as the Chinese '中文' is split into single characters: ['中', '文'].

from plane import segment
segment('你看起来guaiguai的。<EOS>')
>>> ['你', '看', '起', '来', 'guaiguai', '的', '。', '<EOS>']
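
As a rough illustration of this splitting rule, the sketch below uses a single regular expression from the standard library. The token shapes (a <tag> marker, a run of ASCII letters and digits, or any single non-space character) are assumptions chosen to reproduce the example above; the library's own rules may differ.

```python
import re

# Assumed token shapes, tried in order:
#   1. a tag such as <EOS>
#   2. a run of ASCII letters and digits such as PS4
#   3. any other single non-space character (splits CJK text per character)
TOKEN = re.compile(r'<[A-Za-z]+>|[A-Za-z0-9]+|\S')


def segment_sketch(text):
    """Return the tokens of text as a list; whitespace is dropped."""
    return TOKEN.findall(text)


tokens = segment_sketch('你看起来guaiguai的。<EOS>')
```

The order of the alternatives matters: the tag alternative must come before the single-character fallback, otherwise '<' would be emitted on its own.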

punctuation

punc.remove replaces every Unicode punctuation character with ' ', or with the string passed as the repl parameter. punc.normalize converts some Unicode punctuation characters to their English (ASCII) equivalents.

Note: '+', '^', '$', '~' and some other characters are not treated as punctuation.

from plane import punc

text = 'Hello world!'
punc.remove(text)

>>> 'Hello world '

# replace punctuation with special string
punc.remove(text, '<P>')

>>> 'Hello world<P>'

# normalize punctuations
punc.normalize('你读过那本《边城》吗？什么编程？！人生苦短，我用 Python。')

>>> '你读过那本(边城)吗?什么编程?!人生苦短,我用 Python.'
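
One way to picture both operations is with Unicode general categories and a translation table. The sketch below is an assumption-laden illustration, not the library's code: it treats every character whose category starts with 'P' as punctuation, and the mapping table covers only the characters in the example above.

```python
import unicodedata


def remove_punc(text, repl=' '):
    """Replace each character in a Unicode punctuation category (P*) with repl."""
    return ''.join(
        repl if unicodedata.category(ch).startswith('P') else ch
        for ch in text
    )


# Hypothetical, deliberately tiny mapping from full-width to ASCII punctuation.
NORMALIZE_TABLE = str.maketrans({
    '《': '(', '》': ')', '？': '?', '！': '!', '，': ',', '。': '.',
})


def normalize_punc(text):
    """Map the punctuation listed in NORMALIZE_TABLE to ASCII equivalents."""
    return text.translate(NORMALIZE_TABLE)


removed = remove_punc('Hello world!')
tagged = remove_punc('Hello world!', '<P>')
normalized = normalize_punc('你读过那本《边城》吗？什么编程？！人生苦短，我用 Python。')
```

Under this category-based definition, '+', '^', '$' and '~' belong to the symbol categories (S*) rather than the punctuation categories, so they are left untouched.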

Chain function

Plane provides extract, replace, segment, punc.remove and punc.normalize as methods that can be chained. Because segment returns a list, it can only be called at the end of the chain.

Plane.text holds the processed text, and Plane.values holds the extracted tokens.

from plane import Plane
from plane.pattern import EMAIL

p = Plane()
p.update('My email is my@email.com.').replace(EMAIL, '').text # update() initializes Plane.text and Plane.values

>>> 'My email is .'

p.update('My email is my@email.com.').replace(EMAIL).segment()

>>> ['My', 'email', 'is', '<Email>', '.']

p.update('My email is my@email.com.').extract(EMAIL).values

>>> [Token(name='Email', value='my@email.com', start=12, end=24)]
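
The chaining style above relies on each method returning the object itself. The following sketch shows that pattern with the standard library only; ChainSketch, its method bodies, and EMAIL_PATTERN are hypothetical and work on raw pattern strings rather than the library's Regex objects.

```python
import re

# Assumed email pattern, adapted from the printed EMAIL regex earlier on.
EMAIL_PATTERN = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+'


class ChainSketch:
    """Hypothetical chainable processor; not the library's Plane class."""

    def __init__(self):
        self.text = ''
        self.values = []

    def update(self, text):
        # Reset both the working text and the extracted values.
        self.text = text
        self.values = []
        return self

    def replace(self, pattern, repl):
        self.text = re.sub(pattern, repl, self.text)
        return self  # returning self is what makes chaining possible

    def extract(self, pattern):
        self.values.extend(m.group() for m in re.finditer(pattern, self.text))
        return self

    def segment(self):
        # Returns a list, not self, so it has to be the last call in a chain.
        return re.findall(r'<[A-Za-z]+>|[A-Za-z0-9]+|\S', self.text)


p = ChainSketch()
cleaned = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '').text
tokens = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '<Email>').segment()
found = p.update('My email is my@email.com.').extract(EMAIL_PATTERN).values
```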

Pipeline

You can also use Pipeline to register the processing steps first and then apply them to a text.

segment and extract can only appear at the end of the pipeline.

from plane import Pipeline, replace, segment
from plane.pattern import URL

pipe = Pipeline()
pipe.add(replace, URL, '')
pipe.add(segment)
pipe('http://www.guokr.com is online.')

>>> ['is', 'online', '.']
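
A pipeline of this kind can be pictured as a list of (function, arguments) steps applied in order. The sketch below is a minimal illustration with hypothetical names (PipelineSketch, replace_pattern, split_tokens, URL_PATTERN); it does not reproduce the library's Pipeline internals.

```python
import re

URL_PATTERN = r'https?://\S+'  # simplified URL pattern, assumed for this sketch


def replace_pattern(text, pattern, repl):
    """Replace every match of pattern in text with repl."""
    return re.sub(pattern, repl, text)


def split_tokens(text):
    """Split text into ASCII word runs and single non-space characters."""
    return re.findall(r'[A-Za-z0-9]+|\S', text)


class PipelineSketch:
    """Hypothetical pipeline that stores steps and applies them in order."""

    def __init__(self):
        self.steps = []

    def add(self, func, *args, **kwargs):
        # Remember the function together with its extra arguments.
        self.steps.append((func, args, kwargs))

    def __call__(self, text):
        result = text
        for func, args, kwargs in self.steps:
            # Each step receives the previous result as its first argument.
            result = func(result, *args, **kwargs)
        return result


pipe = PipelineSketch()
pipe.add(replace_pattern, URL_PATTERN, '')
pipe.add(split_tokens)
tokens = pipe('http://www.guokr.com is online.')
```

Since every step is fed the previous step's result, a step that returns a list, such as the tokenizer here, has to come last. This mirrors the restriction on segment and extract stated above.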
