A pre-processing tool for NLP.
Project description
This is a pre-processing tool for NLP.
Features
- A flexible pipeline for text IO
- A flexible tool for text cleaning and extraction
- Text enhancement
- Sentence cut and Chinese character cut
- Text bucketing
- Chinese character normalization
- Various kinds of length counting
- Stopwords
- Some magic usage in pre-processing
- Tools like concurrent processing and batch generation
Install
Requires Python 3.7+.
pip install pnlp
Usage
Iopipe
IO process
tree tests/piop_data/
├── a.md
├── b.txt
├── c.data
├── first
│   ├── fa.md
│   ├── fb.txt
│   ├── fc.data
│   └── second
│       ├── sa.md
│       ├── sb.txt
│       └── sc.data
├── json.json
├── outfile.file
├── outjson.json
└── yml.yml
import os
from pnlp import Reader

DATA_PATH = "./pnlp/tests/piop_data/"
pattern = '*.md'  # could also be '*.txt', 'f*.*', etc.; regex is supported
reader = Reader(pattern, use_regex=True)

# Get lines of all files in one directory, with line index and file name
for line in reader(DATA_PATH):
    print(line.lid, line.fname, line.text)
"""
0 a.md line 1 in a.
1 a.md line 2 in a.
2 a.md line 3 in a.
0 fa.md line 1 in fa.
1 fa.md line 2 in fa
...
"""
# Get lines of one file, with line index and file name
# When a single file is read, the `pattern` has no effect
for line in reader(os.path.join(DATA_PATH, "a.md")):
    print(line.lid, line.fname, line.text)
"""
0 a.md line 1 in a.
1 a.md line 2 in a.
2 a.md line 3 in a.
"""
# Get all file paths in one directory
for path in reader.gen_files(DATA_PATH, pattern):
    print(path)
"""
pnlp/tests/piop_data/a.md
pnlp/tests/piop_data/first/fa.md
pnlp/tests/piop_data/first/second/sa.md
"""
# Get the content (article) of all files in one directory, with file name
paths = reader.gen_files(DATA_PATH, pattern)
articles = reader.gen_articles(paths)
for article in articles:
    print(article.fname)
    print(article.f.read())
"""
a.md
line 1 in a.
line 2 in a.
line 3 in a.
...
"""
# Get lines of all files in one directory, with line index and file name
# the same as reader(DATA_PATH) above
paths = reader.gen_files(DATA_PATH, pattern)
articles = reader.gen_articles(paths)
for line in reader.gen_flines(articles):
    print(line.lid, line.fname, line.text)
Built-in Methods
import pnlp
# Read
file_string = pnlp.read_file(file_path)
file_list = pnlp.read_lines(file_path)
file_json = pnlp.read_json(file_path)
file_yaml = pnlp.read_yaml(file_path)
file_csv = pnlp.read_csv(file_path)
# Write
pnlp.write_json(file_path, data)
pnlp.write_file(file_path, data)
# Others
pnlp.check_dir(dirname)  # creates dirname if it does not exist
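For example, a quick round trip with the JSON helpers (a minimal sketch; the ./tmp path and the data dict are made up for illustration):

import pnlp

data = {"name": "pnlp", "version": "0.4.2"}
pnlp.check_dir("./tmp")  # creates ./tmp if it does not exist
pnlp.write_json("./tmp/demo.json", data)
assert pnlp.read_json("./tmp/demo.json") == data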
Text
Clean and Extract
import re
# Use Text
from pnlp import Text
text = "这是https://www.yam.gift长度测试,《 》*)FSJfdsjf😁。233."
pattern = re.compile(r'\d+')
# pattern can be a re.Pattern or a str
# The default is '', which means no pattern (actually re.compile(r'.+')); with the default, clean returns nothing and extract returns the original text.
# If pattern is a string, a built-in pattern will be used; there are 11 types:
# 'chi': Chinese character
# 'pun': Punctuations
# 'whi': White space
# 'nwh': Non White space
# 'wnb': Word and number
# 'nwn': Non word and number
# 'eng': English character
# 'num': Number
# 'pic': Pictures
# 'lnk': Links
# 'emj': Emojis
pt = Text(['chi', pattern])
# pt.extract will return matches and their locations
res = pt.extract(text)
print(res)
"""
{'text': '这是长度测试233', 'mats': ['这是', '长度测试', '233'], 'locs': [(0, 2), (22, 26), (60, 63)]}
"""
print(res.text, res.mats, res.locs)
"""
'这是长度测试' ['这是', '长度测试'] [(0, 2), (22, 26)]
"""
# pt.clean will return cleaned text using the pattern
print(pt.clean(text))
"""
https://www.yam.gift,《 》*)FSJfdsjf😁。233.
"""
pt = Text(['pic', 'lnk'])
res = pt.extract(text)
print(res.mats)
"""
['https://www.yam.gif',
'',
'https://www.yam.gift',
'http://xx.jpg']
"""
print(pt.clean(text))
"""
这是t长度测试,《 》*)FSJfdsjf😁。233.
"""
Regex
# Use Regex
from pnlp import Regex

reg = Regex()

def clean_text(text: str) -> str:
    text = reg.pwhi.sub("", text)  # remove whitespace
    text = reg.pemj.sub("", text)  # remove emojis
    text = reg.ppic.sub("", text)  # remove picture links
    text = reg.plnk.sub("", text)  # remove links
    return text
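A quick sanity check of the helper above (a minimal sketch reusing the sample text from the Text section; the exact output depends on the built-in patterns):

text = "这是https://www.yam.gift长度测试,《 》*)FSJfdsjf😁。233."
print(clean_text(text))
# whitespace, emojis, picture links and links should be stripped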
Cut
AnypartCut
# Cut by Regex
import re

from pnlp import cut_part, psent

text = "你好!欢迎使用。"
sent_list = cut_part(text, psent, with_spliter=True, with_offset=False)
print(sent_list)
"""
['你好!', '欢迎使用。']
"""
pcustom_sent = re.compile(r'[。!]')
sent_list = cut_part(text, pcustom_sent, with_spliter=False, with_offset=False)
print(sent_list)
"""
['你好', '欢迎使用']
"""
sent_list = cut_part(text, pcustom_sent, with_spliter=False, with_offset=True)
print(sent_list)
"""
[('你好', 0, 3), ('欢迎使用', 3, 8)]
"""
SentenceCut
# Cut Sentence
from pnlp import cut_sentence as pcs
text = "你好!欢迎使用。"
sent_list = pcs(text)
print(sent_list)
"""
['你好!', '欢迎使用。']
"""
ChineseCharCut
# Cut to Chinese chars
from pnlp import cut_zhchar
text = "你好,hello, 520 i love u. = ”我爱你“。"
char_list = cut_zhchar(text)
print(char_list)
"""
['你', '好', ',', 'hello', ',', ' ', '520', ' ', 'i', ' ', 'love', ' ', 'u', '.', ' ', '=', ' ', '”', '我', '爱', '你', '“', '。']
"""
char_list = cut_zhchar(text, remove_blank=True)
print(char_list)
"""
['你', '好', ',', 'hello', ',', '520', 'i', 'love', 'u', '.', '=', '”', '我', '爱', '你', '“', '。']
"""
CombineBucket
from pnlp import combine_bucket
parts = [
    '习近平指出',
    '中方不仅维护中国人民生命安全和身体健康',
    '也维护世界人民生命安全和身体健康',
    '我们本着公开',
    '透明',
]
buckets = combine_bucket(parts.copy(), 10, truncate=True, keep_remain=True)
print(buckets)
"""
['习近平指出',
'中方不仅维护中国人民',
'生命安全和身体健康',
'也维护世界人民生命安',
'全和身体健康',
'我们本着公开透明']
"""
Enhancement
# Both samplers support the delete, swap, and insert sampling methods.
text = "人为什么活着?生而为人必须要有梦想!还要有尽可能多的精神体验。"
# TokenLevel
from pnlp import TokenLevelSampler
tls = TokenLevelSampler()
tls.make_samples(text)
"""
{'delete': '人为什么活着?生而为人必须要梦想!还要有尽可能多的精神体验。',
'swap': '为人什么活着?生而为人必须要有梦想!还要有尽可能多的精神体验。',
'insert': '人为什么活着?生而为人必须要有梦想!还还要有尽可能多的精神体验。',
'together': '人什么着着活?生而必为为须要有梦想!还要有尽可能多的精神体验。'}
"""
# a custom tokenizer is also supported
import jieba
tls.make_samples(text, jieba.lcut)
"""
{'delete': '人为什么活着?生而为人要有梦想!还要有尽可能多的精神体验。',
'swap': '为什么人活着?生而为人必须要有梦想!还要有尽可能多的精神体验。',
'insert': '人为什么活着?生而为人必须要有梦想!还要还要有尽可能多的精神体验。',
'together': '人为什么活着?生而为人人要有梦想!还要有多尽可能的精神体验。'}
"""
# SentenceLevel
from pnlp import SentenceLevelSampler
sls = SentenceLevelSampler()
sls.make_samples(text)
"""
{'delete': '生而为人必须要有梦想!还要有尽可能多的精神体验。',
'swap': '人为什么活着?还要有尽可能多的精神体验。生而为人必须要有梦想!',
'insert': '人为什么活着?还要有尽可能多的精神体验。生而为人必须要有梦想!生而为人必须要有梦想!',
'together': '生而为人必须要有梦想!人为什么活着?人为什么活着?'}
"""
TokenLevelSampler notes:
- It uses a default tokenizer for Chinese (a Chinese character tokenizer) and for English (a simple whitespace tokenizer).
- The tokenizer can be any one you like, but its output should be either a list of tokens or a list of tuple pairs, each pair containing a token and its part-of-speech; see the sketch after this list.
- It uses stopwords as the default sample words and function-word parts-of-speech as the default sample pos. This means only tokens that appear in the sample words, or whose pos is in the sample pos (if they have one), are sampled. You can customize both as you like.
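For instance, a part-of-speech tokenizer could be plugged in like this (a hedged sketch; jieba.posseg is just one possible choice, not something the library requires):

import jieba.posseg as pseg

def pos_tokenizer(s):
    # return a list of (token, pos) pairs, as the sampler expects
    return [(p.word, p.flag) for p in pseg.lcut(s)]

tls.make_samples(text, pos_tokenizer)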
Normalization
from pnlp import num_norm
num_norm.num2zh(1024) == "一千零二十四"
num_norm.num2zh(1024).to_money() == "壹仟零贰拾肆"
num_norm.zh2num("一千零二十四") == 1024
Transformation
# BIO labels to entities
from pnlp import pick_entity_from_bio_labels
pairs = ["天 B-LOC", "安 I-LOC", "门 I-LOC", "有 O", "毛 B-PER", "主 I-PER", "席 I-PER"]
pick_entity_from_bio_labels(pairs)
[('天安门', 'LOC'), ('毛主席', 'PER')]
StopWords
from pnlp import StopWords, chinese_stopwords, english_stopwords
csw = StopWords("/path/to/custom/stopwords.txt")
csw.stopwords # a set of the custom stopwords
csw.zh == chinese_stopwords # Chinese stopwords
csw.en == english_stopwords # English stopwords
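A typical use of these sets is filtering a token list (a small sketch; the token list is made up for illustration, and csw.zh is assumed to behave like a set):

tokens = ["我们", "的", "梦想", "是", "星辰", "大海"]
filtered = [t for t in tokens if t not in csw.zh]
print(filtered)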
Length
from pnlp import Length
text = "这是https://www.yam.gift长度测试,《 》*)FSJfdsjf😁。233."
pl = Length(text)
# Note that even if a pattern is used, the length always refers to the raw text.
# Length is counted by character, not by whole words or numbers.
print("Length of all characters: ", pl.len_all)
print("Length of all non-white characters: ", pl.len_nwh)
print("Length of all Chinese characters: ", pl.len_chi)
print("Length of all words and numbers: ", pl.len_wnb)
print("Length of all punctuations: ", pl.len_pun)
print("Length of all English characters: ", pl.len_eng)
print("Length of all numbers: ", pl.len_num)
"""
Length of all characters: 64
Length of all non-white characters: 63
Length of all Chinese characters: 6
Length of all words and numbers: 41
Length of all punctuations: 14
Length of all English characters: 32
Length of all numbers: 3
"""
Magic
from pnlp import MagicDict
# Nested dict
pmd = MagicDict()
pmd['a']['b']['c'] = 2
print(pmd)
"""
{'a': {'b': {'c': 2}}}
"""
# Preserve all keys that share a value when a dict is reversed
dx = {1: 'a',
      2: 'a',
      3: 'a',
      4: 'b'}
print(MagicDict.reverse(dx))
"""
{'a': [1, 2, 3], 'b': 4}
"""
Concurring
import math

from pnlp import concurring

def is_prime(x):
    if x < 2:
        return False
    for i in range(2, int(math.sqrt(x)) + 1):
        if x % i == 0:
            return False
    return True

@concurring
def get_primes(lst):
    res = []
    for i in lst:
        if is_prime(i):
            res.append(i)
    return res

@concurring(type="thread", max_workers=10)
def get_primes(lst):
    pass
The `concurring` wrapper simply makes your original function run concurrently.
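A minimal invocation sketch (an assumption, not from the docs; the exact return shape depends on how concurring merges the workers' results):

primes = get_primes(list(range(100)))
print(primes)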
Test
Clone the repo and run:
$ python -m pytest
ChangeLog
v0.4.2
Add bio label => entity
v0.4.1
Remove annotation re.Pattern
v0.4.0
Use dataclasses the right way.
v0.3.11
Adjust MagicDict and check_dir.
v0.3.10
Fix piop strip.
v0.3.9
Reader support regex.
v0.3.8
Fix concurring for multiple processing.
v0.3.7
Add concurring and batch generator
v0.3.5
Add text enhancement.
v0.3.3/4
Fix URL link and picture regex patterns.
v0.3.2
Fix cut_part for sentences ending with a white space and a full stop.
v0.3.1
Add cut_part to cut text into any parts by a given regex pattern; add combine_bucket to combine parts into buckets by a given threshold (length).
v0.3.0
Update cut_sentence; Add NumNorm.
v0.28-29
Update cut_zhchar.
v0.27
Add cut_zhchar.
v0.26
Add read_csv, remove ; as a sentence cut standard.
v0.25
Add stop_words.
v0.24
Fix read_json.
v0.23
Fix Text default rule.
v0.22
Make Text more convenient to use.
v0.21
Add cut_sentence method.
v0.20
Optimize several interfaces and make Text accept a list of regular expression patterns.
File details
Details for the file pnlp-0.4.2.tar.gz.
File metadata
- Download URL: pnlp-0.4.2.tar.gz
- Upload date:
- Size: 27.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.27.0 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bbe16898f011fd3e1a87ef2e4df057179a6d11b1e4e19872acee7a4899a9e4ee |
| MD5 | b659e9a1d829ae04157683ac8ac28850 |
| BLAKE2b-256 | 619d16e426315b351338915340db89dc26e42f6cf13d06165488b955ece3f973 |
File details
Details for the file pnlp-0.4.2-py3-none-any.whl.
File metadata
- Download URL: pnlp-0.4.2-py3-none-any.whl
- Upload date:
- Size: 32.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.27.0 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6e9928a07bf5ba00487c135e1bd0c876641c031b25404e8760556a774d9e66c |
| MD5 | ce4c68274f5f4e7d068845115ff6e42a |
| BLAKE2b-256 | 866c135334542fbd0e946de00dbce315848815b472cfc3f9931f2df6e9a9f569 |