Fast text processing acceleration.
pybolt
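A fast text processing and NLP toolkit.
Current 0.0.1 beta:
- Pure Python implementation
- Keyword search and replacement
- Co-occurrence detection for word groups of any size
- Unsupervised generation of a word-segmentation vocabulary from large corpora
- Note: to stay compatible with mixed Chinese-English corpora, keyword extraction is not suitable for cases where an English word contains a smaller English word.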
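The caveat above can be illustrated without pybolt. This is a minimal sketch, assuming keywords are matched as plain substrings with no word-boundary check; that is an assumption about why the limitation exists, not a description of pybolt's internals, and `find_substrings` is a hypothetical helper.

```python
def find_substrings(text, keywords):
    """Return every keyword that occurs anywhere in text as a raw substring.

    Hypothetical helper for illustration only; not part of pybolt.
    """
    return [kw for kw in keywords if kw in text]


# Chinese text has no spaces between words, so substring matching is the
# natural way to find keywords.
print(find_substrings("我收到了清华大学的录取通知书", ["清华", "清华大学"]))
# ['清华', '清华大学']

# In English the same rule reports "art", although it only occurs inside "start".
print(find_substrings("start the engine", ["art", "engine"]))
# ['art', 'engine']
```

Adding a word-boundary check would fix the English case but would break matching inside unsegmented Chinese text, which is the trade-off the note describes.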
Install pybolt
pip install py-bolt
Usage examples
Extract keywords
from pybolt import bolt_text
bolt_text.add_keywords(["清华", "清华大学"])
found_words = bolt_text.extract_keywords("我收到了清华大学的录取通知书.")
print(found_words)
# ['清华', '清华大学']
found_words = bolt_text.extract_keywords("我收到了清华大学的录取通知书.", longest_only=True)
print(found_words)
# ['清华大学']
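To make the effect of `longest_only` concrete, here is a plain-Python sketch that reproduces the two outputs above. It assumes `longest_only=True` means "drop any match that lies inside a longer match"; the function `extract` and its scanning strategy are hypothetical and are not pybolt's implementation.

```python
def extract(text, keywords, longest_only=False):
    """Find keyword occurrences; optionally keep only the longest overlapping ones."""
    spans = []  # (start, end, keyword) for every occurrence
    for kw in keywords:
        if not kw:
            continue  # an empty keyword would match everywhere
        start = text.find(kw)
        while start != -1:
            spans.append((start, start + len(kw), kw))
            start = text.find(kw, start + 1)
    spans.sort(key=lambda s: (s[0], s[1]))

    if longest_only:
        # Keep a span only if no strictly longer span fully covers it.
        spans = [
            s for s in spans
            if not any(
                o[0] <= s[0] and s[1] <= o[1] and (o[1] - o[0]) > (s[1] - s[0])
                for o in spans
            )
        ]
    return [kw for _, _, kw in spans]


sentence = "我收到了清华大学的录取通知书."
print(extract(sentence, ["清华", "清华大学"]))
# ['清华', '清华大学']
print(extract(sentence, ["清华", "清华大学"], longest_only=True))
# ['清华大学']
```

Scanning once per keyword is fine for a handful of keywords; for large keyword sets a trie or Aho-Corasick automaton is the usual choice.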
Batch extract keywords
from pybolt import bolt_text
def get_lines():
    yield "我考上了清华大学"
    yield "我梦见我考上了清华大学"
bolt_text.add_keywords(["清华", "清华大学"])
for df in bolt_text.batch_extract_keywords(get_lines(), concurrency=10000000):
    for _, row in df.iterrows():
        print(row.example, row.keywords)
Replace keywords
from pybolt import bolt_text
bolt_text.add_replace_map({"清华大学": "北京大学"})
sentence = bolt_text.replace_keywords("我收到了清华大学的录取通知书.")
print(sentence)
# "我收到了北京大学的录取通知书."
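The replacement step can be sketched with the standard `re` module. This is an illustration under the assumption that longer keys should win when keys overlap; `replace_with_map` is a hypothetical helper, not pybolt's code.

```python
import re


def replace_with_map(text, replace_map):
    """Replace every key of replace_map found in text with its mapped value."""
    if not replace_map:
        return text
    # Try longer keys first so "清华大学" wins over a shorter key such as "清华".
    keys = sorted(replace_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: replace_map[m.group(0)], text)


print(replace_with_map("我收到了清华大学的录取通知书.", {"清华大学": "北京大学"}))
# 我收到了北京大学的录取通知书.
```

Doing all replacements in a single pass also avoids one replacement's output being rewritten by a later rule.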
Batch replace keywords
from pybolt import bolt_text
def get_lines():
    yield "我考上了清华大学"
    yield "我梦见我考上了清华大学"
bolt_text.add_replace_map({"清华大学": "北京大学"})
for df in bolt_text.batch_extract_keywords(get_lines(), concurrency=10000000):
    for _, row in df.iterrows():
        print(row.example)
Co-occurrence word recognition
from pybolt import bolt_text
bolt_text.add_co_occurrence_words(["小明", "清华"], "高考")
res, tag = bolt_text.is_co_occurrence("小明考上了清华大学")
print(res, tag)
# True 高考
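The co-occurrence check can be read as "every word of a registered group appears in the text". The sketch below follows that reading; the class `CoOccurrence` and its method names are hypothetical stand-ins, and pybolt's actual matching strategy may differ.

```python
class CoOccurrence:
    """Toy co-occurrence detector: a rule fires when all of its words occur."""

    def __init__(self):
        self._rules = []  # list of (words, tag) pairs

    def add(self, words, tag):
        # A group may contain any number of words.
        self._rules.append((tuple(words), tag))

    def check(self, text):
        # Return (True, tag) for the first rule whose words all occur in text.
        for words, tag in self._rules:
            if all(word in text for word in words):
                return True, tag
        return False, None


detector = CoOccurrence()
detector.add(["小明", "清华"], "高考")
print(detector.check("小明考上了清华大学"))
# (True, '高考')
print(detector.check("大明考上了清华大学"))
# (False, None)
```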
Batch text processor
from pybolt import bolt_text
def get_lines():
    yield "小明考上了清华大学"
    yield "小明做梦的时候考上了清华大学"
    yield "大明做梦的时候考上了清华大学"
def my_processor(line):
    if line.startswith("小明"):
        return True
    return None
for df in bolt_text.batch_text_processor(get_lines(), my_processor):
    df = df[df["processor_result"].notna()]
    print(df.head())
Text normalize
from pybolt import bolt_text
print(bolt_text.normalize("⓪⻆🈚"))
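For comparison, Python's standard library offers Unicode compatibility normalization (NFKC), which folds many stylized characters into their plain forms. This sketch only shows the standard-library behavior; whether `bolt_text.normalize` uses NFKC, its own mapping table, or both is not stated in this document, so the outputs of the two may differ.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC normalization using the standard library only."""
    return unicodedata.normalize("NFKC", text)


print(nfkc("⓪"))            # circled digit zero
# 0
print(nfkc("ＡＢＣ１２３"))  # full-width letters and digits
# ABC123
```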
Text clean
import re
from pybolt import bolt_text
_pattern = re.compile("([^\u4E00-\u9FD5\u9FA6-\u9FEF\u3400-\u4DB5a-zA-Z0-9 +]+)", re.U)
print(bolt_text.clean("aaaaa+++++.....abcadf ga a", pattern=_pattern, pattern_replace="", normalize=True, crc_cut=3))
Word discover
from pybolt.bolt_nlp import WordDiscover
wd = WordDiscover()
wd.word_discover(["examples.txt"])
# saves new_words.vocab in the directory the script is run from
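The algorithm behind `WordDiscover` is not described in this document. One common ingredient of unsupervised word discovery is a cohesion score such as pointwise mutual information (PMI) between adjacent characters. The sketch below shows only that idea for two-character candidates; `discover_bigrams` and its thresholds are hypothetical.

```python
import math
from collections import Counter


def discover_bigrams(lines, min_count=2, min_pmi=1.0):
    """Score adjacent character pairs by PMI and keep the frequent, cohesive ones."""
    chars = Counter()
    bigrams = Counter()
    for line in lines:
        chars.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))

    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    found = {}
    for bigram, count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust
        p_pair = count / n_bigrams
        p_first = chars[bigram[0]] / n_chars
        p_second = chars[bigram[1]] / n_chars
        pmi = math.log(p_pair / (p_first * p_second))
        if pmi >= min_pmi:
            found[bigram] = pmi
    return found


corpus = ["我考上了清华大学", "他考上了北京大学", "清华大学很好"]
candidates = discover_bigrams(corpus)
print(sorted(candidates, key=candidates.get, reverse=True))
```

On a corpus this small, cross-word pairs such as "华大" also pass the threshold. Practical systems usually add a boundary-freedom measure (for example left and right neighbor entropy), consider longer n-grams, and need far more text.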
Performance