Skip to main content

Fast text processing acceleration.

Project description

pybolt

Fast text processing acceleration. 一个快速的文本处理及NLP工具.

当前0.0.1测试版:

  • 纯python实现
  • 实现了关键词查找和替换功能
  • 实现了任意维的词汇共现判别
  • 实现了从海量语料无监督生成分词词库
  • 注意事项:为了兼容中英混合语料,提取关键词不适用于英文词中包含更小的英文词的情况;

安装pybolt

pip install py-bolt

使用试例

Extract keywords

from pybolt import bolt_text
bolt_text.add_keywords(["清华", "清华大学"])
found_words = bolt_text.extract_keywords("我收到了清华大学的录取通知书.")
print(found_words)
# ['清华', '清华大学']
found_words = bolt_text.extract_keywords("我收到了清华大学的录取通知书.", longest_only=True)
print(found_words)
# ['清华大学']

Batch extract keywords

from pybolt import bolt_text
def get_lines():
    yield "我考上了清华大学"
    yield "我梦见我考上了清华大学"

bolt_text.add_keywords(["清华", "清华大学"])
for df in bolt_text.batch_extract_keywords(get_lines(), concurrency=10000000):
    for _, row in df.iterrows():
        print(row.example, row.keywords)

Replace keywords

from pybolt import bolt_text
bolt_text.add_replace_map({"清华大学": "北京大学"})
sentence = bolt_text.replace_keywords("我收到了清华大学的录取通知书.")
print(sentence)
# "我收到了北京大学的录取通知书."

Batch replace keywords

from pybolt import bolt_text

def get_lines():
    yield "我考上了清华大学"
    yield "我梦见我考上了清华大学"

bolt_text.add_replace_map({"清华大学": "北京大学"})
for df in bolt_text.batch_extract_keywords(get_lines(), concurrency=10000000):
    for _, row in df.iterrows():
        print(row.example)

Co-occurrence word recognition

from pybolt import bolt_text
bolt_text.add_co_occurrence_words(["小明", "清华"], "高考")
res, tag = bolt_text.is_co_occurrence("小明考上了清华大学")
print(res, tag)
# True 高考

Batch text processor

from pybolt import bolt_text
def get_lines():
    yield "小明考上了清华大学"
    yield "小明做梦的时候考上了清华大学"
    yield "大明做梦的时候考上了清华大学"
def my_processor(line):
    if line.startswith("小明"):
        return True
    return None

for df in bolt_text.batch_text_processor(get_lines(), my_processor):
    df = df[df["processor_result"].notna()]
    print(df.head())

Text normalize

from pybolt import bolt_text
print(bolt_text.normalize("⓪⻆🈚"))

Text clean

import re
from pybolt import bolt_text
_pattern = re.compile("([^\u4E00-\u9FD5\u9FA6-\u9FEF\u3400-\u4DB5a-zA-Z0-9 +]+)", re.U)
print(bolt_text.clean("aaaaa+++++.....abcadf    ga   a", pattern=_pattern, pattern_replace="", normalize=True, crc_cut=3))

Word discover

from pybolt.bolt_nlp import WordDiscover
wd = WordDiscover()
wd.word_discover(["examples.txt"])
# will save the new_words.vocab in execution directory

性能

测试了关键词查找功能,单句速度相对flashtext提升了30%,批操作速度相对flashtext提升了260%.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

py-bolt-0.0.24.tar.gz (666.8 kB view hashes)

Uploaded Source

Built Distribution

py_bolt-0.0.24-py3-none-any.whl (686.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page