pybolt

Fast text processing acceleration. A fast text-processing and NLP toolkit.

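Current 0.0.1 beta release:

  • Pure Python implementation
  • Keyword extraction and replacement
  • Co-occurrence detection for word groups of any size
  • Unsupervised generation of a word-segmentation vocabulary from very large corpora
  • Note: to stay compatible with mixed Chinese-English corpora, keyword extraction is not suitable when an English word contains a shorter English word.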
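
The note about English words can be illustrated with a minimal sketch. It assumes plain substring matching with no word-boundary check, which is what mixed Chinese-English text requires because Chinese has no spaces between words. The function `naive_extract` is hypothetical and is not pybolt's implementation; it only mirrors the behavior of the `longest_only` flag shown in the usage examples.

```python
def naive_extract(text, keywords, longest_only=False):
    """Find keywords in text by plain substring search (hypothetical helper)."""
    hits = []  # (start, end, keyword)
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            hits.append((start, start + len(kw), kw))
            start = text.find(kw, start + 1)
    hits.sort()
    if longest_only:
        # Drop every hit whose span lies inside a strictly longer hit.
        hits = [
            h for h in hits
            if not any(
                o[0] <= h[0] and h[1] <= o[1] and (o[1] - o[0]) > (h[1] - h[0])
                for o in hits
            )
        ]
    return [kw for _, _, kw in hits]


# Without word boundaries, "apple" is reported inside "pineapple".
print(naive_extract("I ate a pineapple", ["apple", "pineapple"]))
# ['pineapple', 'apple']
```

With `longest_only=True` the same call keeps only `'pineapple'`, but a shorter English keyword that merely happens to sit inside an unrelated longer word would still be reported.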

Install pybolt

pip install py-bolt

Usage examples

Extract keywords

from pybolt import bolt_text
bolt_text.add_keywords(["清华", "清华大学"])
found_words = bolt_text.extract_keywords("我收到了清华大学的录取通知书.")
print(found_words)
# ['清华', '清华大学']
found_words = bolt_text.extract_keywords("我收到了清华大学的录取通知书.", longest_only=True)
print(found_words)
# ['清华大学']

Batch extract keywords

from pybolt import bolt_text
def get_lines():
    yield "我考上了清华大学"
    yield "我梦见我考上了清华大学"

bolt_text.add_keywords(["清华", "清华大学"])
for df in bolt_text.batch_extract_keywords(get_lines(), concurrency=10000000):
    for _, row in df.iterrows():
        print(row.example, row.keywords)

Replace keywords

from pybolt import bolt_text
bolt_text.add_replace_map({"清华大学": "北京大学"})
sentence = bolt_text.replace_keywords("我收到了清华大学的录取通知书.")
print(sentence)
# "我收到了北京大学的录取通知书."

Batch replace keywords

from pybolt import bolt_text

def get_lines():
    yield "我考上了清华大学"
    yield "我梦见我考上了清华大学"

bolt_text.add_replace_map({"清华大学": "北京大学"})
for df in bolt_text.batch_extract_keywords(get_lines(), concurrency=10000000):
    for _, row in df.iterrows():
        print(row.example)

Co-occurrence word recognition

from pybolt import bolt_text
bolt_text.add_co_occurrence_words(["小明", "清华"], "高考")
res, tag = bolt_text.is_co_occurrence("小明考上了清华大学")
print(res, tag)
# True 高考
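
For readers who want to see the idea in isolation, here is a minimal sketch of co-occurrence tagging. It assumes a rule fires when every word in the group appears somewhere in the text; the class `CoOccurrenceRules` and its methods are hypothetical names, not part of pybolt, and the real matching logic may differ.

```python
class CoOccurrenceRules:
    """Map groups of words to a tag (hypothetical helper, not pybolt's API)."""

    def __init__(self):
        self._rules = []  # list of (words, tag)

    def add(self, words, tag):
        self._rules.append((tuple(words), tag))

    def match(self, text):
        # Return the first rule whose words all occur in the text.
        for words, tag in self._rules:
            if all(word in text for word in words):
                return True, tag
        return False, None


rules = CoOccurrenceRules()
rules.add(["小明", "清华"], "高考")
print(rules.match("小明考上了清华大学"))
# (True, '高考')
```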

Batch text processor

from pybolt import bolt_text
def get_lines():
    yield "小明考上了清华大学"
    yield "小明做梦的时候考上了清华大学"
    yield "大明做梦的时候考上了清华大学"
def my_processor(line):
    if line.startswith("小明"):
        return True
    return None

for df in bolt_text.batch_text_processor(get_lines(), my_processor):
    df = df[df["processor_result"].notna()]
    print(df.head())

Text normalization

from pybolt import bolt_text
print(bolt_text.normalize("⓪⻆🈚"))
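
As a point of comparison, the standard library can fold many compatibility characters (circled digits, full-width letters) into their plain forms. The sketch below uses Unicode NFKC normalization; whether `bolt_text.normalize` uses NFKC or its own mapping table is an assumption, and the helper name `nfkc` is hypothetical.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC compatibility normalization (standard library only)."""
    return unicodedata.normalize("NFKC", text)


print(nfkc("⓪①Ａ"))
# 01A
```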

Text cleaning

import re
from pybolt import bolt_text
_pattern = re.compile("([^\u4E00-\u9FD5\u9FA6-\u9FEF\u3400-\u4DB5a-zA-Z0-9 +]+)", re.U)
print(bolt_text.clean("aaaaa+++++.....abcadf    ga   a", pattern=_pattern, pattern_replace="", normalize=True, crc_cut=3))
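
To see what the pattern contributes on its own, the sketch below applies it with plain `re.sub`. It removes every run of characters outside the allowed set (the listed CJK ranges, ASCII letters, digits, space, and `+`). The helper name `strip_disallowed` is hypothetical, and the effects of `normalize=True` and `crc_cut=3` are not reproduced here.

```python
import re

# Same character class as in the example above: anything NOT in it is removed.
keep_pattern = re.compile(
    "([^\u4E00-\u9FD5\u9FA6-\u9FEF\u3400-\u4DB5a-zA-Z0-9 +]+)", re.U
)


def strip_disallowed(text):
    """Delete runs of characters that the pattern does not allow."""
    return keep_pattern.sub("", text)


print(strip_disallowed("aaaaa+++++.....abcadf    ga   a"))
# aaaaa+++++abcadf    ga   a
```

Only the run of dots is removed, because letters, spaces, and `+` are all inside the allowed set.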

Word discovery

from pybolt.bolt_nlp import WordDiscover
wd = WordDiscover()
wd.word_discover(["examples.txt"])
# Saves new_words.vocab to the directory the script is run from
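
This page does not describe how `WordDiscover` works internally. As a rough illustration of unsupervised word discovery in general, the sketch below scores adjacent character pairs by pointwise mutual information (PMI), a common heuristic for spotting characters that stick together. The function `discover_bigrams`, its parameters, and its thresholds are all hypothetical and are not pybolt's algorithm.

```python
import math
from collections import Counter


def discover_bigrams(lines, min_count=2, min_pmi=1.0):
    """Return {bigram: pmi} for character pairs that co-occur unusually often."""
    chars = Counter()
    pairs = Counter()
    for line in lines:
        chars.update(line)
        pairs.update(line[i:i + 2] for i in range(len(line) - 1))
    total_chars = sum(chars.values())
    total_pairs = sum(pairs.values())

    found = {}
    for pair, count in pairs.items():
        if count < min_count:
            continue  # too rare to trust
        p_pair = count / total_pairs
        p_first = chars[pair[0]] / total_chars
        p_second = chars[pair[1]] / total_chars
        pmi = math.log(p_pair / (p_first * p_second))
        if pmi >= min_pmi:
            found[pair] = pmi
    return found


corpus = ["我考上了清华大学", "我梦见我考上了清华大学", "清华很好"]
candidates = discover_bigrams(corpus)
print("清华" in candidates)
# True
```

A practical system would also consider longer n-grams and boundary statistics; this sketch only shows the scoring step.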

Performance

In tests of keyword extraction, single-sentence processing was 30% faster than flashtext, and batch processing was 260% faster than flashtext.
