Project description

Natural language processing toolkit

from duowen_huqie import NLP

nlp = NLP()

text = "Apache Spark 是一个用于大规模数据处理的统一分析引擎。它提供了 Java、Scala、Python 和 R 的高级 API,以及支持通用执行图的优化引擎。它还支持包括 Spark SQL 用于 SQL 和结构化数据处理、Spark 上的 pandas API 用于 pandas 工作负载、MLlib 用于机器学习、GraphX 用于图处理以及 Structured Streaming 用于增量计算和流处理的丰富高级工具集。"

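# Coarse-grained segmentation
print(nlp.content_cut(text))

# Fine-grained segmentation
print(nlp.content_sm_cut(text))

# Add a dictionary entry
nlp.tok_add_word("分析引擎", 1000, "nr")

# Delete a dictionary entry
nlp.tok_del_word("分析引擎")

# Update a dictionary entry
nlp.tok_update_word("分析引擎", 1000, "n")

# Look up the part-of-speech tag of a word
print(nlp.tok_tag_word("数据"))

# Compute term weights for a query
print(nlp.term_weight("大数据平台使用的什么数据引擎"))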

query = "什么是混合召回?"

documents = ["混合召回是一种结合文本召回和向量召回的方法。",
             "文本召回通过关键词匹配实现,向量召回通过语义相似度实现。",
             "混合召回可以提高搜索的准确性和覆盖率。", ]

query_vector = [...]  # vectors must be computed externally
docs_vector = [[...], [...], [...]]  # vectors must be computed externally

# Text similarity
print(nlp.text_similarity(question=query, docs=documents))

# Query text similarity (stop words removed)
print(nlp.query_text_similarity(question=query, docs=documents))

# Hybrid similarity
print(nlp.hybrid_similarity(question=query, question_vector=query_vector,
                            docs_vector=docs_vector, docs=documents))

# Query hybrid similarity (stop words removed)
print(nlp.query_hybrid_similarity(question=query, question_vector=query_vector,
                                  docs_vector=docs_vector, docs=documents))

# Vector similarity
print(nlp.vector_similarity(question_vector=query_vector, docs_vector=docs_vector))
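
The similarity calls above combine a text signal with externally computed vectors. As a rough illustration of the general idea only, the following standalone sketch blends token overlap with cosine similarity. It does not use duowen_huqie and is not a description of how the library scores documents; the function names, the toy vectors, and the `vector_weight` value are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def token_overlap(query_tokens, doc_tokens):
    """Fraction of distinct query tokens that also appear in the document."""
    query_set = set(query_tokens)
    if not query_set:
        return 0.0
    return len(query_set & set(doc_tokens)) / len(query_set)


def hybrid_scores(query_tokens, query_vec, docs_tokens, docs_vecs, vector_weight=0.7):
    """Blend token overlap and cosine similarity (the weight is an assumed value)."""
    text_weight = 1.0 - vector_weight
    return [
        text_weight * token_overlap(query_tokens, tokens)
        + vector_weight * cosine_similarity(query_vec, vec)
        for tokens, vec in zip(docs_tokens, docs_vecs)
    ]


# Hand-made toy inputs; real tokens and vectors would come from a tokenizer
# and an embedding model.
query_tokens = ["混合", "召回"]
docs_tokens = [["混合", "召回", "方法"], ["文本", "召回", "关键词"]]
query_vec = [1.0, 0.0]
docs_vecs = [[1.0, 0.0], [0.0, 1.0]]

scores = hybrid_scores(query_tokens, query_vec, docs_tokens, docs_vecs)
print(scores)
```

The first toy document matches the query on both signals and so ranks above the second, which shares only one token and has an orthogonal vector.
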

# New word discovery
from duowen_huqie.new_word_detection import NewWordDetection

nw = NewWordDetection(nlp)
result, new_word = nw.find_word('高祖,沛豐邑中陽裏人也,姓劉氏。母媼嘗息大澤之陂,夢與神遇。是時雷電晦冥 ,父太公往視,則見交龍於上。已而有娠,遂產高祖。高祖為人,隆准而龍顏,美須髯,左股有七十二黑子。寬仁愛人,意豁如也。常有 大度,不事家人生產作業。及壯,試吏,為泗上亭長,延中吏無所不狎侮。好酒及色。 常從王媼、武負貰酒,時飲醉臥,武負、王媼見其上常有怪。', 3, 5)
for k, v in new_word.items():
    print(k, v)
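
For readers unfamiliar with new word discovery, one common starting point is to count repeated character n-grams as candidate words. The sketch below shows that generic technique in isolation. It is an assumption for illustration, not the algorithm used by `NewWordDetection`, and its parameters (`max_len`, `min_count`, `punctuation`) are hypothetical and do not correspond to the arguments of `find_word`.

```python
from collections import Counter


def candidate_ngrams(text, max_len=3, min_count=2, punctuation=",。、 ,."):
    """Return character n-grams (length 2..max_len) seen at least min_count times."""
    # Split on punctuation so that n-grams never span a clause boundary.
    segments, current = [], []
    for ch in text:
        if ch in punctuation:
            if current:
                segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))

    # Count every n-gram inside each segment.
    counts = Counter()
    for seg in segments:
        for n in range(2, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1

    return {gram: c for gram, c in counts.items() if c >= min_count}


sample = "武負、王媼見其上常有怪。常從王媼、武負貰酒。"
candidates = candidate_ngrams(sample)
for gram, count in candidates.items():
    print(gram, count)
```

Practical systems usually refine such raw counts with statistics like cohesion or boundary entropy, and filter out entries already present in the dictionary.
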

Download files

Source Distribution

duowen_huqie-0.1.9.tar.gz (29.5 MB)

Uploaded Source

Built Distribution

duowen_huqie-0.1.9-py3-none-any.whl (29.7 MB)

Uploaded Python 3

File details

Details for the file duowen_huqie-0.1.9.tar.gz.

File metadata

  • Download URL: duowen_huqie-0.1.9.tar.gz
  • Upload date:
  • Size: 29.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for duowen_huqie-0.1.9.tar.gz
Algorithm Hash digest
SHA256 754d19e2c4d824c173a9b2801ea94357d5478125c90db7bebc6907b27f5b0339
MD5 e6076ca6cad62b8e2b4c571ccab85e28
BLAKE2b-256 4e7639a1198ee503396b644944e00c77fc12d75bab60e828af6baa2acc605846

File details

Details for the file duowen_huqie-0.1.9-py3-none-any.whl.

File metadata

  • Download URL: duowen_huqie-0.1.9-py3-none-any.whl
  • Upload date:
  • Size: 29.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for duowen_huqie-0.1.9-py3-none-any.whl
Algorithm Hash digest
SHA256 7767526f142c15cab49b01930fa9feb2ca04e1da1d14deb1e3d67fb16a3fe33e
MD5 bcf70645c37c7bc736b7f162a9bccc83
BLAKE2b-256 cbbc3914793a3aa5f9d1a67d71a06498ab47aefbfe6a51dcca620c58ac1732cc
