Skip to main content

No project description provided

Project description

自然语言处理工具包

from duowen_huqie import NLP

nlp = NLP()

text = "Apache Spark 是一个用于大规模数据处理的统一分析引擎。它提供了 Java、Scala、Python 和 R 的高级 API,以及支持通用执行图的优化引擎。它还支持包括 Spark SQL 用于 SQL 和结构化数据处理、Spark 上的 pandas API 用于 pandas 工作负载、MLlib 用于机器学习、GraphX 用于图处理以及 Structured Streaming 用于增量计算和流处理的丰富高级工具集。"

# 粗切
print(nlp.content_cut(text))

# 细切
print(nlp.content_sm_cut(text))

# 新增词条
nlp.tok_add_word("分析引擎", 1000, "nr")

# 删除词条
nlp.tok_del_word("分析引擎")

# 更新词条
nlp.tok_update_word("分析引擎", 1000, "n")

# 查询词性
print(nlp.tok_tag_word("数据"))

# 词条查询权重
print(nlp.term_weight("大数据平台使用的什么数据引擎"))

query = "什么是混合召回?"

documents = ["混合召回是一种结合文本召回和向量召回的方法。",
             "文本召回通过关键词匹配实现,向量召回通过语义相似度实现。",
             "混合召回可以提高搜索的准确性和覆盖率。", ]

query_vector = [...]  # 向量需要外部计算
docs_vector = [[...], [...], [...]]  # 向量需要外部计算

# 文本相似度
print(nlp.text_similarity(question=query, docs=documents))

# 问句文本相似度(去除停词)
print(nlp.query_text_similarity(question=query, docs=documents))

# 混合相似度
print(nlp.hybrid_similarity(question=query, question_vector=query_vector, docs_vector=docs_vector, docs=documents))

# 问句混合相似度(去除停词)
print(
    nlp.query_hybrid_similarity(question=query, question_vector=query_vector, docs_vector=docs_vector, docs=documents))

# 向量相似度
print(nlp.vector_similarity(question_vector=query_vector, docs_vector=docs_vector))

# 新词发现
from duowen_huqie.new_word_detection import NewWordDetection

nw = NewWordDetection(nlp)
result, new_word = nw.find_word('高祖,沛豐邑中陽裏人也,姓劉氏。母媼嘗息大澤之陂,夢與神遇。是時雷電晦冥 ,父太公往視,則見交龍於上。已而有娠,遂產高祖。高祖為人,隆准而龍顏,美須髯,左股有七十二黑子。寬仁愛人,意豁如也。常有 大度,不事家人生產作業。及壯,試吏,為泗上亭長,延中吏無所不狎侮。好酒及色。 常從王媼、武負貰酒,時飲醉臥,武負、王媼見其上常有怪。', 3, 5)
for k, v in new_word.items():
    print(k, v)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

duowen_huqie-0.1.10.tar.gz (29.5 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

duowen_huqie-0.1.10-py3-none-any.whl (29.7 MB view details)

Uploaded Python 3

File details

Details for the file duowen_huqie-0.1.10.tar.gz.

File metadata

  • Download URL: duowen_huqie-0.1.10.tar.gz
  • Upload date:
  • Size: 29.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for duowen_huqie-0.1.10.tar.gz
Algorithm Hash digest
SHA256 ef07f417daf7090f0c7f54b02ea6326dd90df9b346a745f5c910a58aa2d3e1a7
MD5 74f6581e3243117a5798d3c2ea53e0df
BLAKE2b-256 b2b8fb5e1f2d5b1d85b46ae371a35cfdc06c2632efce106eb1f9fef49d4603c3

See more details on using hashes here.

File details

Details for the file duowen_huqie-0.1.10-py3-none-any.whl.

File metadata

  • Download URL: duowen_huqie-0.1.10-py3-none-any.whl
  • Upload date:
  • Size: 29.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for duowen_huqie-0.1.10-py3-none-any.whl
Algorithm Hash digest
SHA256 57ef7f586944e99939ba1602398901266bc11f88657f164e608196bade75ad6f
MD5 f0484fa64ee89bc4021a759fcdd3e368
BLAKE2b-256 629a8ed41cce2da597090f156c352fd0fc446ae165e9f2bf8886b7687342a584

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page