Skip to main content

文本、文档相似性计算

Project description

simtext

simtext可以计算两文档间四大文本相似性指标,分别为:

  • Sim_Cosine cosine相似性
  • Sim_Jaccard Jaccard相似性
  • Sim_MinEdit 最小编辑距离
  • Sim_Simple 微软Word中的track changes

具体算法介绍可翻看Cohen, Lauren, Christopher Malloy&Quoc Nguyen(2018) 第60页

安装

pip install simtext

使用

中文文本相似性

from simtext import similarity

text1 = '在宏观经济背景下,为继续优化贷款结构,重点发展可以抵抗经济周期不良的贷款'
text2 = '在宏观经济背景下,为继续优化贷款结构,重点发展可三年专业化、集约化、综合金融+物联网金融四大金融特色的基础上'

sim = similarity()
res = sim.compute(text1, text2)
print(res)

Run

{'Sim_Cosine': 0.46475800154489, 
'Sim_Jaccard': 0.3333333333333333, 
'Sim_MinEdit': 29, 
'Sim_Simple': 0.9889595182335229}

英文文本相似性

from simtext import similarity

A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
C = 'We expect weakness in sales'

sim = similarity()
AB = sim.compute(A, B)
AC = sim.compute(A, C)

print(AB)
print(AC)

Run

{'Sim_Cosine': 0.9128709291752769, 
'Sim_Jaccard': 0.8333333333333334, 
'Sim_MinEdit': 2, 
'Sim_Simple': 0.9545454545454546}

{'Sim_Cosine': 0.39999999999999997, 
'Sim_Jaccard': 0.25, 
'Sim_MinEdit': 4, 
'Sim_Simple': 0.9315789473684211}

参考文献

Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. Lazy prices. No. w25084. National Bureau of Economic Research, 2018.

如果

如果您是经管人文社科专业背景,编程小白,面临海量文本数据采集和处理分析艰巨任务,个人建议学习《python网络爬虫与文本数据分析》视频课。作为文科生,一样也是从两眼一抹黑开始,这门课程是用五年时间凝缩出来的。自认为讲的很通俗易懂o( ̄︶ ̄)o,

  • python入门
  • 网络爬虫
  • 数据读取
  • 文本分析入门
  • 机器学习与文本分析
  • 文本分析在经管研究中的应用

感兴趣的童鞋不妨 戳一下《python网络爬虫与文本数据分析》进来看看~

更多

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

simtext-1.3.macosx-10.9-x86_64.tar.gz (5.3 kB view details)

Uploaded Source

Built Distribution

simtext-1.3-py3-none-any.whl (4.9 kB view details)

Uploaded Python 3

File details

Details for the file simtext-1.3.macosx-10.9-x86_64.tar.gz.

File metadata

  • Download URL: simtext-1.3.macosx-10.9-x86_64.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.5

File hashes

Hashes for simtext-1.3.macosx-10.9-x86_64.tar.gz
Algorithm Hash digest
SHA256 6c0c7b0f66f9c1bac76a0791ef1af1fcca335ba9bf66cad33b7570aee13bc1d3
MD5 6edbc59835962c5d77a286be1ffa74b3
BLAKE2b-256 c9e2bcac788da63c1c87b8060f849457a8122494782bf83c16698dfc87dde8af

See more details on using hashes here.

File details

Details for the file simtext-1.3-py3-none-any.whl.

File metadata

  • Download URL: simtext-1.3-py3-none-any.whl
  • Upload date:
  • Size: 4.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.5

File hashes

Hashes for simtext-1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 ab82707e42cc9528431e2e976a82804cd64172586733a35d8de0afa9733fdcec
MD5 5d0064a6af93e8b97099f25e33f706fc
BLAKE2b-256 3846a0270214a8c497675dfdb36a0f830d3888e2001bf51534a47942ea03ccf6

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page