
Text and document similarity calculation


simtext

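simtext computes four text similarity measures between two documents:

  • Sim_Cosine cosine similarity
  • Sim_Jaccard Jaccard similarity
  • Sim_MinEdit minimum edit distance
  • Sim_Simple simple similarity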

For details of the algorithms, see Cohen, Lauren, Christopher Malloy & Quoc Nguyen (2018), p. 60.
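To make the first two measures concrete, here is a minimal sketch of cosine and Jaccard similarity over word tokens. It is not simtext's own code: the helper names (`tokenize`, `cosine_similarity`, `jaccard_similarity`) and the tokenization rule (lowercased alphabetic words, English only) are assumptions for illustration, and the package may preprocess text differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic word tokens (English only).
    # simtext's real tokenization is not documented here and may differ.
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text1, text2):
    # Cosine of the angle between the two term-frequency vectors.
    tf1, tf2 = Counter(tokenize(text1)), Counter(tokenize(text2))
    dot = sum(tf1[w] * tf2[w] for w in tf1.keys() & tf2.keys())
    norm1 = math.sqrt(sum(c * c for c in tf1.values()))
    norm2 = math.sqrt(sum(c * c for c in tf2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def jaccard_similarity(text1, text2):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    s1, s2 = set(tokenize(text1)), set(tokenize(text2))
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)


A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
C = 'We expect weakness in sales'

print(round(cosine_similarity(A, B), 4))   # 0.9129
print(round(jaccard_similarity(A, B), 4))  # 0.8333
print(round(jaccard_similarity(A, C), 4))  # 0.25
```

A and B share five of six distinct words, so the Jaccard value is 5/6 and the cosine value is 5/sqrt(5·6).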

Installation

pip install simtext

Usage

Chinese text similarity

from simtext import similarity

text1 = '在宏观经济背景下,为继续优化贷款结构,重点发展可以抵抗经济周期不良的贷款'
text2 = '在宏观经济背景下,为继续优化贷款结构,重点发展可三年专业化、集约化、综合金融+物联网金融四大金融特色的基础上'

sim = similarity()
res = sim.compute(text1, text2)
print(res)

Output

{'Sim_Cosine': 0.46475800154489, 
'Sim_Jaccard': 0.3333333333333333, 
'Sim_MinEdit': 29, }

English text similarity

from simtext import similarity

A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
C = 'We expect weakness in sales'

sim = similarity()
AB = sim.compute(A, B)
AC = sim.compute(A, C)

print(AB)
print(AC)

Output

{'Sim_Cosine': 0.9128709291752769, 
'Sim_Jaccard': 0.8333333333333334, 
'Sim_MinEdit': 2, 
'Sim_Simple': 0.9545454545454546}

{'Sim_Cosine': 0.39999999999999997, 
'Sim_Jaccard': 0.25, 
'Sim_MinEdit': 4}
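Sim_MinEdit in the output above is an edit distance. As a rough illustration, the sketch below computes a standard Levenshtein distance over word tokens with unit cost for each insertion, deletion and substitution. This is one common formulation, not necessarily the one simtext uses. The function names and tokenization are assumptions, and on the example sentences it yields smaller counts than the package reports, so the package evidently tokenizes or counts edits differently.

```python
import re


def tokenize(text):
    # Assumption: lowercase alphabetic word tokens; simtext may tokenize differently.
    return re.findall(r"[a-z]+", text.lower())


def min_edit_distance(seq1, seq2):
    # Classic dynamic-programming Levenshtein distance, keeping only two rows.
    # prev[j] holds the distance between the first i-1 items of seq1
    # and the first j items of seq2.
    prev = list(range(len(seq2) + 1))
    for i, x in enumerate(seq1, start=1):
        curr = [i]
        for j, y in enumerate(seq2, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y, or match
        prev = curr
    return prev[-1]


A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'

# Word-level distance between the two example sentences under this sketch's assumptions.
print(min_edit_distance(tokenize(A), tokenize(B)))
```

The same function works on any pair of sequences, so passing two plain strings gives a character-level distance instead.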

References

Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. Lazy prices. No. w25084. National Bureau of Economic Research, 2018.



Files for simtext, version 1.1

  • simtext-1.1.macosx-10.9-x86_64.tar.gz (5.2 kB), file type: Source
  • simtext-1.1-py3-none-any.whl (4.1 kB), file type: Wheel, Python version: py3
