Text and document similarity computation
simtext
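simtext computes four text similarity measures between two documents:
- Sim_Cosine: cosine similarity
- Sim_Jaccard: Jaccard similarity
- Sim_MinEdit: minimum edit distance
- Sim_Simple: a similarity measure modeled on Track Changes in Microsoft Word
For details of the algorithms, see Cohen, Lauren, Christopher Malloy & Quoc Nguyen (2018), p. 60.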
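For intuition, the first two measures can be sketched in a few lines of plain Python. This is a minimal illustration, not simtext's own implementation: the `tokenize` helper and both function names are assumptions made for this example, and the tokenizer only handles space-delimited text (Chinese text would need a word segmenter first).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(text1, text2):
    """Cosine of the angle between the two term-frequency vectors."""
    tf1, tf2 = Counter(tokenize(text1)), Counter(tokenize(text2))
    # Only words present in both documents contribute to the dot product.
    dot = sum(tf1[w] * tf2[w] for w in tf1.keys() & tf2.keys())
    norm1 = math.sqrt(sum(c * c for c in tf1.values()))
    norm2 = math.sqrt(sum(c * c for c in tf2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def jaccard_similarity(text1, text2):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set1, set2 = set(tokenize(text1)), set(tokenize(text2))
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)


A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
print(cosine_similarity(A, B))   # approximately 0.913
print(jaccard_similarity(A, B))  # approximately 0.833 (5 shared words out of 6)
```

On these two sentences the sketch agrees with the Sim_Cosine and Sim_Jaccard values that simtext reports for the same pair, up to floating-point rounding.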
Installation
pip install simtext
Usage
Chinese text similarity
from simtext import similarity
text1 = '在宏观经济背景下,为继续优化贷款结构,重点发展可以抵抗经济周期不良的贷款'
text2 = '在宏观经济背景下,为继续优化贷款结构,重点发展可三年专业化、集约化、综合金融+物联网金融四大金融特色的基础上'
sim = similarity()
res = sim.compute(text1, text2)
print(res)
Run
{'Sim_Cosine': 0.46475800154489,
'Sim_Jaccard': 0.3333333333333333,
'Sim_MinEdit': 29,
'Sim_Simple': 0.9889595182335229}
English text similarity
from simtext import similarity
A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
C = 'We expect weakness in sales'
sim = similarity()
AB = sim.compute(A, B)
AC = sim.compute(A, C)
print(AB)
print(AC)
Run
{'Sim_Cosine': 0.9128709291752769,
'Sim_Jaccard': 0.8333333333333334,
'Sim_MinEdit': 2,
'Sim_Simple': 0.9545454545454546}
{'Sim_Cosine': 0.39999999999999997,
'Sim_Jaccard': 0.25,
'Sim_MinEdit': 4,
'Sim_Simple': 0.9315789473684211}
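Unlike the other three measures, Sim_MinEdit is reported as a count, not a ratio. As a rough illustration of what a minimum edit distance measures, here is a textbook word-level Levenshtein distance. It is a sketch under assumptions, not simtext's implementation: the function name, the tokenizer, and the unit cost for each insertion, deletion, and substitution are all choices made for this example. Its counts differ from the package's output: it gives 1 for A and B (one inserted word) and 3 for A and C, whereas simtext reports 2 and 4.

```python
import re


def word_edit_distance(text1, text2):
    """Word-level Levenshtein distance with unit costs (illustrative only)."""
    a = re.findall(r"\w+", text1.lower())
    b = re.findall(r"\w+", text2.lower())
    # prev[j] holds the distance between the words of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]  # turning i words into zero words takes i deletions
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(
                prev[j] + 1,         # delete word_a
                curr[j - 1] + 1,     # insert word_b
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
C = 'We expect weakness in sales'
print(word_edit_distance(A, B))
print(word_edit_distance(A, C))
```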
References
Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. Lazy prices. No. w25084. National Bureau of Economic Research, 2018.