Text and document similarity computation
Project description
simtext
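simtext computes four text similarity measures between two documents:
- Sim_Cosine: cosine similarity
- Sim_Jaccard: Jaccard similarity
- Sim_MinEdit: minimum edit distance
- Sim_Simple: a simple similarity based on Microsoft Word's Track Changes
For details of the algorithms, see Cohen, Malloy, and Nguyen (2018), p. 60.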
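To make the first three measures concrete, here is a minimal sketch in plain Python. It is not simtext's implementation: the `tokenize` helper, the lowercase word-level tokenization, and the unit-cost edit convention are all assumptions chosen for illustration, and Sim_Simple is left out because its exact procedure is not described here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the two term-frequency vectors."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count * tb[word] for word, count in ta.items())
    norm_a = math.sqrt(sum(c * c for c in ta.values()))
    norm_b = math.sqrt(sum(c * c for c in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Shared distinct words divided by all distinct words."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def min_edit_distance(a, b):
    """Word-level Levenshtein distance (assumed cost 1 per insert, delete, substitute)."""
    xs, ys = tokenize(a), tokenize(b)
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        curr = [i]
        for j, y in enumerate(ys, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
print(cosine_similarity(A, B), jaccard_similarity(A, B), min_edit_distance(A, B))
```

The edit count depends heavily on the tokenization and cost convention, so the number this sketch produces need not match what simtext reports for Sim_MinEdit.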
Installation
pip install simtext
Usage
Chinese text similarity
from simtext import similarity
text1 = '在宏观经济背景下,为继续优化贷款结构,重点发展可以抵抗经济周期不良的贷款'
text2 = '在宏观经济背景下,为继续优化贷款结构,重点发展可三年专业化、集约化、综合金融+物联网金融四大金融特色的基础上'
sim = similarity()
res = sim.compute(text1, text2)
print(res)
Run
{'Sim_Cosine': 0.46475800154489,
'Sim_Jaccard': 0.3333333333333333,
'Sim_MinEdit': 29,
'Sim_Simple': 0.9889595182335229}
English text similarity
from simtext import similarity
A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
C = 'We expect weakness in sales'
sim = similarity()
AB = sim.compute(A, B)
AC = sim.compute(A, C)
print(AB)
print(AC)
Run
{'Sim_Cosine': 0.9128709291752769,
'Sim_Jaccard': 0.8333333333333334,
'Sim_MinEdit': 2,
'Sim_Simple': 0.9545454545454546}
{'Sim_Cosine': 0.39999999999999997,
'Sim_Jaccard': 0.25,
'Sim_MinEdit': 4,
'Sim_Simple': 0.9315789473684211}
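When reading the two results side by side, keep in mind that the measures do not all point the same way. The sketch below assumes that Sim_MinEdit is a raw edit count (smaller means closer) while the other three are similarities (larger means closer); that reading is inferred from the printed values, not confirmed by the package. The `closer_pair` helper is hypothetical, and the dictionaries are copied from the output above.

```python
# Result dictionaries copied from the printed output of the English example.
AB = {'Sim_Cosine': 0.9128709291752769,
      'Sim_Jaccard': 0.8333333333333334,
      'Sim_MinEdit': 2,
      'Sim_Simple': 0.9545454545454546}
AC = {'Sim_Cosine': 0.39999999999999997,
      'Sim_Jaccard': 0.25,
      'Sim_MinEdit': 4,
      'Sim_Simple': 0.9315789473684211}

# Assumption: this metric is a distance, so lower values mean more similar.
LOWER_IS_CLOSER = {'Sim_MinEdit'}


def closer_pair(results):
    """For each metric, return the label of the pair judged most similar."""
    winners = {}
    metrics = next(iter(results.values())).keys()
    for metric in metrics:
        pick = min if metric in LOWER_IS_CLOSER else max
        winners[metric] = pick(results, key=lambda label: results[label][metric])
    return winners


winners = closer_pair({'AB': AB, 'AC': AC})
print(winners)
```

Under that assumption, every metric agrees that A is closer to B than to C.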
References
Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. Lazy prices. No. w25084. National Bureau of Economic Research, 2018.
More
- WeChat official account: 大邓和他的python
Download files
File details
Details for the file simtext-1.3.macosx-10.9-x86_64.tar.gz.
File metadata
- Download URL: simtext-1.3.macosx-10.9-x86_64.tar.gz
- Upload date:
- Size: 5.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c0c7b0f66f9c1bac76a0791ef1af1fcca335ba9bf66cad33b7570aee13bc1d3 |
| MD5 | 6edbc59835962c5d77a286be1ffa74b3 |
| BLAKE2b-256 | c9e2bcac788da63c1c87b8060f849457a8122494782bf83c16698dfc87dde8af |
File details
Details for the file simtext-1.3-py3-none-any.whl.
File metadata
- Download URL: simtext-1.3-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab82707e42cc9528431e2e976a82804cd64172586733a35d8de0afa9733fdcec |
| MD5 | 5d0064a6af93e8b97099f25e33f706fc |
| BLAKE2b-256 | 3846a0270214a8c497675dfdb36a0f830d3888e2001bf51534a47942ea03ccf6 |