# Remove duplicates: duplicate content filtering
tkitSimhash zh
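As a rule of thumb, two documents can be considered similar when the Hamming distance between their fingerprints is less than 3. The book 《数学之美》 (The Beauty of Mathematics) describes this algorithm in detail in its discussion of information fingerprints.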
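To make the rule of thumb concrete, here is a minimal sketch of a Hamming distance check on fingerprints written as binary strings. The helper names `hamming_distance` and `is_near_duplicate` and the default `threshold` are assumptions chosen for illustration; they are not part of the tkitSimhash API.

```python
# Hypothetical helpers for illustration; they are not part of tkitSimhash.

def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the positions at which two equal-length binary strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("fingerprints must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def is_near_duplicate(code_a: str, code_b: str, threshold: int = 3) -> bool:
    """Rule of thumb from the text: similar when the distance is below the threshold."""
    return hamming_distance(code_a, code_b) < threshold


# Two short toy fingerprints that differ in exactly two positions.
print(hamming_distance("10110010", "10010011"))
print(is_near_duplicate("10110010", "10010011"))
```

The threshold is a tuning choice: a larger value catches more near-duplicates but also flags more unrelated documents.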
from tkitSimhash import simHash
sim = simHash()
text1 = """' , in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against hordes of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter. \nRelated: Screenshots From The New Resident Evil Have Leaked \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, the Resident Evil name sure has the clout needed to get people to pay attention to the new series. \n \nCapcom has been experimenting with multiplayer in its Resident Evil games for years. This dates all the way back to Resident Evil ."""
text2 = """, in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter. \nRelated: Screenshots From The New Resident Evil Have Leaked \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, its Resident Evil games for years. This dates all the way back to Resident Evil """
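# Hash each text directly.
a = sim.simhash(text1)
b = sim.simhash(text2)
# print(a)
# Split each code into subcodes. Similarity only needs to be computed
# when at least one subcode is identical.
print("Split into subcodes; similarity only needs to be computed when at least one subcode matches")
code_a = sim.autoencode([text1])[0]
print(code_a)
code_b = sim.autoencode([text2])[0]
print(code_b)
# print(sim.subcode(a))
# print(b)
# print(sim.subcode(b))
# Print the similarity score and the Hamming distance as a tuple.
print((sim.similarity(code_a['code'], code_b['code']), sim.getdistance(code_a['code'], code_b['code'])))
Output:
Split into subcodes; similarity only needs to be computed when at least one subcode matches
{'subcode': ['1101100011001100', '1010110001010111', '0101101101110111', '0001111011011101'], 'code': '1101100011001100101011000101011101011011011101110001111011011101'}
{'subcode': ['1101100110001100', '1010110001010111', '0001111101110111', '0001111011011101'], 'code': '1101100110001100101011000101011100011111011101110001111011011101'}
(0.999999910089919, 4)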
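The subcode check works as a cheap pre-filter: when a 64-bit code is split into four 16-bit blocks, two codes that differ in at most three bits must share at least one identical block, so pairs with no shared block can be skipped. The sketch below reproduces that idea with the two codes from the output above. The helpers `split_subcodes`, `shares_subcode`, and `hamming_distance` are hypothetical and written for illustration; comparing blocks position by position is an assumption, not a description of how tkitSimhash does it internally.

```python
from typing import List

# Hypothetical helpers for illustration; they are not part of tkitSimhash.


def split_subcodes(code: str, parts: int = 4) -> List[str]:
    """Split a binary string into equally sized consecutive blocks."""
    size = len(code) // parts
    return [code[i * size:(i + 1) * size] for i in range(parts)]


def shares_subcode(code_a: str, code_b: str, parts: int = 4) -> bool:
    """Return True when at least one block is identical at the same position."""
    pairs = zip(split_subcodes(code_a, parts), split_subcodes(code_b, parts))
    return any(block_a == block_b for block_a, block_b in pairs)


def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the positions at which two equal-length binary strings differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


# The two 64-bit codes from the example output, written as four 16-bit blocks.
code_a = (
    "1101100011001100" "1010110001010111"
    "0101101101110111" "0001111011011101"
)
code_b = (
    "1101100110001100" "1010110001010111"
    "0001111101110111" "0001111011011101"
)

# Only compute the full distance when the cheap pre-filter passes.
if shares_subcode(code_a, code_b):
    print(hamming_distance(code_a, code_b))
```

For these two codes the second and fourth blocks are identical, so the pair passes the pre-filter, and the computed distance agrees with the value of 4 reported in the output above.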
update
0.0.1.6 Fixed dependencies: pytest==7.1.3 and nltk
0.0.1.5 Fixed dependencies: pytest==7.1.3 and nltk
0.0.1.4 Changed the word list to text