# Remove duplicates (duplicate content filtering)

tkitSimhash zh

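As a rule of thumb, two documents can be judged similar when the Hamming distance between their fingerprints is less than 3. The book 《数学之美》 (The Beauty of Mathematics) describes this algorithm in detail in its discussion of information fingerprints.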
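The rule of thumb above can be sketched in plain Python. This is a minimal illustration, not tkitSimhash's own implementation: the helper names `hamming_distance` and `is_near_duplicate` are hypothetical, and fingerprints are assumed to be equal-length strings of `0` and `1`.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("fingerprints must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def is_near_duplicate(code_a: str, code_b: str, threshold: int = 3) -> bool:
    """Rule of thumb: similar when the distance is below the threshold."""
    return hamming_distance(code_a, code_b) < threshold


# Toy 4-bit fingerprints that differ in exactly one position.
print(hamming_distance("1011", "1001"))   # 1
print(is_near_duplicate("1011", "1001"))  # True, because 1 < 3
```

The threshold is a tunable assumption: a smaller value flags fewer pairs as duplicates, a larger one flags more.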

from tkitSimhash import simHash
sim=simHash()
text1 = """' , in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against hordes of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter.  \nRelated: Screenshots From The New Resident Evil Have Leaked  \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, the Resident Evil name sure has the clout needed to get people to pay attention to the new series.  \n  \nCapcom has been experimenting with multiplayer in its Resident Evil games for years. This dates all the way back to Resident Evil ."""
text2 = """, in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against  of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter.  \nRelated: Screenshots From The New Resident Evil Have Leaked  \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, its Resident Evil games for years. This dates all the way back to Resident Evil  """
a = sim.simhash(text1)
b = sim.simhash(text2)

# print(a)
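# Split each fingerprint into subcodes. Similarity only needs to be computed
# when the two texts share at least one identical subcode.
print("Splitting into subcodes; similarity only needs to be computed when at least one subcode matches")
code_a = sim.autoencode([text1])[0]
print(code_a)
code_b = sim.autoencode([text2])[0]
print(code_b)
# print(sim.subcode(a))

# print(b)
# print(sim.subcode(b))

# Print the similarity score and the Hamming distance between the two full codes.
print((sim.similarity(code_a['code'], code_b['code']), sim.getdistance(code_a['code'], code_b['code'])))

Output:

Splitting into subcodes; similarity only needs to be computed when at least one subcode matches
{'subcode': ['1101100011001100', '1010110001010111', '0101101101110111', '0001111011011101'], 'code': '1101100011001100101011000101011101011011011101110001111011011101'}
{'subcode': ['1101100110001100', '1010110001010111', '0001111101110111', '0001111011011101'], 'code': '1101100110001100101011000101011100011111011101110001111011011101'}
(0.999999910089919, 4)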
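The subcode check in the example works as a cheap pre-filter: if two codes differ in fewer bits than there are blocks, at least one block must be identical, so pairs with no matching block can be skipped before any distance is computed. The sketch below illustrates that idea using the two codes from the example output. The helpers `split_subcodes` and `matching_subcodes` are hypothetical, and comparing blocks position by position is an assumption, not a confirmed detail of how tkitSimhash does it.

```python
from typing import List


def split_subcodes(code: str, parts: int = 4) -> List[str]:
    """Split a bit string into `parts` equal-length blocks."""
    if len(code) % parts != 0:
        raise ValueError("code length must be divisible by the number of parts")
    size = len(code) // parts
    return [code[i:i + size] for i in range(0, len(code), size)]


def matching_subcodes(code_a: str, code_b: str, parts: int = 4) -> List[int]:
    """Return the indexes of blocks that are identical in both codes."""
    blocks_a = split_subcodes(code_a, parts)
    blocks_b = split_subcodes(code_b, parts)
    return [i for i, (x, y) in enumerate(zip(blocks_a, blocks_b)) if x == y]


# The two 64-bit codes from the example output, written as their 16-bit subcodes.
code_a = "".join(["1101100011001100", "1010110001010111",
                  "0101101101110111", "0001111011011101"])
code_b = "".join(["1101100110001100", "1010110001010111",
                  "0001111101110111", "0001111011011101"])

shared = matching_subcodes(code_a, code_b)
print(shared)        # [1, 3]: the second and fourth blocks are identical
print(bool(shared))  # True, so this pair is worth a full comparison
```

In a larger collection, each block could serve as a lookup key so that only documents sharing a key are compared in full.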
