
# tkitSimhash

Project description

Remove duplicates (duplicate content filtering)

tkitSimhash zh

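As a rule of thumb, two documents can be judged similar when the Hamming distance between their fingerprints is less than 3. The book《数学之美》(The Beauty of Mathematics) describes this algorithm in detail in its discussion of information fingerprints.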
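The rule above can be illustrated with a minimal sketch. It assumes fingerprints are equal-length strings of `0` and `1`; the helper names `hamming_distance` and `is_similar` are hypothetical and are not part of tkitSimhash.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the bit positions at which two equal-length binary strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(code_a, code_b))


def is_similar(code_a: str, code_b: str, threshold: int = 3) -> bool:
    """Rule of thumb: similar when the distance is below the threshold."""
    return hamming_distance(code_a, code_b) < threshold


# Toy 4-bit fingerprints (illustrative values, not real simhash output)
print(hamming_distance("1011", "1001"))  # 1
print(is_similar("1011", "1001"))        # True
print(is_similar("1111", "0000"))        # False
```

The threshold of 3 is taken from the rule of thumb and is exposed as a parameter, since a suitable cutoff may depend on the corpus.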

from tkitSimhash import simHash

# Create the simhash helper
sim = simHash()
text1 = """' , in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against hordes of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter.  \nRelated: Screenshots From The New Resident Evil Have Leaked  \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, the Resident Evil name sure has the clout needed to get people to pay attention to the new series.  \n  \nCapcom has been experimenting with multiplayer in its Resident Evil games for years. This dates all the way back to Resident Evil ."""
text2 = """, in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against  of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter.  \nRelated: Screenshots From The New Resident Evil Have Leaked  \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, its Resident Evil games for years. This dates all the way back to Resident Evil  """
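# Compute the simhash of each text
a = sim.simhash(text1)
b = sim.simhash(text2)

# print(a)
# print(sim.subcode(a))
# print(b)
# print(sim.subcode(b))

# Split each fingerprint into subcodes; similarity only needs to be
# computed when at least one subcode is identical.
print("Split into subcodes; similarity only needs to be computed when at least one subcode matches")
code_a = sim.autoencode([text1])[0]
print(code_a)
code_b = sim.autoencode([text2])[0]
print(code_b)

# Print the similarity score and the Hamming distance as a tuple
print((sim.similarity(code_a['code'], code_b['code']), sim.getdistance(code_a['code'], code_b['code'])))

Split into subcodes; similarity only needs to be computed when at least one subcode matches
{'subcode': ['1101100011001100', '1010110001010111', '0101101101110111', '0001111011011101'], 'code': '1101100011001100101011000101011101011011011101110001111011011101'}
{'subcode': ['1101100110001100', '1010110001010111', '0001111101110111', '0001111011011101'], 'code': '1101100110001100101011000101011100011111011101110001111011011101'}
(0.999999910089919, 4)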
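The message printed in the example says that similarity only needs to be computed when at least one subcode matches. This follows from the pigeonhole principle: if two 64-bit fingerprints differ in at most 3 bits and each is split into 4 blocks, at least one block must be identical. The sketch below shows that pre-filter on the two codes from the output above. The helper names `split_subcodes` and `shares_subcode` are hypothetical, and this is not necessarily how tkitSimhash implements it.

```python
from typing import List


def split_subcodes(code: str, parts: int = 4) -> List[str]:
    """Split a binary fingerprint string into equally sized blocks."""
    size = len(code) // parts
    return [code[i * size:(i + 1) * size] for i in range(parts)]


def shares_subcode(code_a: str, code_b: str, parts: int = 4) -> bool:
    """Return True when at least one block is identical at the same position."""
    blocks_a = split_subcodes(code_a, parts)
    blocks_b = split_subcodes(code_b, parts)
    return any(x == y for x, y in zip(blocks_a, blocks_b))


# The two 'code' values from the example output
code_a = "1101100011001100101011000101011101011011011101110001111011011101"
code_b = "1101100110001100101011000101011100011111011101110001111011011101"

print(split_subcodes(code_a))
print(shares_subcode(code_a, code_b))  # True: the second and fourth blocks match
```

Indexing documents by their subcodes lets a deduplication pass compare only those pairs that share a block, rather than every pair in the collection.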

Update

0.0.1.6: Fixed dependencies (pytest==7.1.3 and nltk)

0.0.1.5: Fixed dependencies (pytest==7.1.3 and nltk)

0.0.1.4: Changed the word list to text

Download files

Source distribution: tkitSimhash-0.0.1.9.tar.gz (5.9 kB)

Built distribution: tkitSimhash-0.0.1.9-py2.py3-none-any.whl (6.3 kB, Python 2 and Python 3)

File hashes

Hashes for tkitSimhash-0.0.1.9.tar.gz

  • SHA256: 6206ed590796ae3fc26268d74908c631dba5ad98288510ca0d015335cbfc0009
  • MD5: f6ffc5aa263a46634b9c1f8b7a100b31
  • BLAKE2b-256: 713de02890d2d0178ab77e75ef8e76fa3acc1620363df3078f5f7251f5a968db

Hashes for tkitSimhash-0.0.1.9-py2.py3-none-any.whl

  • SHA256: 2b07211229352ee23e7de38b91bc86f0bd0c19715c80072c0c98dd5727156611
  • MD5: 76dfa38b4b18ad8627d7b75fa7bedc65
  • BLAKE2b-256: 2cbb596e5ad25d197479b649903081de9fcfc971c71ddcc0190a818050852cfe
