
An image deduplication toolkit.


ImageDeduplication

Image deduplication

Project repository: https://github.com/firstelfin/ImageDeduplication
PyPI: https://pypi.org/project/imgDedup/

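Custom deduplication data type: HashCode

HashCode is a data class with the attributes success, value (the hex string of the pHash), error, and img_path. It implements the __sub__, __repr__, __bool__, and to_dict methods, and it is the basic data structure used throughout deduplication.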
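
To make the shape of such a data class concrete, here is a minimal sketch. It is not the package's actual implementation: the class name HashCodeSketch, the field defaults, and the reading of subtraction as a Hamming distance between the two hex values are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(repr=False)
class HashCodeSketch:
    """Hypothetical stand-in for HashCode; field names follow the description above."""

    success: bool
    value: Optional[str] = None      # hex string of the fingerprint
    error: Optional[str] = None      # error message when hashing failed
    img_path: Optional[str] = None   # where the image came from

    def __sub__(self, other: "HashCodeSketch") -> int:
        # Assumption: `a - b` is the Hamming distance between the two fingerprints.
        if not (self and other):
            raise ValueError("cannot compare a failed hash")
        return bin(int(self.value, 16) ^ int(other.value, 16)).count("1")

    def __bool__(self) -> bool:
        # A HashCode is truthy only when hashing succeeded.
        return self.success

    def __repr__(self) -> str:
        return f"HashCodeSketch(value={self.value!r}, img_path={self.img_path!r})"

    def to_dict(self) -> dict:
        return asdict(self)


a = HashCodeSketch(True, "ff00", img_path="a.jpg")
b = HashCodeSketch(True, "ff01", img_path="b.jpg")
failed = HashCodeSketch(False, error="unreadable file", img_path="c.jpg")

distance = a - b          # number of differing bits
record = a.to_dict()      # plain dict, ready for JSON serialization
```

Making failed hashes falsy lets a caller filter them out with a plain `if code:` before comparing fingerprints.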

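Core deduplication algorithm: pHash

Code path: imgDedup/tools/imageFingerprint.py

imageFingerprint:get_phash implements the perceptual hash algorithm on top of imagehash.phash and computes an image's fingerprint. The input can be either an image path or an image ndarray.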
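
For readers unfamiliar with perceptual hashing, the sketch below walks through the usual DCT-based steps in pure Python: transform a small grayscale image, keep the low-frequency block, and threshold it at the median. It is an illustration only, not the code of get_phash or imagehash.phash. All function names are hypothetical, and the input is assumed to be an already resized square grayscale matrix.

```python
import math
from statistics import median


def dct_1d(vec):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(vec)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i, v in enumerate(vec))
        for k in range(n)
    ]


def dct_2d(matrix):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order


def phash_bits(gray, hash_size=8):
    """Return hash_size*hash_size bits; gray is a square list of lists."""
    freq = dct_2d(gray)
    low = [v for row in freq[:hash_size] for v in row[:hash_size]]  # low frequencies
    med = median(low)
    return [1 if v > med else 0 for v in low]


def bits_to_hex(bits):
    """Pack the bits into a zero-padded hex string."""
    width = (len(bits) + 3) // 4
    return format(int("".join(map(str, bits)), 2), f"0{width}x")


def hamming(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


# Synthetic 32x32 "image" and a uniformly brightened copy of it.
gray = [[(x * x * 3 + y * 7 + x * y) % 256 for x in range(32)] for y in range(32)]
brighter = [[v + 10 for v in row] for row in gray]

bits = phash_bits(gray)
fingerprint = bits_to_hex(bits)
# A uniform brightness shift mainly affects the DC coefficient, so the two
# fingerprints should differ in few or no bits.
shift_distance = hamming(bits, phash_bits(brighter))
```

With hash_size=8 the fingerprint has 64 bits (16 hex characters); the usage examples in this document pass hash_size=16.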

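Deduplication pipeline 1: deduplicating a single dataset

Code path: imgDedup/utils/deduplication.py

The deduplication:SelfDeduplication class deduplicates a single dataset. It loads the HashCode of every image concurrently, initializes an empty list of kept entries, and then iterates over the HashCodes: a HashCode is appended to the list only if it is not similar to any HashCode already in the list. The list of kept HashCodes is returned at the end.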
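
The logic described above can be sketched as follows. This is a simplified illustration rather than the SelfDeduplication source: fingerprints are plain integers, the function names are hypothetical, and "similar" is assumed to mean a Hamming distance less than or equal to the threshold.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def greedy_dedup(items, fingerprint, threshold=5, workers=4):
    """Keep an item only if it is not similar to any item kept so far."""
    # Step 1: compute all fingerprints concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(fingerprint, items))

    # Step 2: greedy pass over the fingerprints.
    kept = []  # list of (item, code) pairs
    for item, code in zip(items, codes):
        if all(hamming(code, kept_code) > threshold for _, kept_code in kept):
            kept.append((item, code))
    return [item for item, _ in kept]


# Stand-in for a real hash function: look up precomputed 8-bit fingerprints.
fake_hashes = {
    "a.jpg": 0b11110000,
    "a_copy.jpg": 0b11110001,  # 1 bit away from a.jpg
    "b.jpg": 0b00001111,       # 8 bits away from a.jpg
}
unique = greedy_dedup(list(fake_hashes), fake_hashes.get, threshold=2)
# unique == ["a.jpg", "b.jpg"]
```

Each candidate is compared against every kept entry, so the greedy pass costs on the order of n × k comparisons for n images and k kept images. The result also depends on iteration order, because the first image of each near-duplicate group is the one that survives.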

Usage example:

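>>> from pathlib import Path
>>> # Import path inferred from the code path above.
>>> from imgDedup.utils.deduplication import SelfDeduplication
>>> sd = SelfDeduplication(
...     src_dir=Path("xxxx"),
...     dst_dir=Path("xxxxx"),
...     use_link=True,
...     threshold=5,
...     hash_size=16
... )
>>> sd(save_json_path=Path("xxx/dedup测试/status/deduplication_record.json"))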

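Deduplication pipeline 2: deduplicating multiple datasets

Code path: imgDedup/utils/deduplication.py

The deduplication:CrossDatasetDeduplication class deduplicates across multiple datasets. It loads the HashCodes of the images in every dataset concurrently, initializes an empty list of kept entries, and then iterates over the HashCodes: a HashCode is appended to the list only if it is not similar to any HashCode already in the list. The list of kept HashCodes is returned at the end.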
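
As a rough illustration of how one greedy pass can span several datasets, the sketch below tags every kept image with the dataset it came from. It is not the package's implementation: the function name, the integer fingerprints, the dataset ordering, and the rule that a distance less than or equal to the threshold counts as similar are all assumptions.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def cross_dataset_dedup(datasets, threshold):
    """datasets maps a dataset name to {image name: integer fingerprint}.

    Datasets are visited in insertion order, so an image from an earlier
    dataset wins over a near-duplicate found in a later one.
    """
    kept = []  # (dataset name, image name, fingerprint)
    for ds_name, fingerprints in datasets.items():
        for img, code in fingerprints.items():
            if all(hamming(code, kept_code) > threshold for _, _, kept_code in kept):
                kept.append((ds_name, img, code))
    return [(ds_name, img) for ds_name, img, _ in kept]


datasets = {
    "set_a": {"cat.jpg": 0b11110000, "dog.jpg": 0b00001111},
    "set_b": {"cat_resized.jpg": 0b11110001, "bird.jpg": 0b10101010},
}
unique_pairs = cross_dataset_dedup(datasets, threshold=2)
# cat_resized.jpg is 1 bit away from cat.jpg in set_a, so it is dropped.
```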

Usage example:

>>> from pathlib import Path
>>> mdl = MultiDeduplication(
...     src_dir=Path("xxx/xxxx_deduplication_record.json"),
...     dst_dir=Path("xxxxx/images"),
...     targets=[
...         Path("xxxxxx/images"),
...         Path("aaaaaa/images"),
...         Path("ssssss/images"),
...         Path("wwwwww/images"),
...         Path("ffffff/xxxxxx_deduplication_record.json"),
...         Path("ccccccc/eeeeee_deduplication_record.json"),
...     ],
...     threshold=26,
...     use_link=False,
...     hash_size=16
... )
>>> mdl(save_json_path=Path("xxxss--202508_deduplication_record.json"))

Install

Install from source:

$ git clone https://github.com/firstelfin/ImageDeduplication.git
$ cd ImageDeduplication && pip install .

Install from PyPI:

$ pip install imgDedup
