SNIP: compact index for large dataset

snip-dedup

PyPI - Version linting - Ruff format - Black license - MIT

SNIP is a very compact index (25GB) that has found roughly half a billion duplicates on the LAION-2B-en dataset. You may download the de-duplicated dataset below.

SNIP de-duplicated LAION-2B-en (L2B) on a standard home computer in just several days. We believe the community will benefit from such a dataset, in light of recent research showing the copyright and privacy risks associated with training generative models on highly duplicated datasets, and from SNIP itself as a de-duplication, compression and retrieval tool.

Install

pip install --upgrade snip-dedup

Usage

# List available commands
snip --help
snip download --help

# Download and deduplicate the first 10 shards of the dataset
snip download --start 0 --end 10

Then, you may download (deduplicated) laion2b images with the awesome img2dataset.

You may check the fidelity of the duplicates by randomly sampling labeled duplicates and using SNIP to detect the duplicate of each. You may do that with retrieve_dup_urls_demo.py (note that you will need the original metadata files for this).
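The sampling step can be sketched as follows. This is only an illustration, not the logic of retrieve_dup_urls_demo.py: it assumes the duplicate labels are available as one flag per image, and the function name `sample_duplicate_indices` and the toy flags are hypothetical.

```python
import random


def sample_duplicate_indices(is_dup, n_samples, seed=0):
    """Return up to n_samples random indices whose flag marks a duplicate."""
    # Collect the positions of all images labeled as duplicates.
    dup_indices = [i for i, flag in enumerate(is_dup) if flag]
    # A seeded generator keeps the sample reproducible between runs.
    rng = random.Random(seed)
    n = min(n_samples, len(dup_indices))
    return sorted(rng.sample(dup_indices, n))


# Toy flags standing in for the released binary array (layout is an assumption).
is_dup = [0, 1, 0, 0, 1, 1, 0, 1]
picked = sample_duplicate_indices(is_dup, n_samples=3)
print(picked)
```

Each sampled index could then be looked up in the original metadata files to inspect the image and the duplicate that SNIP reports for it.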

Roadmap

Coming soon with SNIP:

  • Train SNIP Indices on your features
  • Download full or sharded SNIP indices for various CLIP networks
  • Do semantic search with extremely compact indices (25 GB or less) on billions of images
  • Compress your features with SNIP descriptors
  • Read our research paper

About

** DISCLAIMER ** Use at your own risk. Help with improving de-duplication (higher accuracy, higher recall) is very much appreciated. Taking raw CLIP features as the ground truth for exact duplicates, we get nearly 81% precision (and likely much higher for near duplicates, see below).
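For readers unfamiliar with the metrics, here is a minimal sketch of how precision and recall are conventionally computed from per-image duplicate flags. The flag lists are made-up toy data, and nothing here is claimed to be the exact evaluation procedure behind the figure quoted above.

```python
def precision(predicted, ground_truth):
    """Fraction of predicted duplicates that are ground-truth duplicates."""
    true_pos = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    n_pred = sum(1 for p in predicted if p)
    return true_pos / n_pred if n_pred else 0.0


def recall(predicted, ground_truth):
    """Fraction of ground-truth duplicates that were predicted as duplicates."""
    true_pos = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    n_true = sum(1 for g in ground_truth if g)
    return true_pos / n_true if n_true else 0.0


# Toy flags: 1 = labeled duplicate, 0 = not a duplicate (hypothetical data).
predicted = [1, 1, 0, 1, 0, 1]     # e.g. labels from a compact index
ground_truth = [1, 0, 0, 1, 1, 1]  # e.g. labels from raw features

print(precision(predicted, ground_truth), recall(predicted, ground_truth))
```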

We release this index for public use and exploration of the LAION-2B-en dataset (more indices coming soon). Soon we will release tools to train your own SNIP indices as well as our scientific paper discussing the method in more detail.

You may find the following necessary files here:

Binary array of De-duplicated Images

SNIP index

SNIP descriptor

Other:

cumulative sizes of features (for indexing sharded files)
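A cumulative-size array is typically used to translate a global feature index into a shard number and an offset inside that shard. The sketch below shows the idea under an assumption about the layout: entry i holds the total number of features in shards 0 through i. The released file may use a different convention (for example a leading zero), and the name `locate` is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index, cumulative_sizes):
    """Map a global feature index to (shard_number, offset_within_shard).

    Assumes cumulative_sizes[i] is the total number of features in
    shards 0..i inclusive.
    """
    if not 0 <= global_index < cumulative_sizes[-1]:
        raise IndexError(f"index {global_index} is out of range")
    # The first shard whose cumulative size exceeds the index contains it.
    shard = bisect_right(cumulative_sizes, global_index)
    start = cumulative_sizes[shard - 1] if shard > 0 else 0
    return shard, global_index - start


# Toy shard sizes (hypothetical); real shards hold far more features.
shard_sizes = [4, 3, 5]
cumulative_sizes = list(accumulate(shard_sizes))  # [4, 7, 12]

print(locate(6, cumulative_sizes))
```

Binary search keeps the lookup logarithmic in the number of shards, which matters little for a handful of shards but costs nothing to do properly.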

Finding images overfit by Stable Diffusion

By analyzing the most duplicated images, we have found several more images copied verbatim by Stable Diffusion, posing a copyright problem:

sylvester stallone hopped up logo

Note on False positives

We noticed that many images labeled as duplicates by SNIP but not by the raw features are in fact near duplicates, for example:

Chess1 Chess2

You may check a list of (randomly sampled) detected duplicate pairs here.

Semantic Search

SNIP can also be used for semantic search. At just 25GB, it still returns the same k-nearest neighbors as exhaustive search roughly a third of the time, over 2.15B database vectors.
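One plausible way to measure agreement with exhaustive search is sketched below: compute the exact nearest neighbors by brute force, then count how many of them an approximate result recovers. The text above does not define its metric precisely, so this is an assumption; the vectors, the inner-product similarity, and the `approx` list are toy placeholders, and no SNIP API is used.

```python
def exhaustive_knn(query, database, k):
    """Indices of the k database vectors with the highest inner product."""
    scores = [sum(q * d for q, d in zip(query, vec)) for vec in database]
    # Sort indices by descending score and keep the top k.
    return sorted(range(len(database)), key=lambda i: -scores[i])[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of exact neighbors that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy 2-D vectors standing in for CLIP features (hypothetical data).
database = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]

exact = exhaustive_knn(query, database, k=2)
approx = [0, 2]  # placeholder for what a compact index might return

print(exact, recall_at_k(approx, exact))
```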

Contribute

This python project uses the hatch project manager. Dependencies are specified inside the pyproject.toml file, and build configs inside the hatch.toml file. As such you can enter the isolated development environment with hatch shell from inside the repository.

To avoid silly mistakes, the code is checked with pyright. To ensure a consistent styling, all python code is formatted with black and we use the ruff linter. Once you have installed them, you can check that the code is consistent with:

hatch run check  # check for mistakes via static analysis
hatch run format # check formatting of all python files
hatch run lint   # check linting rules

TODO: check pyright, formatting and linter in CI

  • [ ] CI
  • [ ] check max file size on CI to prevent pushing data
  • [ ] add docs. numpy docstring standard https://numpydoc.readthedocs.io/en/latest/format.html
  • [ ] auto publish github action. example at https://github.com/ofek/hatch-showcase/blob/master/.github/workflows/build.yml
  • [ ] add tests?
