datadiligence

Respect generative AI opt-outs in your ML training pipeline.

datadiligence gives ML practitioners a consistent interface for checking whether a given work has been opted out of training by any known method. The goal of this project is to make respecting opt-outs as painless as possible while remaining flexible enough to support new opt-out methods as they are developed.

Why is this needed?

ML training datasets are often harvested without consent from the data or content owners, so models trained on these datasets may violate content creators' wishes about how their content is used. In the absence of an opt-out standard, many platforms and individuals have devised their own methods of stating their consent.

Additionally, consent can change over time, while a static dataset cannot. A work whose owner consented when the dataset was created may have been opted out by the time of training. Keeping up with the current state of opt-outs is unrealistic for most practitioners, so this project aims to make it as easy as possible to respect opt-outs in your training pipeline.

Basic Usage

To install:

pip install datadiligence

Add bulk pre-processing for URLs in your pipeline (requires a Spawning API key):

>>> import datadiligence as dd
>>> urls = ["https://www.example.com/art-123456789.jpg", "https://www.example.com/art-987654321.jpg"]
>>> dd.filter_allowed(urls=urls)
['https://www.example.com/art-123456789.jpg']
>>> dd.is_allowed(urls=urls)
[True, False]
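
For URL lists too large to send in one call, the bulk check above can be applied in chunks. The following is a minimal sketch and not part of datadiligence: `filter_in_batches`, `check`, and `batch_size` are hypothetical names, and `check` is assumed to be any callable that, like `dd.is_allowed(urls=...)` in the example above, returns one boolean per URL in input order.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def filter_in_batches(
    urls: Iterable[str],
    check: Callable[[List[str]], List[bool]],
    batch_size: int = 1000,
) -> Iterator[str]:
    """Yield only the URLs that `check` reports as allowed, one batch at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(urls)
    while True:
        # Pull the next chunk lazily so the full URL list never has to fit in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        flags = check(batch)
        if len(flags) != len(batch):
            raise ValueError("check must return one result per URL")
        for url, allowed in zip(batch, flags):
            if allowed:
                yield url
```

With datadiligence installed and an API key configured, `check` could be something like `lambda batch: dd.is_allowed(urls=batch)`; a suitable batch size depends on your pipeline and is left as a parameter here.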

Check HTTP responses in post-processing:

>>> import requests
>>> response = requests.get("https://www.example.com/art-123456789.jpg")
>>> is_allowed = dd.is_allowed(response=response)
>>> is_allowed
True
>>> if is_allowed:
...     process_image(response.content)  # process_image is your own handler
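
To make the idea of a response-level check more concrete, here is a minimal sketch of one kind of signal such a check could inspect: an `X-Robots-Tag` response header carrying a `noai` directive. This is an illustration under assumptions, not datadiligence's implementation. The function name `header_allows_ai` and the set of disallowed directives are hypothetical, and the library may consult other signals as well.

```python
from typing import Mapping

# Directive names assumed for illustration; a real check may recognise others.
DISALLOWED_DIRECTIVES = frozenset({"noai", "noimageai"})


def header_allows_ai(headers: Mapping[str, str]) -> bool:
    """Return False if an X-Robots-Tag header contains a disallowed directive."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lower case.
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & DISALLOWED_DIRECTIVES:
            return False
    return True
```

Because `response.headers` from `requests` behaves like a mapping, this sketch could be called as `header_allows_ai(response.headers)`.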

Check a local file:

>>> dd.is_allowed(path="path/to/file.jpg")
False

Full documentation is available on Read the Docs.

Respected Opt-Out Methods

This project currently supports the following opt-out methods:

Contributing

See contribution guidelines here.

Download files

Source distribution: datadiligence-0.1.7.tar.gz (142.5 kB)

Built distribution: datadiligence-0.1.7-py3-none-any.whl (16.5 kB, Python 3)

File details

Details for the file datadiligence-0.1.7.tar.gz.

File metadata

  • Size: 142.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for datadiligence-0.1.7.tar.gz

Algorithm    Hash digest
SHA256       10c8311752794905eca4133724bec48aad08e6d9da59209a59bc55e7cd9dfe0b
MD5          42bd4cfb4334ad6a0c8037f6605f2d53
BLAKE2b-256  a70364165853e04c0e655a32466cea3415df4674bd938511ccd914d1be4e4a7a
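
A downloaded archive can be checked against the SHA256 digest listed above before it is installed. The following is a minimal sketch using only the Python standard library; the function name `sha256_matches` and its `chunk_size` parameter are hypothetical, and the local path you pass in is whatever location you saved the file to.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # Normalise the expected value, then compare the two hex strings.
    return hmac.compare_digest(digest.hexdigest(), expected_hex.strip().lower())
```

For example, `sha256_matches(Path("datadiligence-0.1.7.tar.gz"), "10c8311752794905eca4133724bec48aad08e6d9da59209a59bc55e7cd9dfe0b")` should return True for an unmodified copy of the source distribution, assuming the file was saved under that name in the current directory.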

File details

Details for the file datadiligence-0.1.7-py3-none-any.whl.

File hashes

Hashes for datadiligence-0.1.7-py3-none-any.whl

Algorithm    Hash digest
SHA256       7b8f438f98b1e208ada25cab252bae0a5c90cfb6526df91c91c89efa25338c55
MD5          6bd2de3e22c62eb17461086244aa5ddf
BLAKE2b-256  3e93578e43b409c95196a066300f15aa13d5014c0a2e43b824dad2c5ea98103a