datadiligence
Respect generative AI opt-outs in your ML training pipeline.
datadiligence aims to make it simple for ML practitioners to respect opt-outs in their training by providing a consistent interface to check whether a given work is opted out by any known method. The goal of this project is to make respecting opt-outs as painless as possible while remaining flexible enough to support new opt-out methods as they are developed.
Why is this needed?
ML training datasets are often harvested without consent from the data or content owners, meaning any ML models trained on these datasets could be violating the wishes of content creators on how their content is used. In the absence of an opt-out standard, many platforms and individuals have come up with their own methods of stating their consent.
Additionally, consent can change over time, while a static dataset cannot. A work whose owner consented when the dataset was created may have been opted out by the time of training. Keeping up with the current state of opt-outs is unrealistic for most practitioners, so this project aims to make it as easy as possible to respect opt-outs in your training pipeline.
Basic Usage
To install:
pip install datadiligence
Add bulk pre-processing for URLs in your pipeline (requires Spawning API Key):
>>> import datadiligence as dd
>>> urls = ["https://www.example.com/art-123456789.jpg", "https://www.example.com/art-987654321.jpg"]
>>> dd.filter_allowed(urls=urls)
["https://www.example.com/art-123456789.jpg"]
>>> dd.is_allowed(urls=urls)
[True, False]
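As the example output suggests, `filter_allowed` keeps the URLs for which `is_allowed` reports `True`. For large URL lists it can help to run the check in batches. The following is a minimal sketch of that idea, not part of datadiligence: `batched`, `filter_in_batches`, `batch_size`, and the `check` parameter are hypothetical names introduced for illustration. In a real pipeline, `check` might be something like `lambda batch: dd.is_allowed(urls=batch)`; here a stub stands in so the example runs without an API key or network access.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def filter_in_batches(
    urls: Sequence[str],
    check: Callable[[Sequence[str]], List[bool]],
    batch_size: int = 1000,
) -> List[str]:
    """Keep only the URLs that check() marks as allowed, preserving order."""
    allowed: List[str] = []
    for batch in batched(urls, batch_size):
        flags = check(batch)
        if len(flags) != len(batch):
            raise ValueError("check() must return one flag per URL")
        allowed.extend(url for url, ok in zip(batch, flags) if ok)
    return allowed


def stub_check(batch: Sequence[str]) -> List[bool]:
    """Stand-in checker (assumption): treats one hard-coded URL as opted out."""
    opted_out = {"https://www.example.com/art-987654321.jpg"}
    return [url not in opted_out for url in batch]


urls = [
    "https://www.example.com/art-123456789.jpg",
    "https://www.example.com/art-987654321.jpg",
]
print(filter_in_batches(urls, stub_check, batch_size=1))
# ['https://www.example.com/art-123456789.jpg']
```

Passing the checker in as a callable keeps the batching logic testable without credentials and lets the same loop be reused if the opt-out check changes.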
Check HTTP responses in post-processing:
>>> import requests
>>> response = requests.get("https://www.example.com/art-123456789.jpg")
>>> is_allowed = dd.is_allowed(response=response)
>>> is_allowed
True
>>> if is_allowed:
...     process_image(response.content)
Full documentation is available on readthedocs.
Check a local file:
>>> dd.is_allowed(path="path/to/file.jpg")
False
Respected Opt-Out Methods
This project currently supports the following opt-out methods:
The Spawning API. See https://spawning.ai/api for more information.
The DeviantArt X-Robots-Tag HTTP Headers. See https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371 for more information.
C2PA/CAI metadata. See https://c2pa.org/ for more information.
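To illustrate what a header-based opt-out check involves, here is a minimal sketch that inspects an `X-Robots-Tag` header for the `noai` and `noimageai` directives announced by DeviantArt. The function name `header_allows_ai` and the parsing rules are assumptions made for illustration. This is not datadiligence's implementation, which may handle more cases (for example, directives scoped to a specific user agent).

```python
from typing import Mapping

# Directives treated as opt-outs in this sketch (assumed set, based on the
# DeviantArt announcement linked above).
DISALLOWED_DIRECTIVES = {"noai", "noimageai"}


def header_allows_ai(headers: Mapping[str, str]) -> bool:
    """Return False if an X-Robots-Tag header carries an AI opt-out directive."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("x-robots-tag")
    if not value:
        # No header present: nothing in this sketch signals an opt-out.
        return True
    # Directives are comma-separated; compare them case-insensitively.
    directives = {part.strip().lower() for part in value.split(",")}
    return directives.isdisjoint(DISALLOWED_DIRECTIVES)


print(header_allows_ai({"X-Robots-Tag": "noai, noimageai"}))  # False
print(header_allows_ai({"Content-Type": "image/jpeg"}))       # True
```

Because the function only needs a mapping of header names to values, the `headers` attribute of a `requests` response could be passed to it directly.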
Contributing
See contribution guidelines here.
File details
Details for the file datadiligence-0.1.7.tar.gz.
File metadata
- Download URL: datadiligence-0.1.7.tar.gz
- Upload date:
- Size: 142.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 10c8311752794905eca4133724bec48aad08e6d9da59209a59bc55e7cd9dfe0b
MD5 | 42bd4cfb4334ad6a0c8037f6605f2d53
BLAKE2b-256 | a70364165853e04c0e655a32466cea3415df4674bd938511ccd914d1be4e4a7a
File details
Details for the file datadiligence-0.1.7-py3-none-any.whl.
File metadata
- Download URL: datadiligence-0.1.7-py3-none-any.whl
- Upload date:
- Size: 16.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7b8f438f98b1e208ada25cab252bae0a5c90cfb6526df91c91c89efa25338c55
MD5 | 6bd2de3e22c62eb17461086244aa5ddf
BLAKE2b-256 | 3e93578e43b409c95196a066300f15aa13d5014c0a2e43b824dad2c5ea98103a