prompt2dataset

Build labeled image datasets from a plain-English prompt.

$ cd my-dataset
$ p2d add
What image dataset do you want to build? > bird species native to the Pacific Northwest

prompt2dataset resolves your description into subjects via Claude, fetches images from one or more sources, deduplicates, downloads, and writes a manifest.

Installation

pip install prompt2dataset

The p2d add, p2d review, and p2d info commands work with the base install. Training requires PyTorch; install the CPU or CUDA extra depending on your hardware:

pip install "prompt2dataset[train]"       # CPU
pip install "prompt2dataset[train-cuda]"  # CUDA (installs matching torch/torchvision)

Setup

prompt2dataset needs an Anthropic API key. On first run it prompts for the key and saves it to a local .env file. You can also create the file yourself:

# .env
ANTHROPIC_API_KEY=sk-ant-...
P2D_CONTACT=you@example.com   # included in API request headers per Wikimedia's policy
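
As a rough illustration of how these two settings might be consumed, the sketch below reads them from an environment mapping and builds a contact-bearing User-Agent header. The function names and the exact header format are assumptions made for this example; the README does not say which header prompt2dataset actually sets.

```python
import os


def build_user_agent(contact, tool="prompt2dataset", version="0.1.0"):
    """Format a User-Agent string that carries a contact address (assumed format)."""
    return f"{tool}/{version} ({contact})"


def load_settings(env=os.environ):
    """Return (api_key, headers) from an environment-like mapping.

    Hypothetical helper: only the variable names ANTHROPIC_API_KEY and
    P2D_CONTACT come from the README.
    """
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    contact = env.get("P2D_CONTACT")
    # Only attach the header when a contact address was provided.
    headers = {"User-Agent": build_user_agent(contact)} if contact else {}
    return api_key, headers
```

Passing a plain dict instead of os.environ makes the helper easy to try out without touching real credentials.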

Usage

All commands operate on the current directory.

p2d add

Prompts for a dataset description, resolves subjects, and downloads images. Run it again in the same directory to fetch additional subjects without re-downloading what's already there.

$ mkdir pacific-northwest-birds && cd pacific-northwest-birds
$ p2d add
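
One way a repeat run can avoid re-downloading is to name each file after a hash of its content and skip any name that already exists. The sketch below assumes the 12-character suffix in file names such as american-robin_a3f1c8d2e9b4.jpg is a truncated SHA-256 of the image bytes; that is a guess for illustration, and save_if_new is a hypothetical helper, not part of prompt2dataset.

```python
import hashlib
from pathlib import Path


def target_path(dataset_dir, subject_slug, image_bytes, ext=".jpg"):
    """Build <dataset>/<subject>/<subject>_<12 hex chars><ext> from the image content."""
    # Assumption: the suffix is the first 12 hex digits of a SHA-256 digest.
    digest = hashlib.sha256(image_bytes).hexdigest()[:12]
    return Path(dataset_dir) / subject_slug / f"{subject_slug}_{digest}{ext}"


def save_if_new(dataset_dir, subject_slug, image_bytes):
    """Write the image unless a file with the same content hash already exists.

    Returns the written path, or None when the image was a duplicate.
    """
    path = target_path(dataset_dir, subject_slug, image_bytes)
    if path.exists():
        return None
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(image_bytes)
    return path
```

With this scheme, identical bytes always map to the same file name, so deduplication and resume-after-rerun fall out of a single existence check.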

p2d review

Step through downloaded images and mark them valid or delete them.

$ p2d review
$ p2d review --misclassified   # only images that a trained model got wrong

Keys: A accept, D delete, S skip, Q quit.

p2d info

Print dataset statistics and the subject list.

p2d train

Fine-tune a pretrained image classifier on the dataset. Uses torch-lr-finder to pick a learning rate automatically, then trains for the number of epochs set by --epochs and exports a TorchScript model.

$ p2d train
$ p2d train --model resnet50 --epochs 10

Options: --epochs, --val-split, --img-size, --model (mobilenet_v2, resnet18, resnet50).
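
The exported model can be loaded outside of p2d. The following is a minimal sketch, assuming that labels.json is a JSON array of class names, that the model returns one logit per class, and that inputs are resized to a square and normalized with the usual ImageNet statistics. None of these details are confirmed by this README, and load_labels, top_prediction, and classify are hypothetical names.

```python
import json
from pathlib import Path


def load_labels(dataset_dir):
    """Read class names; assumes labels.json is a JSON array of strings."""
    path = Path(dataset_dir) / ".p2d" / "labels.json"
    return json.loads(path.read_text(encoding="utf-8"))


def top_prediction(scores, labels):
    """Return the (label, score) pair with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], scores[best]


def classify(dataset_dir, image_path, img_size=224):
    """Run the TorchScript model on one image and return (label, probability)."""
    # Heavy imports are local so the helpers above work without PyTorch.
    import torch
    from PIL import Image
    from torchvision import transforms

    # Assumed preprocessing: square resize plus ImageNet normalization.
    preprocess = transforms.Compose([
        transforms.Resize((img_size, img_size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    model_path = Path(dataset_dir) / ".p2d" / "model.pt"
    model = torch.jit.load(str(model_path), map_location="cpu")
    model.eval()

    batch = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        # Assumes the model emits raw logits of shape (1, num_classes).
        probs = torch.softmax(model(batch), dim=1)[0].tolist()
    return top_prediction(probs, load_labels(dataset_dir))
```

If the training pipeline uses a different input size or normalization, the preprocessing here would need to match it for the predictions to be meaningful.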

Data sources

Source             Best for
DuckDuckGo         Broad or niche subjects, recent events, pop culture
Wikimedia Commons  Well-documented subjects with Wikipedia articles
iNaturalist        Animals, plants, fungi - research-grade, taxonomy-tagged
Openverse          General subjects, scenes, cultural content

None require an API key. Sources are selected interactively when you run p2d add.
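
To make "research-grade, taxonomy-tagged" concrete, here is a small sketch that builds a query URL for the public iNaturalist v1 observations endpoint, restricted to research-grade observations that have photos. It only illustrates the kind of keyless request such a source allows; the requests prompt2dataset actually sends are not documented here, and inat_query_url is a hypothetical helper.

```python
from urllib.parse import urlencode

INAT_OBSERVATIONS = "https://api.inaturalist.org/v1/observations"


def inat_query_url(taxon_name, per_page=30):
    """Build a URL for research-grade observations of a taxon that include photos."""
    params = {
        "taxon_name": taxon_name,      # scientific or common name of the taxon
        "quality_grade": "research",   # community-verified identifications only
        "photos": "true",              # skip observations without images
        "per_page": per_page,
    }
    return f"{INAT_OBSERVATIONS}?{urlencode(params)}"


print(inat_query_url("Turdus migratorius"))
```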

Output layout

my-dataset/
  american-robin/
    american-robin_a3f1c8d2e9b4.jpg
    ...
  stellers-jay/
    ...
  .p2d/
    manifest.json       dataset metadata and item list
    labels.csv          filename, subject, source
    subjects.json       resolved subject list (cached)
    model.pt            TorchScript model (after p2d train)
    labels.json         class names in output order
    report.json         per-class precision/recall/F1
    misclassified.json  validation images the model got wrong

manifest.json is the authoritative record. Everything else in .p2d/ is generated and can be reconstructed.
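
Because labels.csv is a flat file, it is easy to inspect with the standard library. The sketch below counts images per subject, assuming the file starts with a header row naming the columns filename, subject, and source. The column names come from the layout above, but the header row is an assumption, and count_by_subject is a hypothetical helper.

```python
import csv
from collections import Counter


def count_by_subject(labels_csv):
    """Count rows per subject in a labels.csv with a filename,subject,source header."""
    with open(labels_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # uses the first row as column names
        return Counter(row["subject"] for row in reader)
```

For example, count_by_subject(".p2d/labels.csv").most_common() would list subjects from most to fewest images, which is a quick way to spot class imbalance before training.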

Global flag

--debug enables verbose logging for all commands:

p2d --debug add

License

MIT
