
CLI-first pipeline for discovering and cataloging coloring-style SVG and PNG assets.

Image-Scrapling

Public repository for discovering, evaluating, converting, and cataloging coloring-style image assets.

Repository name:

  • Image-Scrapling

Published package and import names:

  • package: svg-scrapling
  • import: svg_scrapling

Bootstrap

Run the standard project commands from the repository root:

uv sync --group dev
uv run ruff check .
uv run ruff format --check .
uv run mypy src apps
uv run pytest

First Real CLI Run

The CLI assembles a default runtime on its own, so no manual dependency wiring in code is needed.

Install dependencies first:

uv sync --group dev

If you want PNG-to-SVG conversion through VTracer, install the optional conversion extra:

uv sync --group dev --extra conversion

See the available commands:

uv run assets --help

Start with a real search run:

uv run assets find \
  --query "tygryski do kolorowania" \
  --count 10 \
  --preferred-format svg \
  --fallback-format png \
  --convert-to svg \
  --mode provenance_only \
  --provider duckduckgo_html \
  --run-id demo-run \
  --output ./data/runs

This writes a deterministic run directory under ./data/runs/demo-run.

Useful operational flags:

  • --provider duckduckgo_html or --provider bing_html selects the preferred live discovery provider.
  • Non-disabled providers are tried in order as fallbacks after the preferred provider.
  • --disable-provider ... explicitly blocks a provider for one run.
  • --run-id ... resumes or reuses a stable run directory.
  • --skip-existing-downloads is enabled by default and reuses deterministic asset paths when possible.
  • --fetch-strategy static_first|dynamic_on_failure|dynamic_only controls fetch escalation.
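
The provider flags above imply a simple ordering rule: the preferred provider goes first, the remaining providers follow as fallbacks, and anything passed to --disable-provider is dropped. The following is a minimal sketch of that rule, not the package's actual implementation. The function name provider_order and the KNOWN_PROVIDERS tuple (including its order) are assumptions made for illustration.

```python
from typing import Iterable, Sequence

# Provider names come from the README; the order of this tuple is an assumption.
KNOWN_PROVIDERS = ("duckduckgo_html", "bing_html")


def provider_order(
    preferred: str,
    disabled: Iterable[str] = (),
    known: Sequence[str] = KNOWN_PROVIDERS,
) -> list:
    """Return providers to try: preferred first, then the rest, minus disabled ones."""
    blocked = set(disabled)
    # Preferred provider first, remaining known providers keep their relative order.
    ordered = [preferred, *(name for name in known if name != preferred)]
    return [name for name in ordered if name not in blocked]


print(provider_order("bing_html"))
# ['bing_html', 'duckduckgo_html']
print(provider_order("bing_html", disabled=["duckduckgo_html"]))
# ['bing_html']
```

The second call mirrors the Bing example in this section, where DuckDuckGo is disabled for one run.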

Example with Bing as the preferred provider:

uv run assets find \
  --query "tiger coloring page" \
  --count 5 \
  --preferred-format png \
  --provider bing_html \
  --disable-provider duckduckgo_html \
  --output ./data/runs/live-bing

Example resume run:

uv run assets find \
  --query "tiger coloring page" \
  --count 5 \
  --preferred-format png \
  --run-id demo-run \
  --output ./data/runs

Inspect a manifest after the run:

uv run assets inspect-manifest ./data/runs/demo-run/manifests/manifest.jsonl

Export CSV and Markdown reports:

uv run assets export-report \
  ./data/runs/demo-run/manifests/manifest.jsonl \
  --csv-output ./data/runs/demo-run/manifests/report.csv \
  --markdown-output ./data/runs/demo-run/manifests/report.md

Successful runs now write:

  • manifests/manifest.jsonl as the canonical machine-readable output
  • manifests/summary.json and manifests/summary.txt for operator-facing summaries
  • manifests/rejected_candidates.jsonl for fetch, extraction, policy, and download rejections
  • logs/pipeline.log for stage-level execution details
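
Because manifest.jsonl and rejected_candidates.jsonl are JSON Lines files, they can also be read from plain Python without the CLI. The sketch below assumes only the JSONL format (one JSON object per line) and makes no claim about the fields inside each record. The helper name read_jsonl is hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Path taken from the demo run above; nothing is printed if the run does not exist.
manifest = Path("./data/runs/demo-run/manifests/manifest.jsonl")
if manifest.exists():
    entries = read_jsonl(manifest)
    print(f"{len(entries)} manifest records")
```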

Dynamic Fetching

Static fetching is the default and should be preferred for normal runs.

If you have a Lightpanda-compatible wrapper, expose it through:

export SVG_SCRAPLING_LIGHTPANDA_CMD="/path/to/lightpanda-wrapper"

The wrapper must support:

<wrapper> fetch <url> <timeout_seconds>

and print JSON to stdout:

{"html":"<html>...</html>","final_url":"https://example.com/final"}

Then you can enable dynamic fallback:

uv run assets find \
  --query "tiger coloring page" \
  --count 5 \
  --fetch-strategy dynamic_on_failure \
  --output ./data/runs/dynamic-demo
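
Before enabling dynamic fallback, it can help to check that a wrapper honours the contract described above: it accepts fetch <url> <timeout_seconds> and prints a JSON object with html and final_url. The following sketch calls the wrapper through subprocess and validates the response. The function name fetch_via_wrapper, the 5-second grace period, and the error handling are assumptions for illustration and do not describe how svg_scrapling invokes the wrapper internally.

```python
import json
import os
import shlex
import subprocess


def fetch_via_wrapper(url, timeout_seconds=30, command=None):
    """Run `<wrapper> fetch <url> <timeout_seconds>` and return the parsed JSON."""
    command = command or os.environ.get("SVG_SCRAPLING_LIGHTPANDA_CMD")
    if not command:
        raise RuntimeError("SVG_SCRAPLING_LIGHTPANDA_CMD is not set")

    # Split the configured command so a wrapper with its own arguments still works.
    argv = [*shlex.split(command), "fetch", url, str(timeout_seconds)]
    completed = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit raises CalledProcessError
        timeout=timeout_seconds + 5,  # assumed grace period on top of the wrapper's own timeout
    )

    payload = json.loads(completed.stdout)
    missing = {"html", "final_url"} - set(payload)
    if missing:
        raise ValueError(f"wrapper response is missing keys: {sorted(missing)}")
    return payload
```

A call such as fetch_via_wrapper("https://example.com", 10) should return a dictionary with both keys when the wrapper is configured correctly.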

Library Usage

Supported library entrypoints are exposed from stable package surfaces:

  • svg_scrapling
  • svg_scrapling.config
  • svg_scrapling.pipeline
  • svg_scrapling.runtime

Example:

from svg_scrapling import (
    FetchStrategy,
    FindAssetsConfig,
    LicenseMode,
    OutputFormat,
    build_default_pipeline_dependencies,
    run_find_assets,
)

config = FindAssetsConfig(
    query="tiger coloring page",
    count=5,
    preferred_format=OutputFormat.SVG,
    fallback_format=OutputFormat.PNG,
    mode=LicenseMode.PROVENANCE_ONLY,
    fetch_strategy=FetchStrategy.STATIC_FIRST,
)

result = run_find_assets(
    config,
    dependencies=build_default_pipeline_dependencies(config),
)

print(result.manifest_path)

Deep internal module imports outside those entrypoints should be treated as unstable.

Versioning

  • distribution version comes from project.version in pyproject.toml
  • runtime version is read from the installed package metadata
  • release tags should use the format vX.Y.Z
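
The versioning rules above can be checked with the standard library alone. This is a sketch under stated assumptions: importlib.metadata is the standard way to read installed package metadata, but the helper names installed_version and is_release_tag are hypothetical and not part of svg_scrapling.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# vX.Y.Z with numeric components, as required for release tags.
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def installed_version(distribution="svg-scrapling"):
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def is_release_tag(tag):
    """Check that a tag matches the vX.Y.Z format exactly."""
    return TAG_PATTERN.fullmatch(tag) is not None


print(installed_version())
print(is_release_tag("v0.1.0"))
```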

Current Limitations

  • Live discovery currently uses duckduckgo_html and bing_html.
  • Static asset downloading now uses conservative provenance-aware request headers, but some hosts may still block media retrieval.
  • Dynamic fetching still fails loudly when no Lightpanda-compatible client is configured.
  • License handling stays conservative: licensed_only requires an explicit allowlist and provenance_only preserves uncertain cases rather than silently allowing reuse.
  • The VTracer conversion backend is currently supported on Python >=3.10,<3.14.
  • Raster-to-SVG conversion is optional and requires installing the conversion extra.
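
The Python range quoted for the VTracer backend (>=3.10,<3.14) can be tested up front before requesting conversion. The sketch below only compares interpreter versions against that range; the function name vtracer_supported is an assumption, and it does not check whether the conversion extra is actually installed.

```python
import sys


def vtracer_supported(version_info=None):
    """Return True when the interpreter falls inside >=3.10,<3.14."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return (3, 10) <= major_minor < (3, 14)


if not vtracer_supported():
    print("VTracer conversion is not supported on this Python version")
```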

Current Runtime Note

The Python 3.14 compatibility follow-up is tracked in GitHub issue #20.

Reproduction details for the current Python 3.14 blocker live in docs/vtracer-python-314.md.

Developer Workflow

Repository workflow, validation expectations, and run output conventions are documented in docs/developer-workflow.md.
