Assess robots.txt constraints and discover extractable public page signals.

scrapp-taxonomy

Evaluate a URL before scraping it. The package checks robots.txt for permission and then reports what kinds of data are likely extractable from the page — headings, links, images, feeds, article URLs, JSON-LD types, forms — all with counts and samples.

Zero external dependencies. Pure Python ≥ 3.11.

Get Started & Documentation


Install	uv, pip, or run without installing via `uvx`
CLI quickstart	`assess` command, flags, and JSON output
Library API	`build_service()`, formatters, custom signals
Docker	Dev build and production hardening flags
Architecture	Layer diagram and extension points
Development	Local setup, linting, tests, coverage
CI/CD pipeline	What runs on each push and on version tags
Releasing	How to cut a release and publish to PyPI

Examples:

# assess a news site and get a text report
scrapp-taxonomy assess https://cnnespanol.cnn.com/colombia

# pipe the JSON output into jq to extract a specific signal count
scrapp-taxonomy assess https://www.eltiempo.com --output json \
  | jq '.page_taxonomy.candidates[] | select(.kind == "article_links") | .count'

# run without installing anything
uvx scrapp-taxonomy assess https://www.bbc.com/mundo --output json

# use the library in your own scraper
python - <<'EOF'
from scrapp_taxonomy import build_service
result = build_service().assess("https://cnnespanol.cnn.com/colombia")
print(result.robots_policy.target_allowed)
print([c.kind for c in result.page_taxonomy.candidates])
EOF

Install

# uv (recommended)
uv add scrapp-taxonomy

# pip
pip install scrapp-taxonomy

# try it without installing
uvx scrapp-taxonomy assess https://cnnespanol.cnn.com/colombia

CLI

Basic usage

scrapp-taxonomy assess https://cnnespanol.cnn.com/colombia

Output:

Target: https://cnnespanol.cnn.com/colombia
Robots: https://cnnespanol.cnn.com/robots.txt (found)
Allowed for scrapp-taxonomy/0.1: yes
Sitemaps:
  - https://cnnespanol.cnn.com/sitemap/index.xml
  - https://cnnespanol.cnn.com/sitemap/news.xml
Page fetch: fetched
Title: Noticias de Colombia hoy: política, elecciones, economía y más | CNN
Language: es
Extractable signals:
  - Headings: 5
  - Links: 285
  - Images: 45
  - JSON-LD structured data: 1
  - Article-like links: 85
Recommendations:
  - The target URL is fetchable for the configured user agent.
  - Use sitemap URLs as the first source for crawl discovery.
  - Initial extractable signals found: Headings, Links, Images, ...
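
The robots portion of such a report (permission for a user agent plus declared sitemaps) can be approximated with the standard library alone. The sketch below uses `urllib.robotparser` on an in-memory robots.txt body; it is an illustration of the technique, not the package's own reader, and `check_robots` and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, url: str, user_agent: str) -> tuple[bool, list[str]]:
    """Return (allowed, sitemaps) for a URL given the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # site_maps() returns None when the file declares no sitemaps.
    sitemaps = parser.site_maps() or []
    return allowed, sitemaps


# Made-up rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

if __name__ == "__main__":
    print(check_robots(SAMPLE_ROBOTS, "https://example.com/news", "scrapp-taxonomy/0.1"))
    print(check_robots(SAMPLE_ROBOTS, "https://example.com/private/x", "scrapp-taxonomy/0.1"))
```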

Flags

Flag	Default	Description
`--output`	`text`	`text` or `json`
`--user-agent`	`scrapp-taxonomy/0.1`	HTTP User-Agent string
`--timeout`	`15.0`	Request timeout in seconds
`--log-level`	`WARNING`	`DEBUG`, `INFO`, `WARNING`, `ERROR`

JSON output

scrapp-taxonomy assess https://www.bbc.com/mundo --output json

The JSON structure matches the ScrapeAssessment dataclass exactly and is pipe-friendly:

scrapp-taxonomy assess https://www.eltiempo.com --output json \
  | jq '.page_taxonomy.candidates[] | select(.kind == "article_links") | .count'

Library

Quick start

from scrapp_taxonomy import build_service

service = build_service(user_agent="mybot/1.0", timeout_seconds=10.0)
result = service.assess("https://cnnespanol.cnn.com/colombia")

print(result.robots_policy.target_allowed)  # True when robots.txt allows the target
print(result.page_taxonomy.title)           # page title
print(result.page_taxonomy.candidates)      # signal categories with counts
print(result.recommendations)               # prioritised action list

Formatters

Both output formats are available programmatically:

from scrapp_taxonomy import build_service, TextFormatter, JsonFormatter

service = build_service()
result = service.assess("https://cnnespanol.cnn.com/colombia")

print(TextFormatter().format(result))   # same as CLI text output
print(JsonFormatter(indent=2).format(result))  # same as CLI --output json

Custom signal extractors

Signal categories are injected — you can add new ones or remove defaults without touching the package source:

from scrapp_taxonomy import build_service
from scrapp_taxonomy.services.html_analyzer import (
    DEFAULT_SIGNALS,
    SignalSpec,
    StandardHtmlAnalyzer,
    _ParseResult,
)
from scrapp_taxonomy.services.assessment import ScrapeAssessmentService
from scrapp_taxonomy.infrastructure.http import (
    HttpRobotsGateway, HttpPageGateway, UrlLibHttpClient,
)
from scrapp_taxonomy.services.robots import StandardRobotsPolicyReader

# Add a custom signal: flag links that point to YouTube or Vimeo
video_signal = SignalSpec(
    kind="video_embeds",
    label="Video embeds",
    extract=lambda r: [lnk for lnk in r.links if "youtube" in lnk or "vimeo" in lnk],
)

client = UrlLibHttpClient()
service = ScrapeAssessmentService(
    robots_gateway=HttpRobotsGateway(client),
    page_gateway=HttpPageGateway(client),
    robots_reader=StandardRobotsPolicyReader(),
    analyzer=StandardHtmlAnalyzer(signals=(*DEFAULT_SIGNALS, video_signal)),
)

result = service.assess("https://example.com")
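
A substring test such as `"youtube" in lnk` also matches unrelated URLs that merely mention the word in their path. A stricter predicate can compare hostnames instead. The sketch below assumes, as the lambda above implies, that the links are plain URL strings; `VIDEO_HOSTS`, `is_video_link`, and `extract_video_links` are hypothetical names.

```python
from urllib.parse import urlparse

# Hosts treated as video providers in this sketch; extend as needed.
VIDEO_HOSTS = ("youtube.com", "youtu.be", "vimeo.com")


def is_video_link(link: str) -> bool:
    """True when the link's hostname is a known video host or one of its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    return any(host == known or host.endswith("." + known) for known in VIDEO_HOSTS)


def extract_video_links(links: list[str]) -> list[str]:
    """Keep only the links that point at a video host."""
    return [link for link in links if is_video_link(link)]


if __name__ == "__main__":
    sample = [
        "https://www.youtube.com/embed/abc",
        "https://example.com/blog/youtube-tips",
        "https://player.vimeo.com/video/1",
        "/about",
    ]
    print(extract_video_links(sample))
```

If the assumption about link strings holds, the function could replace the lambda body, for example `extract=lambda r: extract_video_links(r.links)`.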

Logging

The package uses logging.getLogger(__name__) throughout. To see internal activity:

import logging
logging.basicConfig(level=logging.DEBUG)

from scrapp_taxonomy import build_service
service = build_service()
result = service.assess("https://cnnespanol.cnn.com/colombia")
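
`logging.basicConfig(level=logging.DEBUG)` raises verbosity for every library in the process. Because module loggers created with `logging.getLogger(__name__)` are children of the package logger, debug output can be limited to this package by configuring the `scrapp_taxonomy` logger only. A minimal sketch, with `enable_package_debug` as a hypothetical helper name:

```python
import logging


def enable_package_debug(package: str = "scrapp_taxonomy") -> logging.Logger:
    """Send DEBUG output for one package to stderr without touching the root logger."""
    logger = logging.getLogger(package)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach the handler once, even if the helper is called repeatedly.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    # Avoid duplicate lines if the root logger also has a handler.
    logger.propagate = False
    return logger


if __name__ == "__main__":
    enable_package_debug()
    logging.getLogger("scrapp_taxonomy.demo").debug("visible at DEBUG level")
```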

Docker

Development

docker build -t scrapp-taxonomy:local .
docker run --rm scrapp-taxonomy:local assess https://cnnespanol.cnn.com/colombia
docker run --rm scrapp-taxonomy:local assess https://www.bbc.com/mundo --output json

Production

The image runs as a non-root user (scrapp, uid 1001). For production workloads, add read-only filesystem and privilege restrictions:

docker run --rm \
  --read-only \
  --security-opt=no-new-privileges \
  --cap-drop=ALL \
  ghcr.io/carlosjimenez88m/scrapp_taxonomy:latest \
  assess https://cnnespanol.cnn.com/colombia --output json

Images are published to GHCR automatically by CI on every push to master and on version tags. Available tags:

Tag	When
`latest`	Every push to `master`
`1.2.3`	When tag `v1.2.3` is pushed
`1.2`	When tag `v1.2.3` is pushed (major.minor alias)
`sha-abc1234`	Every commit
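
The tag scheme in the table can be expressed as a small function. This is a sketch of the mapping only, not the CI workflow's actual implementation; `image_tags` is a hypothetical name, and the seven-character short SHA is inferred from the `sha-abc1234` example.

```python
import re

# Matches refs such as refs/tags/v1.2.3 and captures the three version parts.
SEMVER_TAG = re.compile(r"^refs/tags/v(\d+)\.(\d+)\.(\d+)$")


def image_tags(git_ref: str, commit_sha: str) -> list[str]:
    """Return the image tags implied by the table for a given ref and commit."""
    tags = [f"sha-{commit_sha[:7]}"]  # every commit gets a sha- tag
    if git_ref == "refs/heads/master":
        tags.append("latest")
    match = SEMVER_TAG.match(git_ref)
    if match:
        major, minor, patch = match.groups()
        tags += [f"{major}.{minor}.{patch}", f"{major}.{minor}"]
    return tags


if __name__ == "__main__":
    print(image_tags("refs/tags/v1.2.3", "abc1234def5678"))
    print(image_tags("refs/heads/master", "abc1234def5678"))
```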

Architecture

src/scrapp_taxonomy/
├── domain/
│   └── models.py      # Immutable dataclasses and enums — no I/O, no imports
├── ports.py           # Protocol interfaces (RobotsGateway, PageGateway, Formatter…)
├── factory.py         # Single wiring point for the object graph (DI entry)
├── formatters.py      # TextFormatter and JsonFormatter implementations
├── infrastructure/
│   └── http.py        # urllib-based HTTP client and gateway adapters
├── services/
│   ├── assessment.py  # Orchestration: robots check → page fetch → recommendations
│   ├── html_analyzer.py # HTML parsing with injectable SignalSpec list
│   └── robots.py      # robots.txt parsing and policy resolution
└── cli.py             # argparse entry point with --log-level support

Each layer only imports from layers below it. factory.py is the one place that wires everything together; nothing else instantiates concrete classes directly.

To plug in a custom HTTP backend (httpx, requests, async), implement RobotsGateway and PageGateway from ports.py and pass them to ScrapeAssessmentService directly.
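
Protocol-based ports mean a replacement gateway needs no inheritance, only matching methods. The sketch below shows the pattern with a local stand-in protocol. The real method names in ports.py are not shown here, so the `fetch(url) -> str` signature, the local `PageGateway` definition, and `InMemoryPageGateway` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Protocol


class PageGateway(Protocol):
    """Stand-in for the port in ports.py; the real signature may differ."""

    def fetch(self, url: str) -> str: ...


@dataclass
class InMemoryPageGateway:
    """Serves canned HTML, useful for tests that must not touch the network."""

    pages: dict[str, str] = field(default_factory=dict)

    def fetch(self, url: str) -> str:
        return self.pages[url]


def page_length(gateway: PageGateway, url: str) -> int:
    """Accepts any object that structurally satisfies the protocol."""
    return len(gateway.fetch(url))


if __name__ == "__main__":
    fake = InMemoryPageGateway({"https://example.com": "<h1>Hello</h1>"})
    print(page_length(fake, "https://example.com"))
```

An httpx- or requests-based adapter would follow the same shape: a class with the methods the port declares, passed in place of the default gateway.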

Development

uv sync --dev              # install dev dependencies
make check                 # lint + type check + tests
make coverage              # tests with HTML coverage report (opens htmlcov/)
make fmt                   # auto-format with ruff
make build                 # build wheel and sdist in dist/
make docker-build          # build Docker image locally
make docker-run URL=https://cnnespanol.cnn.com/colombia  # run against a URL
make pre-commit-install    # install git hooks

Coverage is measured on every CI run and must stay at or above 80%.

CI/CD pipeline

All delivery steps run in a single workflow (.github/workflows/pipeline.yml):

quality (Python 3.11, 3.12, 3.13)
    lint → type check → test + coverage gate (≥ 80%)
         ↓
    build-dist            docker
    wheel + sdist         multi-arch image (amd64 + arm64)
    smoke tests           push to GHCR on non-PR
         ↓
    publish (v* tags only)
    PyPI via Trusted Publishing

Releasing

Releases are triggered by a version tag:

git tag -a v0.1.1 -m "v0.1.1"
git push origin v0.1.1

The pipeline builds the distributions, smoke-tests them in an isolated environment, and publishes to PyPI using Trusted Publishing — no API token stored in GitHub secrets.

One-time setup before the first release:

GitHub → repo Settings → Environments → New environment → name: pypi
PyPI → your project → Settings → Trusted Publishers → add this repo and workflow

Scope

This tool checks technical signals — it does not substitute for reading a website's terms of service, understanding copyright restrictions, or complying with applicable data-protection regulations. robots.txt is treated as the first boundary for respectful crawling, not the only one.

Project details

Maintainers

carlos_jimenez88

Release history

0.1.1
Jun 24, 2026

0.1.0 (this version)
Jun 24, 2026

Download files

Source Distribution

scrapp_taxonomy-0.1.0.tar.gz (61.0 kB)

Uploaded Jun 24, 2026 Source

Built Distribution

scrapp_taxonomy-0.1.0-py3-none-any.whl (22.7 kB)

Uploaded Jun 24, 2026 Python 3

File details

Details for the file scrapp_taxonomy-0.1.0.tar.gz.

File metadata

Download URL: scrapp_taxonomy-0.1.0.tar.gz
Upload date: Jun 24, 2026
Size: 61.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scrapp_taxonomy-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9014608d98048f2f293d7cf8a68c93d60d13185c7c40ffa9205909a8c125044f`
MD5	`37ddfe565d34658fc69db867f8a249f4`
BLAKE2b-256	`9161819e5a85ab488fe7b27075fa0c34f25072d66d9deff0d3cd11aed9b92221`
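
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` and `matches_digest` are hypothetical helper names, and the local file path is whatever location the archive was saved to.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives are never loaded fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA256 with an expected hex digest, ignoring case."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    expected = "9014608d98048f2f293d7cf8a68c93d60d13185c7c40ffa9205909a8c125044f"
    # Assumes the sdist was downloaded into the current directory.
    print(matches_digest("scrapp_taxonomy-0.1.0.tar.gz", expected))
```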

Provenance

The following attestation bundles were made for scrapp_taxonomy-0.1.0.tar.gz:

Publisher: pipeline.yml on carlosjimenez88M/scrapp_taxonomy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapp_taxonomy-0.1.0.tar.gz
- Subject digest: 9014608d98048f2f293d7cf8a68c93d60d13185c7c40ffa9205909a8c125044f
- Sigstore transparency entry: 1934466079
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: carlosjimenez88M/scrapp_taxonomy@b04be0ff35421413089c5afe6ba6d73e54dabd1a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/carlosjimenez88M
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pipeline.yml@b04be0ff35421413089c5afe6ba6d73e54dabd1a
- Trigger Event: push

File details

Details for the file scrapp_taxonomy-0.1.0-py3-none-any.whl.

File metadata

Download URL: scrapp_taxonomy-0.1.0-py3-none-any.whl
Upload date: Jun 24, 2026
Size: 22.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scrapp_taxonomy-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c26c88d5387d31a0201fc895addc1c9b5f908c1bcea7c3d710104dc06218d2be`
MD5	`6a9c6c69cde08adcbcb4a31e3dd3d1cd`
BLAKE2b-256	`a46426a6090450aa253c70edf81f92da5126153757044e7f322885b2b835dab8`

Provenance

The following attestation bundles were made for scrapp_taxonomy-0.1.0-py3-none-any.whl:

Publisher: pipeline.yml on carlosjimenez88M/scrapp_taxonomy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapp_taxonomy-0.1.0-py3-none-any.whl
- Subject digest: c26c88d5387d31a0201fc895addc1c9b5f908c1bcea7c3d710104dc06218d2be
- Sigstore transparency entry: 1934466090
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: carlosjimenez88M/scrapp_taxonomy@b04be0ff35421413089c5afe6ba6d73e54dabd1a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/carlosjimenez88M
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pipeline.yml@b04be0ff35421413089c5afe6ba6d73e54dabd1a
- Trigger Event: push
