Assess robots.txt constraints and discover extractable public page signals.
scrapp-taxonomy
Evaluate a URL before scraping it. The package checks robots.txt for permission and then reports what kinds of data are likely extractable from the page — headings, links, images, feeds, article URLs, JSON-LD types, forms — all with counts and samples.
Zero external dependencies. Pure Python ≥ 3.11.
Get Started & Documentation
| Install | uv, pip, or run without installing via uvx |
| CLI quickstart | assess command, flags, and JSON output |
| Library API | build_service(), formatters, custom signals |
| Docker | Dev build and production hardening flags |
| Architecture | Layer diagram and extension points |
| Development | Local setup, linting, tests, coverage |
| CI/CD pipeline | What runs on each push and on version tags |
| Releasing | How to cut a release and publish to PyPI |
Examples:
# assess a news site and get a text report
scrapp-taxonomy assess https://cnnespanol.cnn.com/colombia
# pipe the JSON output into jq to extract a specific signal count
scrapp-taxonomy assess https://www.eltiempo.com --output json \
| jq '.page_taxonomy.candidates[] | select(.kind == "article_links") | .count'
# run without installing anything
uvx scrapp-taxonomy assess https://www.bbc.com/mundo --output json
# use the library in your own scraper
python - <<'EOF'
from scrapp_taxonomy import build_service
result = build_service().assess("https://cnnespanol.cnn.com/colombia")
print(result.robots_policy.target_allowed)
print([c.kind for c in result.page_taxonomy.candidates])
EOF
Install
# uv (recommended)
uv add scrapp-taxonomy
# pip
pip install scrapp-taxonomy
# try it without installing
uvx scrapp-taxonomy assess https://cnnespanol.cnn.com/colombia
CLI
Basic usage
scrapp-taxonomy assess https://cnnespanol.cnn.com/colombia
Output:
Target: https://cnnespanol.cnn.com/colombia
Robots: https://cnnespanol.cnn.com/robots.txt (found)
Allowed for scrapp-taxonomy/0.1: yes
Sitemaps:
- https://cnnespanol.cnn.com/sitemap/index.xml
- https://cnnespanol.cnn.com/sitemap/news.xml
Page fetch: fetched
Title: Noticias de Colombia hoy: política, elecciones, economía y más | CNN
Language: es
Extractable signals:
- Headings: 5
- Links: 285
- Images: 45
- JSON-LD structured data: 1
- Article-like links: 85
Recommendations:
- The target URL is fetchable for the configured user agent.
- Use sitemap URLs as the first source for crawl discovery.
- Initial extractable signals found: Headings, Links, Images, ...
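The robots lines of the report above can be approximated with the standard library alone. The sketch below is not the package's implementation; it runs `urllib.robotparser` over an inline, made-up robots.txt (the `example.com` URLs and the rules are assumptions) to show how an allow/deny answer and a sitemap list can be derived for a given user agent.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no network call is needed.
parser.parse(ROBOTS_TXT.splitlines())

agent = "scrapp-taxonomy/0.1"  # default user agent shown in the report above

# No rule matches /news/, so the wildcard group allows it.
allowed = parser.can_fetch(agent, "https://example.com/news/today")
# The path starts with the disallowed prefix /private/.
blocked = parser.can_fetch(agent, "https://example.com/private/report")
# Sitemap entries are collected independently of the user-agent groups.
sitemaps = parser.site_maps()

print(allowed, blocked, sitemaps)
```

A real check would fetch the site's own robots.txt first; parsing inline text keeps the example offline and repeatable.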
Flags
| Flag | Default | Description |
|---|---|---|
| --output | text | text or json |
| --user-agent | scrapp-taxonomy/0.1 | HTTP User-Agent string |
| --timeout | 15.0 | Request timeout in seconds |
| --log-level | WARNING | DEBUG, INFO, WARNING, ERROR |
JSON output
scrapp-taxonomy assess https://www.bbc.com/mundo --output json
The JSON structure matches the ScrapeAssessment dataclass exactly and is pipe-friendly:
scrapp-taxonomy assess https://www.eltiempo.com --output json \
| jq '.page_taxonomy.candidates[] | select(.kind == "article_links") | .count'
Library
Quick start
from scrapp_taxonomy import build_service
service = build_service(user_agent="mybot/1.0", timeout_seconds=10.0)
result = service.assess("https://cnnespanol.cnn.com/colombia")
print(result.robots_policy.target_allowed) # True
print(result.page_taxonomy.title) # page title
print(result.page_taxonomy.candidates) # signal categories with counts
print(result.recommendations) # prioritised action list
Formatters
Both output formats are available programmatically:
from scrapp_taxonomy import build_service, TextFormatter, JsonFormatter
service = build_service()
result = service.assess("https://cnnespanol.cnn.com/colombia")
print(TextFormatter().format(result)) # same as CLI text output
print(JsonFormatter(indent=2).format(result)) # same as CLI --output json
Custom signal extractors
Signal categories are injected — you can add new ones or remove defaults without touching the package source:
from scrapp_taxonomy import build_service
from scrapp_taxonomy.services.html_analyzer import (
DEFAULT_SIGNALS,
SignalSpec,
StandardHtmlAnalyzer,
_ParseResult,
)
from scrapp_taxonomy.services.assessment import ScrapeAssessmentService
from scrapp_taxonomy.infrastructure.http import (
HttpRobotsGateway, HttpPageGateway, UrlLibHttpClient,
)
from scrapp_taxonomy.services.robots import StandardRobotsPolicyReader
# Add a custom signal: links whose URL mentions YouTube or Vimeo (a rough proxy for video embeds)
video_signal = SignalSpec(
kind="video_embeds",
label="Video embeds",
extract=lambda r: [lnk for lnk in r.links if "youtube" in lnk or "vimeo" in lnk],
)
client = UrlLibHttpClient()
service = ScrapeAssessmentService(
robots_gateway=HttpRobotsGateway(client),
page_gateway=HttpPageGateway(client),
robots_reader=StandardRobotsPolicyReader(),
analyzer=StandardHtmlAnalyzer(signals=(*DEFAULT_SIGNALS, video_signal)),
)
result = service.assess("https://example.com")
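To make the injection idea concrete without the package installed, here is a self-contained sketch of the pattern. `SketchParseResult`, `SketchSignal`, and `run_signals` are hypothetical stand-ins, not the package's types; only the `kind`, `label`, and `extract` field names mirror the example above, and how the real analyzer applies its signals is not shown in this README.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SketchParseResult:
    """Stand-in for a parse result; only `links` is modelled here."""

    links: tuple[str, ...]


@dataclass(frozen=True)
class SketchSignal:
    """Stand-in with the same three fields as the example above."""

    kind: str
    label: str
    extract: Callable[[SketchParseResult], Sequence[str]]


def run_signals(
    parsed: SketchParseResult, signals: Sequence[SketchSignal]
) -> dict[str, int]:
    """Apply each signal's extractor and record how many items it matched."""
    return {signal.kind: len(signal.extract(parsed)) for signal in signals}


all_links = SketchSignal("links", "Links", lambda r: r.links)
video = SketchSignal(
    "video_embeds",
    "Video embeds",
    lambda r: [lnk for lnk in r.links if "youtube" in lnk or "vimeo" in lnk],
)

parsed = SketchParseResult(
    links=("https://example.com/a", "https://www.youtube.com/embed/xyz")
)
counts = run_signals(parsed, (all_links, video))
print(counts)
```

Because each signal is just data plus a callable, adding or removing one changes the tuple passed in and nothing else.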
Logging
The package uses logging.getLogger(__name__) throughout. To see internal activity:
import logging
logging.basicConfig(level=logging.DEBUG)
from scrapp_taxonomy import build_service
service = build_service()
result = service.assess("https://cnnespanol.cnn.com/colombia")
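Since the loggers are named after their modules, they should all sit under the `scrapp_taxonomy` namespace. That is an inference from `logging.getLogger(__name__)`, not something this README states outright. Under that assumption, the sketch below raises verbosity for the package only and leaves other libraries at the root level; `some_other_library` is a made-up logger name.

```python
import logging

# Keep the root logger at WARNING so unrelated libraries stay quiet.
logging.basicConfig(
    level=logging.WARNING,
    format="%(name)s %(levelname)s %(message)s",
)

# Assumption: every package logger is a child of "scrapp_taxonomy".
package_logger = logging.getLogger("scrapp_taxonomy")
package_logger.setLevel(logging.DEBUG)

# A child logger has no level of its own and inherits DEBUG from its parent.
child = logging.getLogger("scrapp_taxonomy.services.robots")
# An unrelated logger keeps following the root logger instead.
other = logging.getLogger("some_other_library")

print(child.isEnabledFor(logging.DEBUG), other.level == logging.NOTSET)
```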
Docker
Development
docker build -t scrapp-taxonomy:local .
docker run --rm scrapp-taxonomy:local assess https://cnnespanol.cnn.com/colombia
docker run --rm scrapp-taxonomy:local assess https://www.bbc.com/mundo --output json
Production
The image runs as a non-root user (scrapp, uid 1001). For production workloads, add read-only filesystem and privilege restrictions:
docker run --rm \
--read-only \
--security-opt=no-new-privileges \
--cap-drop=ALL \
ghcr.io/carlosjimenez88m/scrapp_taxonomy:latest \
assess https://cnnespanol.cnn.com/colombia --output json
Images are published to GHCR automatically by CI on every push to master and on version tags. Available tags:
| Tag | When |
|---|---|
| latest | Every push to master |
| 1.2.3 | When tag v1.2.3 is pushed |
| 1.2 | Same |
| sha-abc1234 | Every commit |
Architecture
src/scrapp_taxonomy/
├── domain/
│ └── models.py # Immutable dataclasses and enums — no I/O, no imports
├── ports.py # Protocol interfaces (RobotsGateway, PageGateway, Formatter…)
├── factory.py # Single wiring point for the object graph (DI entry)
├── formatters.py # TextFormatter and JsonFormatter implementations
├── infrastructure/
│ └── http.py # urllib-based HTTP client and gateway adapters
├── services/
│ ├── assessment.py # Orchestration: robots check → page fetch → recommendations
│ ├── html_analyzer.py # HTML parsing with injectable SignalSpec list
│ └── robots.py # robots.txt parsing and policy resolution
└── cli.py # argparse entry point with --log-level support
Each layer only imports from layers below it. factory.py is the one place that wires everything together; nothing else instantiates concrete classes directly.
To plug in a custom HTTP backend (httpx, requests, async), implement RobotsGateway and PageGateway from ports.py and pass them to ScrapeAssessmentService directly.
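The general shape of such a swap is sketched below with `typing.Protocol`. The method names and return types of the real ports are not shown in this README, so `SketchPageGateway.fetch`, `InMemoryPageGateway`, and `page_length` are made-up names that only illustrate structural typing; check ports.py for the actual interface before implementing a gateway.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SketchPageGateway(Protocol):
    """Hypothetical port; the real one in ports.py may look different."""

    def fetch(self, url: str) -> str: ...


class InMemoryPageGateway:
    """Serves canned HTML, e.g. for offline tests. No base class is needed."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    def fetch(self, url: str) -> str:
        return self._pages[url]


def page_length(gateway: SketchPageGateway, url: str) -> int:
    """Consumer that depends on the protocol, not on a concrete class."""
    return len(gateway.fetch(url))


gateway = InMemoryPageGateway({"https://example.com": "<h1>Hello</h1>"})
length = page_length(gateway, "https://example.com")
print(isinstance(gateway, SketchPageGateway), length)
```

Any object with a matching method satisfies the protocol, which is what allows an httpx- or requests-based adapter to be passed in without touching the service code.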
Development
uv sync --dev # install dev dependencies
make check # lint + type check + tests
make coverage # tests with HTML coverage report (opens htmlcov/)
make fmt # auto-format with ruff
make build # build wheel and sdist in dist/
make docker-build # build Docker image locally
make docker-run URL=https://cnnespanol.cnn.com/colombia # run against a URL
make pre-commit-install # install git hooks
Coverage is measured on every CI run and must stay at or above 80%.
CI/CD pipeline
All delivery steps run in a single workflow (.github/workflows/pipeline.yml):
quality (Python 3.11, 3.12, 3.13): lint → type check → test + coverage gate (≥ 80%)
↓
build-dist: wheel + sdist, smoke tests
docker: multi-arch image (amd64 + arm64), push to GHCR on non-PR
↓
publish (v* tags only): PyPI via Trusted Publishing
Releasing
Releases are triggered by a version tag:
git tag -a v0.1.1 -m "v0.1.1"
git push origin v0.1.1
The pipeline builds the distributions, smoke-tests them in an isolated environment, and publishes to PyPI using Trusted Publishing — no API token stored in GitHub secrets.
One-time setup before the first release:
- GitHub → repo Settings → Environments → New environment → name: pypi
- PyPI → your project → Settings → Trusted Publishers → add this repo and workflow
Scope
This tool checks technical signals — it does not substitute for reading a website's terms of service, understanding copyright restrictions, or complying with applicable data-protection regulations. robots.txt is treated as the first boundary for respectful crawling, not the only one.