Self-hosted web scraping and Markdown extraction for AI agents.

AgentCrawl

🕷️ Self-hosted web extraction for AI agents.

AgentCrawl turns web pages and local documents into clean Markdown, text, links, metadata, JSON-LD, and crawl results that agents can actually use. Run it as a CLI, Python library, HTTP API, Docker service, or MCP server. Your cache, jobs, retries, failures, and extracted data stay in your environment.

pip install agentcrawl-ai
agentcrawl scrape https://example.com

Pick your path 🚀

For agents: MCP

python -m pip install "agentcrawl-ai[mcp]"
agentcrawl doctor
agentcrawl mcp

MCP tools cover scrape_url, map_site, crawl_site, job status, cancellation, event history, failure inspection, selective retries, usage, and cache control. Coding agents should follow INSTALL_FOR_AGENTS.md.

For developers: Python + CLI

pip install agentcrawl-ai
agentcrawl scrape https://example.com

from agentcrawl import AgentCrawl

crawler = AgentCrawl({"fetcher": "http"})
document = crawler.scrape("https://example.com")

print(document.markdown)
print(document.metadata)

For servers: Docker + API

docker run --rm -p 8000:8000 \
  -e AGENTCRAWL_API_KEYS="replace-with-a-long-random-key" \
  ghcr.io/jorg18/agentcrawl:latest

curl http://127.0.0.1:8000/health

Or with Compose:

cp .env.example .env
# Replace AGENTCRAWL_API_KEYS and AGENTCRAWL_API_KEY in .env
docker compose up -d
curl http://127.0.0.1:8000/health

Why AgentCrawl? ✨

Agents need fresh web context, but raw HTML is noisy and one-off scraper scripts age badly. AgentCrawl gives them a reliable extraction layer with the operational pieces already built in:

🎯 Known URL in, clean Markdown out — main-content extraction, table preservation, fenced code blocks, links, metadata, and provenance.
⚡ HTTP first — fast default extraction without a browser runtime.
🧱 Durable crawls — SQLite-backed jobs with checkpoints, pagination, cancellation, events, retries, and failure inspection.
🗄️ Local state — cache, usage, jobs, events, crawl failures, and extracted documents stay in your environment.
🔐 Safer API defaults — bearer auth, robots.txt support, SSRF protections, unsafe redirect blocking, and private-network controls.
🤖 Agent-native interfaces — CLI, Python, HTTP API, Docker, and MCP.

What Community includes

AgentCrawl Community is the self-hosted trust layer:

Included	Notes
CLI	Scrape, crawl, inspect jobs, manage cache, backup, restore.
Python library	Local use from scripts and agent runtimes.
HTTP API	FastAPI server for self-hosted deployments.
MCP	Standards-based stdio MCP server for agent clients.
Docker / GHCR	Public image built and smoke-tested by GitHub Actions.
Durable crawls	SQLite jobs, events, checkpoints, retries, and failure records.
Quality extraction	Markdown, links, metadata, JSON-LD/provenance, tables, code blocks.
Basic browser fallback	Optional local browser/Camofox path, not required for the default image.
Lightweight docs	Install, examples, operations, release, quality notes.

Community is self-hosted. A hosted AgentCrawl service is planned for users who want managed browsers, proxies, schedules, webhooks, datasets, teams, billing, and enterprise controls.

Extraction quality 🧹

The Community engine focuses on stable, agent-ready Markdown before benchmark claims:

selects semantic content from <main>, <article>, documentation/content containers, or text-rich fallback blocks;
removes unsafe and noisy page chrome such as scripts, styles, hidden content, nav, footer, cookie banners, sidebars, and related-post blocks;
preserves Markdown tables with headers and cell values;
preserves fenced code blocks and language tags from common classes such as language-python and lang-javascript;
attaches extraction provenance such as source/final URL, selected content hint, selection score, candidate count, content hash, extraction strategy, JSON-LD/schema fields, and output size/structure metadata;
validates extraction quality against checked-in fixtures with a minimum score threshold.
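
The language-tag rule in the list above can be illustrated with a small sketch. This is not AgentCrawl's extractor: the function names, the regular expression, and the choice to return the first matching class are assumptions. Only the language- and lang- class prefixes come from the text.

```python
import re

# Matches class names such as "language-python" or "lang-javascript".
_LANGUAGE_CLASS = re.compile(r"^(?:language|lang)-([A-Za-z0-9_+#.-]+)$")

# Built from single backticks so this example does not contain a literal fence.
FENCE = "`" * 3


def language_from_classes(class_names):
    """Return the language tag from the first matching class, or "" if none match."""
    for name in class_names:
        match = _LANGUAGE_CLASS.match(name)
        if match:
            return match.group(1).lower()
    return ""


def fenced_block(code, class_names=()):
    """Wrap code in a Markdown fence, tagging the language when one is known."""
    language = language_from_classes(class_names)
    return f"{FENCE}{language}\n{code.rstrip()}\n{FENCE}"


print(fenced_block("print('hello')\n", ["highlight", "language-python"]))
```

A block with no recognised class still gets a fence, just without a language tag.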

Run the report locally:

python benchmarks/quality_report.py

HTTP API 🌐

Authentication is enabled by default. Configure at least one API key before exposing the server:

export AGENTCRAWL_API_KEYS="replace-with-a-long-random-key"
python -m pip install "agentcrawl-ai[server]"
agentcrawl serve --host 0.0.0.0 --port 8000

Health check:

curl http://127.0.0.1:8000/health

Scrape a URL:

curl http://127.0.0.1:8000/v1/scrape \
  -H "authorization: Bearer replace-with-a-long-random-key" \
  -H "content-type: application/json" \
  -d '{"url":"https://example.com","formats":["markdown","links","metadata"]}'
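
For Python clients that prefer the standard library over curl, the same request can be sketched as follows. The endpoint, headers, and JSON fields mirror the curl example above; the helper name build_scrape_request is an assumption, and the response schema is not described here, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_scrape_request(base_url, api_key, url, formats):
    """Build (but do not send) a POST /v1/scrape request."""
    body = json.dumps({"url": url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/scrape",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_scrape_request(
    "http://127.0.0.1:8000",
    "replace-with-a-long-random-key",  # placeholder key, as in the curl example
    "https://example.com",
    ["markdown", "links", "metadata"],
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen(request) sends it to a running server; decode the response body with json.loads.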

Main endpoints:

GET    /health
POST   /v1/scrape
POST   /v1/map
POST   /v1/crawl
GET    /v1/jobs/{job_id}
GET    /v1/jobs/{job_id}/events
DELETE /v1/jobs/{job_id}
GET    /v1/failures
GET    /v1/jobs/{job_id}/failures
POST   /v1/jobs/{job_id}/failures/retry
POST   /v1/extract
GET    /v1/usage
GET    /v1/stats
DELETE /v1/cache

OpenAPI docs are available at /docs when the server is running.

Crawl jobs 🧭

Start an asynchronous crawl:

agentcrawl --remote crawl https://example.com --max-pages 25 --max-depth 2

HTTP clients can attach an idempotency key so retries return the original job instead of starting a duplicate:

curl http://127.0.0.1:8000/v1/crawl \
  -H "authorization: Bearer replace-with-a-long-random-key" \
  -H "content-type: application/json" \
  -H "Idempotency-Key: docs-crawl-2026-06-06" \
  -d '{"url":"https://example.com","max_pages":25,"max_depth":2}'
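
Any stable string works as a key. One option, sketched here as an assumption rather than an AgentCrawl convention, is to derive the key from the crawl parameters so that a retried request with the same body always reuses the same key. The helper name make_idempotency_key and the 16-character digest length are hypothetical.

```python
import hashlib
import json


def make_idempotency_key(prefix, payload):
    """Derive a deterministic key from a prefix and the request body."""
    # Canonical JSON: sorted keys and fixed separators, so key order does not matter.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"


payload = {"url": "https://example.com", "max_pages": 25, "max_depth": 2}
headers = {
    "Authorization": "Bearer replace-with-a-long-random-key",
    "Content-Type": "application/json",
    "Idempotency-Key": make_idempotency_key("docs-crawl", payload),
}
print(headers["Idempotency-Key"])
```

Because identical parameters always produce the same key, include something that changes (for example a date in the prefix) when a fresh crawl of the same site is intended.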

Running jobs checkpoint their queue, visited URLs, retry attempts, progress, and extracted documents in SQLite. Transient page failures are retried with persisted exponential backoff, so a waiting retry does not occupy a crawl worker. Interrupted work is reclaimed from this persisted state after a service restart.
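
The retry schedule itself is not documented here. As a general illustration of persisted exponential backoff, the sketch below computes the time of the next attempt so that it can be stored and checked later, instead of sleeping inside a worker. The base delay, cap, and function names are assumptions, not AgentCrawl's values.

```python
from datetime import datetime, timedelta, timezone


def backoff_delay_seconds(attempt, base=2.0, cap=300.0):
    """Delay before retry number `attempt` (1-based): base * 2**(attempt - 1), capped."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    exponent = min(attempt - 1, 32)  # bound the exponent so the product stays finite
    return min(cap, base * 2 ** exponent)


def next_attempt_at(failed_at, attempt):
    """Timestamp to persist; a scheduler picks the page up once this time has passed."""
    return failed_at + timedelta(seconds=backoff_delay_seconds(attempt))


failed_at = datetime(2026, 1, 1, tzinfo=timezone.utc)
for attempt in (1, 2, 3):
    print(attempt, next_attempt_at(failed_at, attempt).isoformat())
```

Storing the timestamp rather than holding a timer is what lets a pending retry survive a restart.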

Read completed documents page by page:

agentcrawl --remote job JOB_ID --offset 0 --limit 100
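
The same offset/limit paging can be driven from a script. In this sketch, fetch_page is a hypothetical callable that returns the list of documents for one page (for example, a wrapper around the job endpoint). Its name and return shape are assumptions, since the response format is not described here.

```python
def iter_documents(fetch_page, limit=100):
    """Yield every document by advancing the offset until a short page arrives."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:  # a short or empty page means there is nothing left
            break
        offset += limit


# Stand-in data source so the sketch runs without a server.
fake_documents = [f"doc-{i}" for i in range(250)]


def fake_fetch_page(offset, limit):
    return fake_documents[offset:offset + limit]


print(len(list(iter_documents(fake_fetch_page, limit=100))))
```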

Inspect or cancel a job:

agentcrawl --remote job JOB_ID
agentcrawl --remote job-cancel JOB_ID

/v1/stats reports queue readiness, delayed retries, running and cancelling jobs, crawl failures by status, open retryable failures, and open failures by error type.

Local documents 📄

Community supports local document ingestion without sending file contents to a hosted parser:

agentcrawl scrape ./notes.md
agentcrawl scrape ./data.json
agentcrawl scrape ./feed.xml
python -m pip install "agentcrawl-ai[docs]"
agentcrawl scrape ./report.pdf

Current document support:

Input	Support
HTML	Main-content Markdown extraction.
Markdown	Passed through as Markdown.
Text	Passed through as plain Markdown text.
JSON	Pretty-printed inside a fenced `json` block.
XML/RSS/Atom	Preserved inside a fenced `xml` block.
PDF	Extracted page-by-page to Markdown with the optional `docs` extra. Enforces size/page safety limits and rejects encrypted PDFs.

Browser rendering

The default package and default Docker image use HTTP extraction. Add browser rendering only when a site needs JavaScript:

python -m pip install "agentcrawl-ai[browser]"
playwright install chromium

AgentCrawl also supports an optional external Camofox REST backend:

export AGENTCRAWL_BROWSER_BACKEND=camofox
export AGENTCRAWL_CAMOFOX_URL=http://127.0.0.1:9377
export AGENTCRAWL_CAMOFOX_ACCESS_KEY=replace-if-access-control-is-enabled

Cache ⚡

Disable cache for one scrape or choose a TTL of up to 30 days:

{"url":"https://example.com","cache":false}

{"url":"https://example.com","cache_ttl_seconds":3600}
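
A client can check the TTL bound before sending a request. The sketch below builds the two request bodies shown above; the field names cache and cache_ttl_seconds come from those examples, while the helper name and the client-side validation are assumptions (how the server handles out-of-range values is not stated here).

```python
import json

MAX_CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days, the upper bound stated above


def scrape_payload(url, cache=True, cache_ttl_seconds=None):
    """Return the JSON body for a scrape request with explicit cache settings."""
    payload = {"url": url}
    if not cache:
        payload["cache"] = False
        return payload
    if cache_ttl_seconds is not None:
        if not 0 < cache_ttl_seconds <= MAX_CACHE_TTL_SECONDS:
            raise ValueError("cache_ttl_seconds must be between 1 and 2592000")
        payload["cache_ttl_seconds"] = int(cache_ttl_seconds)
    return payload


print(json.dumps(scrape_payload("https://example.com", cache=False)))
print(json.dumps(scrape_payload("https://example.com", cache_ttl_seconds=3600)))
```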

Clear all cache entries or filter by domain or exact URL:

agentcrawl --remote cache-clear
agentcrawl --remote cache-clear --domain example.com
agentcrawl --remote cache-clear --url https://example.com/page

Backups 💾

Use SQLite online backup before deployment or migration:

agentcrawl backup --db agentcrawl.db --output-dir ./backups

Pass --env-file to copy a protected environment file into the backup directory without printing secret values. Restore verifies the backup before copying and refuses to overwrite an existing database unless --force is provided:

agentcrawl restore --backup-db ./backups/agentcrawl-YYYYMMDD-HHMMSS.db --db agentcrawl.db --force
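
For readers unfamiliar with SQLite's online backup mechanism, Python's standard library exposes it as sqlite3.Connection.backup. The sketch below shows the mechanism in general terms and is not the implementation of agentcrawl backup; the function name and the integrity check step are assumptions.

```python
import sqlite3


def online_backup(source_db, backup_db):
    """Copy a live SQLite database to backup_db and report whether the copy is intact."""
    source = sqlite3.connect(source_db)
    target = sqlite3.connect(backup_db)
    try:
        # The backup API copies pages consistently even while the source is in use.
        source.backup(target)
        result = target.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        target.close()
        source.close()
    return result == "ok"
```

A call such as online_backup("agentcrawl.db", "./backups/copy.db") returns True when the copy passes the integrity check. Note that sqlite3.connect creates an empty database if the source path does not exist, so check the path first.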

Security defaults

The HTTP server rejects local file paths, localhost, private networks, non-HTTP schemes, embedded URL credentials, and redirects to non-global addresses. Local files remain available through the CLI and Python library.

Do not expose the API without authentication, TLS, request limits, and network controls. See SECURITY.md and docs/OPERATIONS.md.
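
To make the rejection rules concrete, here is a minimal sketch of this kind of URL screening using only the standard library. It is not AgentCrawl's validator: the function name and reason strings are assumptions, and a real SSRF defence must also resolve hostnames and re-check every redirect target, which this sketch does not do.

```python
import ipaddress
from urllib.parse import urlsplit


def rejection_reason(url):
    """Return why a URL would be refused, or None if it passes these static checks."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return "non-HTTP scheme"  # also covers local paths, which have no scheme
    if parts.username is not None or parts.password is not None:
        return "embedded credentials"
    host = parts.hostname
    if not host:
        return "missing host"
    if host == "localhost" or host.endswith(".localhost"):
        return "localhost"
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return None  # a hostname: DNS resolution and a second check would go here
    if not address.is_global:
        return "non-global address"  # loopback, private, link-local, and similar ranges
    return None


for candidate in ("https://example.com", "http://127.0.0.1:8000/", "file:///etc/passwd"):
    print(candidate, "->", rejection_reason(candidate))
```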

Docs you'll actually use 📚

Install for agents: canonical setup flow for coding agents.
Examples: copy-paste workflows for CLI, Python, HTTP, MCP, Docker, and agents.
Quality benchmarks: how extraction quality is measured and reported.
Operations: deployment, backup, restore, and production checks.
Release checklist: PyPI/GHCR release validation and smoke tests.
Comparison: choose between AgentCrawl, Firecrawl, Crawl4AI, ScrapeGraphAI, Jina Reader, Crawlee, and Stagehand.

Optional LLM extraction

AgentCrawl Community does not require an LLM for scraping, crawling, API, Docker, or MCP usage. The legacy prompt-driven AgentCrawler.extract() path is optional: install agentcrawl-ai[llm] and configure llm or llm_model before using it. Built-in web search is disabled by default; keep search in your agent/provider layer unless you explicitly configure a search backend.

Development

pip install -e ".[server,mcp,llm,dev]"
pytest -q
ruff check agentcrawl tests examples benchmarks

Roadmap

See ROADMAP.md.

License

AgentCrawl Community is licensed under the Apache License 2.0. Commercial modules and hosted services are separate products and are not included in this repository.

Release history

0.1.3

Jun 28, 2026

0.1.2

Jun 28, 2026

0.1.1

Jun 27, 2026

0.1.0 (this version)

Jun 11, 2026

Download files

Source Distribution

agentcrawl_ai-0.1.0.tar.gz (3.5 MB)

Uploaded Jun 11, 2026 Source

Built Distribution

agentcrawl_ai-0.1.0-py3-none-any.whl (61.0 kB)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file agentcrawl_ai-0.1.0.tar.gz.

File metadata

Download URL: agentcrawl_ai-0.1.0.tar.gz
Upload date: Jun 11, 2026
Size: 3.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for agentcrawl_ai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`efecf24db21f98ab8f67cfe3e207ee89ab1a3c068c91ad39a83171e4b8b3f017`
MD5	`46d86b7976cbbdb1fd175bb001e5838a`
BLAKE2b-256	`fe25b5055a41ef13b505afbe35d51b280da3f6bce331b7aee2a334e09ef1fceb`

File details

Details for the file agentcrawl_ai-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentcrawl_ai-0.1.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 61.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for agentcrawl_ai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`af359e5eb60884236cee5a32f630860f0d05bbf312110211e428f890e7e888ed`
MD5	`9e0e9c0cfea892e1238ad90ab211d53a`
BLAKE2b-256	`0619e0198b32116ed857d60deef312c37fb506bcf6545b45d261784df92db5a9`

agentcrawl-ai 0.1.0
