Advanced web fetching, scraping, and content acquisition toolkit with crawl-scrape-download pipeline
fetchkit
Advanced web fetching, scraping, and content acquisition toolkit for Python.
From simple HTTP requests to full crawl-scrape-download pipelines backed by Postgres and MinIO.
Installation | Quick Start | CLI | Pipeline | Documentation | Examples
Highlights
pip install fetchkit # Core: fetch, scrape, headers
pip install 'fetchkit[pipeline]' # + Postgres job queue + MinIO storage
pip install 'fetchkit[full]' # Everything including yt-dlp, Playwright, etc.
Fetch with realistic browser headers
from pyfetcher import fetch
response = fetch("https://example.com")
# Sends Chrome-like headers with Client Hints,
# Sec-Fetch-*, UA rotation automatically
Scrape anything
from pyfetcher.scrape import (
    extract_links, extract_text,
    extract_readable_text,
)
links = extract_links(html, base_url=url)
titles = extract_text(html, "h1")
article = extract_readable_text(html)
4 HTTP backends
from pyfetcher import FetchRequest, fetch
# TLS fingerprinting (bypass bot detection)
fetch(FetchRequest(url=url, backend="curl_cffi"))
# Cloudflare bypass
fetch(FetchRequest(url=url, backend="cloudscraper"))
Download media with yt-dlp
from pyfetcher.downloaders.ytdlp import YtdlpDownloader
dl = YtdlpDownloader()
info = await dl.extract_info(video_url)
results = await dl.download(video_url,
                            output_dir="./media")
Features
Core Library
| Feature | Description |
|---|---|
| Browser Headers | 11 profiles (Chrome/Firefox/Safari/Edge) across 5 platforms. Consistent UA + Client Hints + Sec-Fetch-*. Market-share-weighted rotation. |
| 4 Backends | httpx (default, HTTP/2), aiohttp (async), curl_cffi (TLS fingerprint), cloudscraper (CF bypass) |
| Rate Limiting | Per-domain + global token bucket with configurable burst |
| Retry | Exponential backoff via Tenacity with configurable status codes |
| Scraping | CSS selectors, link harvesting, form parsing, table extraction |
| Metadata | HTML meta, Open Graph, JSON-LD, microdata, RDFa, Dublin Core |
| CLI | pyfetcher fetch, scrape, headers, user-agent, robots, download |
| TUI | Interactive Textual terminal UI for building and inspecting requests |
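The Retry row above mentions exponential backoff with configurable status codes. The following is a minimal standard-library sketch of that idea, not fetchkit's Tenacity-based implementation; the names `RETRY_STATUSES`, `backoff_delays`, and `retry_call` are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Iterable

# Hypothetical default: status codes treated as transient failures.
RETRY_STATUSES = frozenset({429, 500, 502, 503, 504})


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**n, capped at `cap` seconds."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def retry_call(
    send: Callable[[], int],
    max_retries: int = 3,
    retry_on: Iterable[int] = RETRY_STATUSES,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a status outside `retry_on` or retries run out."""
    retry_on = frozenset(retry_on)
    status = send()
    for delay in backoff_delays(max_retries):
        if status not in retry_on:
            break
        sleep(delay)  # wait longer after every failed attempt
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production retry policy would usually also add jitter.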
Infrastructure (optional extras)
| Feature | Extra | Description |
|---|---|---|
| Pipeline | [pipeline] | Event-driven Crawl -> Scrape -> Download via Postgres LISTEN/NOTIFY |
| Database | [db] | SQLAlchemy 2.0 async + Alembic. Jobs, pages, media, hosts, feeds, URL dedup |
| Object Store | [store] | MinIO/S3 via aioboto3. Upload, download, presigned URLs |
| Downloaders | [downloaders] | yt-dlp (progress hooks, info_dict) + gallery-dl (170+ sites) |
| Extractors | [extractors] | trafilatura + readability-lxml fallback, html2text, markdownify |
| Media | [media] | Audio (mutagen), video (pymediainfo), image (exifread), PDF (pypdf) |
| Browser | [browser] | Playwright + stealth for JS-heavy sites |
| Feeds | [feeds] | RSS/Atom monitoring with adaptive polling |
| Crawler | [pipeline] | URL frontier, spider + router, dedup, politeness, sitemap discovery |
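The Crawler row lists a URL frontier with dedup. As a rough illustration of how such a frontier can avoid queueing the same page twice, here is a standard-library sketch; the normalization rules and the `normalize_url` and `Frontier` names are assumptions made for this example, not fetchkit's actual behavior.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonical form used as the dedup key (hypothetical rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


class Frontier:
    """FIFO queue of URLs that ignores ones it has already seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._queue: deque[str] = deque()

    def add(self, url: str) -> bool:
        """Queue the URL; return False if an equivalent URL was seen before."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def pop(self) -> str:
        return self._queue.popleft()
```

A persistent crawler would keep the seen-set in the database rather than in memory, which is presumably what the URL dedup in the Database row refers to.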
Installation
pip install fetchkit
All optional extras:
pip install 'fetchkit[tui]' # Textual TUI
pip install 'fetchkit[curl]' # curl_cffi TLS fingerprinting
pip install 'fetchkit[cloudscraper]' # Cloudflare bypass
pip install 'fetchkit[db]' # Postgres + SQLAlchemy + Alembic
pip install 'fetchkit[store]' # MinIO/S3 object storage
pip install 'fetchkit[pipeline]' # db + store (full pipeline)
pip install 'fetchkit[downloaders]' # yt-dlp + gallery-dl
pip install 'fetchkit[extractors]' # trafilatura, readability, html2text
pip install 'fetchkit[media]' # Audio/video/image/PDF metadata
pip install 'fetchkit[browser]' # Playwright + stealth
pip install 'fetchkit[feeds]' # RSS/Atom feed parsing
pip install 'fetchkit[mcp]' # MCP server for AI agents
pip install 'fetchkit[full]' # Everything
Quick Start
Fetch
from pyfetcher import fetch, afetch, FetchRequest
import asyncio
# Sync
response = fetch("https://example.com")
print(response.status_code, response.ok)
# Async
response = asyncio.run(afetch("https://example.com"))
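When many URLs are fetched asynchronously, it usually helps to cap how many requests are in flight at once. The sketch below shows one common way to do that with `asyncio.Semaphore`. It uses a stand-in coroutine, `fake_fetch`, so it runs without network access; `fetch_all` and `fake_fetch` are hypothetical names, and in real use an async fetch function such as `afetch` would take the stub's place.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch_one: Callable[[str], Awaitable[int]],
    concurrency: int = 5,
) -> dict[str, int]:
    """Run `fetch_one` for every URL with at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def guarded(url: str) -> tuple[str, int]:
        async with gate:  # blocks while `concurrency` calls are already running
            return url, await fetch_one(url)

    pairs = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(pairs)


async def fake_fetch(url: str) -> int:
    """Stand-in for a real async fetch; always reports status 200."""
    await asyncio.sleep(0)
    return 200
```

A call such as `asyncio.run(fetch_all(urls, fake_fetch, concurrency=3))` returns a mapping from each URL to its status.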
Browser Profiles & Headers
from pyfetcher.headers.browser import BrowserHeaderProvider
from pyfetcher.headers.rotating import RotatingHeaderProvider
from pyfetcher.headers.ua import random_user_agent
from pyfetcher.fetch.service import FetchService
# Fixed profile (Chrome on Windows)
service = FetchService(header_provider=BrowserHeaderProvider("chrome_win"))
# Rotating profiles weighted by real-world market share
service = FetchService(header_provider=RotatingHeaderProvider())
# Just need a user-agent string?
ua = random_user_agent(browser="firefox", platform="macOS")
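Market-share-weighted rotation can be pictured as a weighted random choice over profile names. The sketch below illustrates the idea with `random.choices`. Apart from `chrome_win`, the profile names and all of the weights are made up for this example and do not reflect fetchkit's real data.

```python
import random
from typing import Optional

# Hypothetical weights for illustration only; not fetchkit's real data.
PROFILE_WEIGHTS = {
    "chrome_win": 0.50,
    "safari_mac": 0.20,
    "edge_win": 0.15,
    "firefox_win": 0.15,
}


def pick_profile(rng: Optional[random.Random] = None) -> str:
    """Pick one profile name, favoring those with a larger weight."""
    source = rng or random
    names = list(PROFILE_WEIGHTS)
    weights = list(PROFILE_WEIGHTS.values())
    return source.choices(names, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` makes the selection reproducible, which is convenient in tests.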
Scraping
from pyfetcher.scrape import (
extract_links, extract_text, extract_table,
extract_forms, extract_readable_text,
)
from pyfetcher.scrape.robots import parse_robots_txt, is_allowed
# CSS selectors
titles = extract_text(html, "h1.title")
rows = extract_table(html, "table.data")
# Links with internal/external classification
links = extract_links(html, base_url=url, same_domain_only=True)
# Forms with field extraction
forms = extract_forms(html, base_url=url)
print(forms[0].action, forms[0].to_dict())
# Robots.txt
rules = parse_robots_txt(robots_content)
allowed = is_allowed(rules, "/admin", user_agent="MyBot")
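For comparison, the same kind of robots.txt check can be done with Python's standard-library `urllib.robotparser`. This sketch is independent of fetchkit's `parse_robots_txt` and `is_allowed`; the `ROBOTS_TXT` sample and the `allowed` helper are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; the standard-library parser applies the first matching rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Allow: /
"""


def allowed(robots_txt: str, path: str, user_agent: str = "MyBot") -> bool:
    """Return True if `user_agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

Because the standard-library parser stops at the first rule that matches, the order of `Allow` and `Disallow` lines matters; other parsers may resolve conflicts by longest match instead.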
Rate-Limited Fetching
from pyfetcher.fetch.service import FetchService
from pyfetcher.ratelimit.limiter import DomainRateLimiter, RateLimitPolicy
limiter = DomainRateLimiter(
default_policy=RateLimitPolicy(requests_per_second=2.0, burst=5),
domain_policies={
"api.example.com": RateLimitPolicy(requests_per_second=0.5),
},
)
service = FetchService(rate_limiter=limiter)
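The `requests_per_second` and `burst` settings above map naturally onto a token bucket. The following is a minimal sketch of that algorithm, assuming a non-blocking `try_acquire`; the `TokenBucket` class is hypothetical and is not fetchkit's `DomainRateLimiter`.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(
        self,
        rate: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last call, up to capacity."""
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-domain limiter could keep one such bucket per hostname plus a shared global bucket, and require a token from both before sending a request.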
Content Extraction
from pyfetcher.extractors.content import extract_article_text
from pyfetcher.extractors.convert import html_to_markdown, html_to_plaintext
# Article text (trafilatura with readability-lxml fallback)
article = extract_article_text(html, url="https://example.com/post")
# HTML -> Markdown
md = html_to_markdown(html)
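To give a feel for what plain-text conversion involves, here is a deliberately small standard-library sketch that drops script, style, and navigation content and joins the remaining text. It is far cruder than trafilatura or html2text, and the `_TextExtractor` and `plain_text` names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping elements that rarely hold content."""

    SKIP = {"script", "style", "nav", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def plain_text(html: str) -> str:
    """Return the visible text of `html` as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Real article extractors go further by scoring blocks to separate the main content from sidebars and footers.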
yt-dlp & gallery-dl
import asyncio
from pyfetcher.downloaders.ytdlp import YtdlpDownloader
from pyfetcher.downloaders.gallerydl import GalleryDlDownloader
async def main() -> None:
    # yt-dlp with progress tracking
    url = "https://youtube.com/watch?v=dQw4w9WgXcQ"
    yt = YtdlpDownloader()
    info = await yt.extract_info(url)
    results = await yt.download(url, output_dir="./videos",
        progress_callback=lambda p: print(f"{p.status}: {p.percent}"))
    # gallery-dl for image galleries (170+ supported sites)
    gdl = GalleryDlDownloader()
    results = await gdl.download("https://imgur.com/gallery/...", output_dir="./images")
asyncio.run(main())
CLI
# Fetch with any backend
pyfetcher fetch https://example.com
pyfetcher fetch https://example.com -o json -b curl_cffi
# Preview generated headers
pyfetcher headers --profile chrome_win
pyfetcher headers --browser firefox -o json
pyfetcher headers --list
# Scrape content
pyfetcher scrape https://example.com --css "h1"
pyfetcher scrape https://example.com --links -o json
pyfetcher scrape https://example.com --text
pyfetcher scrape https://example.com --meta
# Random user-agents
pyfetcher user-agent --browser chrome --count 5
pyfetcher user-agent --mobile
# Check robots.txt
pyfetcher robots https://example.com -p /admin
# Download files
pyfetcher download https://example.com/file.pdf ./file.pdf
Pipeline
The event-driven pipeline connects three stages via Postgres LISTEN/NOTIFY:
Seeds / RSS / Sitemap
      |
[Crawl Stage] ──NOTIFY──> [Scrape Stage] ──NOTIFY──> [Download Stage]
      |                         |                           |
      v                         v                           v
pages table               pages (enriched)           media_assets
+ new crawl jobs          + download jobs            + MinIO objects
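The handoff in the diagram can be imitated in-process to see how the stages chain together. The sketch below uses `asyncio.Queue` as a stand-in for Postgres LISTEN/NOTIFY, so it is only an analogy: no database is involved, and `STOP`, `stage`, and `run_pipeline` are hypothetical names.

```python
import asyncio

STOP = object()  # sentinel that tells a stage to shut down


async def stage(name, inbox, outbox, log):
    """Consume jobs from `inbox`, record them, and notify the next stage."""
    while True:
        job = await inbox.get()
        if job is STOP:
            if outbox is not None:
                await outbox.put(STOP)  # propagate shutdown downstream
            return
        log.append((name, job))
        if outbox is not None:
            await outbox.put(job)  # plays the role of NOTIFY


async def run_pipeline(seeds):
    """Push seed URLs through crawl -> scrape -> download and return the log."""
    crawl_q, scrape_q, download_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    log = []
    workers = [
        asyncio.create_task(stage("crawl", crawl_q, scrape_q, log)),
        asyncio.create_task(stage("scrape", scrape_q, download_q, log)),
        asyncio.create_task(stage("download", download_q, None, log)),
    ]
    for url in seeds:
        await crawl_q.put(url)
    await crawl_q.put(STOP)
    await asyncio.gather(*workers)
    return log
```

Each stage only reacts when work arrives, which is the property that makes the real pipeline event-driven rather than polling-based.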
Setup
make infra-up # Start Postgres + MinIO
make migrate # Run Alembic migrations
make pipeline # Start all workers
Programmatic
import asyncio
from pyfetcher.pipeline.runner import PipelineRunner
from pyfetcher.config import PyfetcherConfig
async def main() -> None:
    runner = PipelineRunner(PyfetcherConfig(
        crawl_concurrency=10,
        scrape_concurrency=20,
        download_concurrency=5,
    ))
    await runner.start()
asyncio.run(main())
Custom Spiders
from pyfetcher.crawler.spider import Spider, SpiderResult
spider = Spider(name="my-spider")
@spider.router.add(r"/blog/\d{4}/")
async def handle_post(url, response):
    return SpiderResult(
        discovered_urls=[...],
        items=[{"title": "...", "content": "..."}],
    )
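A decorator-based router like the one used above can be thought of as a list of compiled patterns paired with handlers. The sketch below shows one plausible shape, assuming first-match-wins dispatch with `re.search`; the `Router` class here is a hypothetical stand-in, not the object behind `spider.router`.

```python
import re
from typing import Callable, Optional


class Router:
    """Map URL regexes to handler functions; the first match wins."""

    def __init__(self) -> None:
        self._routes: list[tuple[re.Pattern, Callable]] = []

    def add(self, pattern: str) -> Callable[[Callable], Callable]:
        """Decorator that registers the function for URLs matching `pattern`."""
        def decorator(handler: Callable) -> Callable:
            self._routes.append((re.compile(pattern), handler))
            return handler  # leave the function usable under its own name
        return decorator

    def resolve(self, url: str) -> Optional[Callable]:
        """Return the first handler whose pattern occurs in `url`, else None."""
        for regex, handler in self._routes:
            if regex.search(url):
                return handler
        return None


router = Router()


@router.add(r"/blog/\d{4}/")
def handle_post(url: str) -> str:
    return f"post:{url}"
```

With first-match-wins dispatch, more specific patterns should be registered before general ones.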
MCP Server (AI Agent Integration)
fetchkit ships with an MCP server that exposes its capabilities to AI agents (Claude, LangChain, LangGraph, and any MCP-compatible client). Agents can then fetch, scrape, extract, and download through tool calls, without custom code.
Why MCP?
Traditional scraping requires writing code for every site. With fetchkit's MCP server, an AI agent can:
- Autonomously research topics by fetching pages, extracting content, and following links
- Audit websites by checking metadata, robots.txt, sitemaps, and page structure
- Extract structured data from any page using CSS selectors, table parsing, or article extraction
- Download media with progress tracking and checksum verification
- Generate realistic requests using browser profiles that pass bot detection
All 16 tools return structured Pydantic models so the LLM gets clean, typed data -- not raw HTML.
Quick Start
pip install 'fetchkit[mcp]'
# Run as stdio server (Claude Desktop / Claude Code)
pyfetcher-mcp
# Run as HTTP server (LangChain / remote agents)
pyfetcher-mcp --http 8000
# Or via Makefile
make mcp # stdio
make mcp-http # HTTP on port 8000
Available Tools (16)
| Tool | What it does |
|---|---|
| fetch_url | Fetch any URL with browser headers, returns status + body + timing |
| fetch_multiple | Batch fetch with concurrency control |
| scrape_css | Extract content via CSS selectors |
| scrape_links | Harvest links with internal/external classification |
| scrape_text | Extract readable text (strips scripts, nav, etc.) |
| scrape_metadata | Title, description, Open Graph, favicons |
| scrape_forms | Parse forms with fields and default values |
| scrape_table | Extract HTML table data as rows |
| check_robots | Check robots.txt rules for any path |
| parse_sitemap | Parse XML sitemaps |
| generate_headers | Preview full browser header sets |
| list_profiles | Show all 11 browser profiles |
| random_user_agent | Generate random realistic UAs |
| extract_article | Article text + markdown via trafilatura |
| convert_html | HTML -> markdown or plaintext |
| download_file | Download with checksum verification |
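Checksum verification, as mentioned for download_file, generally means hashing the saved file and comparing the result with an expected digest. Here is a minimal sketch using `hashlib`, assuming SHA-256 and a hex-encoded expected value; the `sha256_of` and `verify` helpers are hypothetical and say nothing about which algorithm the tool actually uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash the file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA-256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

The same approach can be used to check a downloaded release file against a published SHA256 digest.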
Resources & Prompts
Resources expose data for context: pyfetcher://profiles, pyfetcher://backends, pyfetcher://version.
Prompts provide templates: web_research, site_audit, scrape_guide, compare_pages.
Use with LangChain
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
client = MultiServerMCPClient({
    "pyfetcher": {"transport": "http", "url": "http://localhost:8000/mcp"}
})
tools = await client.get_tools() # 16 LangChain tools ready to use
# Build an agent (model is any LangChain chat model you have already configured)
agent = create_react_agent(model, tools)
Use with Claude Desktop
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "pyfetcher": {
      "command": "pyfetcher-mcp",
      "args": []
    }
  }
}
Transport Backends
| Backend | Sync | Async | Stream | TLS Fingerprint | CF Bypass | Install |
|---|---|---|---|---|---|---|
| httpx | Y | Y | Y | - | - | (core) |
| aiohttp | - | Y | Y | - | - | (core) |
| curl_cffi | Y | Y | Y | Y | - | [curl] |
| cloudscraper | Y | - | - | - | Y | [cloudscraper] |
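The capability matrix above can also be expressed as data, which makes it easy to ask which backends satisfy a given set of requirements. The sketch below transcribes the table into a small lookup; the `Backend` dataclass and `candidates` helper are hypothetical and are not part of fetchkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    sync: bool
    async_: bool
    stream: bool
    tls_fingerprint: bool
    cf_bypass: bool


# Transcribed from the Transport Backends table.
BACKENDS = [
    Backend("httpx", True, True, True, False, False),
    Backend("aiohttp", False, True, True, False, False),
    Backend("curl_cffi", True, True, True, True, False),
    Backend("cloudscraper", True, False, False, False, True),
]


def candidates(**needs: bool) -> list[str]:
    """Names of backends that provide every capability passed as True."""
    wanted = [cap for cap, required in needs.items() if required]
    return [b.name for b in BACKENDS if all(getattr(b, cap) for cap in wanted)]
```

For example, `candidates(async_=True, tls_fingerprint=True)` narrows the choice to `curl_cffi`.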
Development
git clone https://github.com/pr1m8/pyfetcher.git
cd pyfetcher
make install-all # pdm install with all deps
make test # 358 tests
make check # format + lint + test
make infra-up && make migrate # start Postgres + MinIO
Makefile Targets
make help Show all targets
make install-all Install everything
make test Run 358 tests
make test-cov Tests with coverage report
make fmt Format with trunk
make lint Lint with trunk
make check Format + lint + test
make infra-up Start Postgres + MinIO
make infra-down Stop infrastructure
make migrate Run Alembic migrations
make pipeline Run crawl->scrape->download
make build Build wheel + sdist
make publish Publish to PyPI
make docs Build Sphinx docs
make clean Remove build artifacts
Documentation
Quick Start | Headers | Scraping | Pipeline | Infrastructure | CLI | API Reference
License
Download files
File details
Details for the file fetchkit-0.3.0.tar.gz.
File metadata
- Download URL: fetchkit-0.3.0.tar.gz
- Upload date:
- Size: 124.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf01dbddb672b0451b70d28a2bb7fd856581540fe77d84d54b344c731e871ba7 |
| MD5 | 4c79ae8d8d0cfcbda2a30385152bf92b |
| BLAKE2b-256 | c3d74a247ce6d25e95e467a828ba581a10a552c3f98473a21a2f8d6f02c7e63e |
File details
Details for the file fetchkit-0.3.0-py3-none-any.whl.
File metadata
- Download URL: fetchkit-0.3.0-py3-none-any.whl
- Upload date:
- Size: 122.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a92c59e0e3ccbd298e90646fc74c154ce643e8ea347032b1756903f56e51e90b |
| MD5 | 051de55bd75c2907d620b8a944a3f397 |
| BLAKE2b-256 | ea02508cc340d349f1821d726bbf79133a432186b689927044fcff34574d1b50 |