Advanced web fetching, scraping, and content acquisition toolkit with crawl-scrape-download pipeline

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.13
Topic
- Internet :: WWW/HTTP
- Software Development :: Libraries :: Python Modules
Typing
- Typed

Project description

fetchkit

Advanced web fetching, scraping, and content acquisition toolkit for Python.

From simple HTTP requests to full crawl-scrape-download pipelines backed by Postgres and MinIO.

Highlights

pip install fetchkit                   # Core: fetch, scrape, headers
pip install 'fetchkit[pipeline]'       # + Postgres job queue + MinIO storage
pip install 'fetchkit[full]'           # Everything including yt-dlp, Playwright, etc.

Fetch with realistic browser headers

from pyfetcher import fetch

response = fetch("https://example.com")
# Sends Chrome-like headers with Client Hints,
# Sec-Fetch-*, UA rotation automatically

Scrape anything

from pyfetcher.scrape import (
    extract_links, extract_text,
    extract_readable_text,
)

links = extract_links(html, base_url=url)
titles = extract_text(html, "h1")
article = extract_readable_text(html)

4 HTTP backends

from pyfetcher import FetchRequest, fetch

# TLS fingerprinting (bypass bot detection)
fetch(FetchRequest(url=url, backend="curl_cffi"))

# Cloudflare bypass
fetch(FetchRequest(url=url, backend="cloudscraper"))

Download media with yt-dlp

from pyfetcher.downloaders.ytdlp import YtdlpDownloader

dl = YtdlpDownloader()
info = await dl.extract_info(video_url)
results = await dl.download(video_url,
    output_dir="./media")

Features

Core Library

Feature	Description
Browser Headers	11 profiles (Chrome/Firefox/Safari/Edge) across 5 platforms. Consistent UA + Client Hints + Sec-Fetch-*. Market-share-weighted rotation.
4 Backends	`httpx` (default, HTTP/2), `aiohttp` (async), `curl_cffi` (TLS fingerprint), `cloudscraper` (CF bypass)
Rate Limiting	Per-domain + global token bucket with configurable burst
Retry	Exponential backoff via Tenacity with configurable status codes
Scraping	CSS selectors, link harvesting, form parsing, table extraction
Metadata	HTML meta, Open Graph, JSON-LD, microdata, RDFa, Dublin Core
CLI	`pyfetcher fetch`, `scrape`, `headers`, `user-agent`, `robots`, `download`
TUI	Interactive Textual terminal UI for building and inspecting requests
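
The Retry row above mentions exponential backoff with configurable status codes. As a rough illustration of that idea only (fetchkit delegates this to Tenacity, and the names `backoff_delays`, `call_with_retry`, and the `RETRY_STATUSES` set below are assumptions made for this sketch, not fetchkit's API), a stdlib-only version could look like this:

```python
import time
from typing import Callable, Iterable

# Assumed set of retryable statuses; the library's real defaults may differ.
RETRY_STATUSES = frozenset({429, 500, 502, 503, 504})


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**n, never more than `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def call_with_retry(
    send: Callable[[], int],
    max_retries: int = 3,
    base: float = 0.5,
    retry_statuses: Iterable[int] = RETRY_STATUSES,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    retryable = set(retry_statuses)
    status = send()
    for delay in backoff_delays(max_retries, base=base):
        if status not in retryable:
            break
        sleep(delay)  # wait longer after every failed attempt
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production version would normally add random jitter to the delays as well.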

Infrastructure (optional extras)

Feature	Extra	Description
Pipeline	`[pipeline]`	Event-driven Crawl -> Scrape -> Download via Postgres LISTEN/NOTIFY
Database	`[db]`	SQLAlchemy 2.0 async + Alembic. Jobs, pages, media, hosts, feeds, URL dedup
Object Store	`[store]`	MinIO/S3 via aioboto3. Upload, download, presigned URLs
Downloaders	`[downloaders]`	yt-dlp (progress hooks, info_dict) + gallery-dl (170+ sites)
Extractors	`[extractors]`	trafilatura + readability-lxml fallback, html2text, markdownify
Media	`[media]`	Audio (mutagen), video (pymediainfo), image (exifread), PDF (pypdf)
Browser	`[browser]`	Playwright + stealth for JS-heavy sites
Feeds	`[feeds]`	RSS/Atom monitoring with adaptive polling
Crawler	`[pipeline]`	URL frontier, spider + router, dedup, politeness, sitemap discovery

Installation

pip install fetchkit

All optional extras:

pip install 'fetchkit[tui]'            # Textual TUI
pip install 'fetchkit[curl]'           # curl_cffi TLS fingerprinting
pip install 'fetchkit[cloudscraper]'   # Cloudflare bypass
pip install 'fetchkit[db]'             # Postgres + SQLAlchemy + Alembic
pip install 'fetchkit[store]'          # MinIO/S3 object storage
pip install 'fetchkit[pipeline]'       # db + store (full pipeline)
pip install 'fetchkit[downloaders]'    # yt-dlp + gallery-dl
pip install 'fetchkit[extractors]'     # trafilatura, readability, html2text
pip install 'fetchkit[media]'          # Audio/video/image/PDF metadata
pip install 'fetchkit[browser]'        # Playwright + stealth
pip install 'fetchkit[feeds]'          # RSS/Atom feed parsing
pip install 'fetchkit[mcp]'            # MCP server (stdio or HTTP)
pip install 'fetchkit[full]'           # Everything

Quick Start

Fetch

from pyfetcher import fetch, afetch, FetchRequest
import asyncio

# Sync
response = fetch("https://example.com")
print(response.status_code, response.ok)

# Async
response = asyncio.run(afetch("https://example.com"))

Browser Profiles & Headers

from pyfetcher.headers.browser import BrowserHeaderProvider
from pyfetcher.headers.rotating import RotatingHeaderProvider
from pyfetcher.headers.ua import random_user_agent
from pyfetcher.fetch.service import FetchService

# Fixed profile (Chrome on Windows)
service = FetchService(header_provider=BrowserHeaderProvider("chrome_win"))

# Rotating profiles weighted by real-world market share
service = FetchService(header_provider=RotatingHeaderProvider())

# Just need a user-agent string?
ua = random_user_agent(browser="firefox", platform="macOS")
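
To make "weighted by real-world market share" concrete, here is a minimal sketch of weighted selection with the standard library. Only `chrome_win` appears in this README; the other profile names, the weights, and the helper `pick_profile` are hypothetical and exist purely for illustration.

```python
import random

# Hypothetical profile names and weights, for illustration only.
# fetchkit's real profile table and market shares are not reproduced here.
PROFILE_WEIGHTS = {
    "chrome_win": 0.45,
    "chrome_mac": 0.20,
    "safari_mac": 0.15,
    "firefox_win": 0.12,
    "edge_win": 0.08,
}


def pick_profile(rng: random.Random) -> str:
    """Pick one profile name, with probability proportional to its weight."""
    names = list(PROFILE_WEIGHTS)
    weights = list(PROFILE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

Over many requests, a heavily weighted profile such as `chrome_win` is chosen far more often than a lightly weighted one, which is the point of weighting the rotation rather than cycling uniformly.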

Scraping

from pyfetcher.scrape import (
    extract_links, extract_text, extract_table,
    extract_forms, extract_readable_text,
)
from pyfetcher.scrape.robots import parse_robots_txt, is_allowed

# CSS selectors
titles = extract_text(html, "h1.title")
rows = extract_table(html, "table.data")

# Links with internal/external classification
links = extract_links(html, base_url=url, same_domain_only=True)

# Forms with field extraction
forms = extract_forms(html, base_url=url)
print(forms[0].action, forms[0].to_dict())

# Robots.txt
rules = parse_robots_txt(robots_content)
allowed = is_allowed(rules, "/admin", user_agent="MyBot")
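
For readers who want to see what such a robots.txt check involves, the same kind of question can be answered with Python's standard `urllib.robotparser`. This is a comparison sketch, not fetchkit's implementation; the sample rules and the helper `path_allowed` are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Sample rules made up for this example.
robots_content = """\
User-agent: *
Disallow: /admin
Allow: /
"""


def path_allowed(robots_txt: str, path: str, user_agent: str = "MyBot") -> bool:
    """Return True if `user_agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With the sample rules, `/admin` and anything below it is refused, while other paths fall through to `Allow: /`.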

Rate-Limited Fetching

from pyfetcher.fetch.service import FetchService
from pyfetcher.ratelimit.limiter import DomainRateLimiter, RateLimitPolicy

limiter = DomainRateLimiter(
    default_policy=RateLimitPolicy(requests_per_second=2.0, burst=5),
    domain_policies={
        "api.example.com": RateLimitPolicy(requests_per_second=0.5),
    },
)
service = FetchService(rate_limiter=limiter)
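
A policy such as `requests_per_second=2.0, burst=5` maps naturally onto a token bucket. The sketch below shows a single bucket only; the class `TokenBucket` and its methods are hypothetical names, and how fetchkit combines per-domain and global buckets internally is not shown here.

```python
import time
from typing import Callable


class TokenBucket:
    """Single token bucket: refills at `rate` tokens/second up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so a burst is allowed at once
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-domain limiter can then be approximated as a dictionary mapping each host to its own bucket, with one extra bucket shared by all hosts for the global limit.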

Content Extraction

from pyfetcher.extractors.content import extract_article_text
from pyfetcher.extractors.convert import html_to_markdown, html_to_plaintext

# Article text (trafilatura with readability-lxml fallback)
article = extract_article_text(html, url="https://example.com/post")

# HTML -> Markdown
md = html_to_markdown(html)
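
As a point of reference for what HTML-to-plaintext conversion has to do, the following stdlib-only sketch drops `script`, `style`, and `nav` content and keeps the visible text. It is a simplified stand-in under those assumptions; the names `_TextExtractor` and `to_plaintext` are hypothetical, and the extractors shipped with fetchkit do considerably more.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside SKIP elements."""

    SKIP = {"script", "style", "nav"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def to_plaintext(html: str) -> str:
    """Return the visible text of `html`, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```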

yt-dlp & gallery-dl

import asyncio

from pyfetcher.downloaders.ytdlp import YtdlpDownloader
from pyfetcher.downloaders.gallerydl import GalleryDlDownloader

async def main():
    # yt-dlp with progress tracking
    url = "https://youtube.com/watch?v=dQw4w9WgXcQ"
    yt = YtdlpDownloader()
    info = await yt.extract_info(url)
    results = await yt.download(url, output_dir="./videos",
        progress_callback=lambda p: print(f"{p.status}: {p.percent}"))

    # gallery-dl for image galleries (170+ supported sites)
    gdl = GalleryDlDownloader()
    results = await gdl.download("https://imgur.com/gallery/...", output_dir="./images")

asyncio.run(main())

CLI

# Fetch with any backend
pyfetcher fetch https://example.com
pyfetcher fetch https://example.com -o json -b curl_cffi

# Preview generated headers
pyfetcher headers --profile chrome_win
pyfetcher headers --browser firefox -o json
pyfetcher headers --list

# Scrape content
pyfetcher scrape https://example.com --css "h1"
pyfetcher scrape https://example.com --links -o json
pyfetcher scrape https://example.com --text
pyfetcher scrape https://example.com --meta

# Random user-agents
pyfetcher user-agent --browser chrome --count 5
pyfetcher user-agent --mobile

# Check robots.txt
pyfetcher robots https://example.com -p /admin

# Download files
pyfetcher download https://example.com/file.pdf ./file.pdf

Pipeline

The event-driven pipeline connects three stages via Postgres LISTEN/NOTIFY:

Seeds / RSS / Sitemap
       |
  [Crawl Stage]  ──NOTIFY──>  [Scrape Stage]  ──NOTIFY──>  [Download Stage]
       |                             |                             |
       v                             v                             v
  pages table                 pages (enriched)              media_assets
  + new crawl jobs            + download jobs               + MinIO objects
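
The hand-off between stages in the diagram above can be modelled in a few lines. This sketch replaces Postgres LISTEN/NOTIFY with in-process `asyncio.Queue` objects, so it only illustrates the flow of work, not the durable job queue; every function name here is hypothetical.

```python
import asyncio


async def crawl_stage(seeds, to_scrape: asyncio.Queue) -> None:
    for url in seeds:
        await to_scrape.put(url)  # stands in for a NOTIFY to the scrape stage
    await to_scrape.put(None)     # sentinel: no more work


async def scrape_stage(to_scrape: asyncio.Queue, to_download: asyncio.Queue) -> None:
    while (url := await to_scrape.get()) is not None:
        # Pretend every scraped page references exactly one media file.
        await to_download.put(f"{url}/media.bin")
    await to_download.put(None)


async def download_stage(to_download: asyncio.Queue, stored: list) -> None:
    while (asset := await to_download.get()) is not None:
        stored.append(asset)      # stands in for an upload to object storage


async def run_pipeline(seeds) -> list:
    to_scrape, to_download = asyncio.Queue(), asyncio.Queue()
    stored: list = []
    await asyncio.gather(
        crawl_stage(seeds, to_scrape),
        scrape_stage(to_scrape, to_download),
        download_stage(to_download, stored),
    )
    return stored
```

Each stage only knows about the queue it reads from and the queue it writes to, which is the same decoupling the database-backed notifications provide across separate worker processes.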

Setup

make infra-up     # Start Postgres + MinIO
make migrate      # Run Alembic migrations
make pipeline     # Start all workers

Programmatic

import asyncio

from pyfetcher.pipeline.runner import PipelineRunner
from pyfetcher.config import PyfetcherConfig

async def main():
    runner = PipelineRunner(PyfetcherConfig(
        crawl_concurrency=10,
        scrape_concurrency=20,
        download_concurrency=5,
    ))
    await runner.start()

asyncio.run(main())

Custom Spiders

from pyfetcher.crawler.spider import Spider, SpiderResult

spider = Spider(name="my-spider")

@spider.router.add(r"/blog/\d{4}/")
async def handle_post(url, response):
    return SpiderResult(
        discovered_urls=[...],
        items=[{"title": "...", "content": "..."}],
    )
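
The decorator form `@spider.router.add(pattern)` suggests a regex-based dispatch table. A minimal sketch of that pattern follows, assuming first-match-wins semantics; the `Router` class below is hypothetical and is not fetchkit's router.

```python
import re
from typing import Callable, Optional


class Router:
    """Map URL patterns to handlers; the first matching pattern wins."""

    def __init__(self) -> None:
        self._routes: list[tuple[re.Pattern[str], Callable]] = []

    def add(self, pattern: str) -> Callable[[Callable], Callable]:
        compiled = re.compile(pattern)

        def decorator(handler: Callable) -> Callable:
            self._routes.append((compiled, handler))
            return handler  # leave the decorated function usable as-is

        return decorator

    def resolve(self, url: str) -> Optional[Callable]:
        for pattern, handler in self._routes:
            if pattern.search(url):
                return handler
        return None


router = Router()


@router.add(r"/blog/\d{4}/")
def handle_post(url: str) -> str:
    return f"post handler for {url}"
```

`router.resolve(url)` returns the registered handler for URLs containing a four-digit blog year and `None` for everything else, so unmatched URLs can fall back to a default handler.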

MCP Server (AI Agent Integration)

fetchkit ships with an MCP server that exposes its capabilities to AI agents (Claude, LangChain, LangGraph, and any other MCP-compatible client). An LLM can then fetch, scrape, extract, and download through tool calls, with no site-specific glue code.

Why MCP?

Traditional scraping requires writing code for every site. With fetchkit's MCP server, an AI agent can:

- Autonomously research topics by fetching pages, extracting content, and following links
- Audit websites by checking metadata, robots.txt, sitemaps, and page structure
- Extract structured data from any page using CSS selectors, table parsing, or article extraction
- Download media with progress tracking and checksum verification
- Generate realistic requests using browser profiles that pass bot detection

All 16 tools return structured Pydantic models so the LLM gets clean, typed data -- not raw HTML.

Quick Start

pip install 'fetchkit[mcp]'

# Run as stdio server (Claude Desktop / Claude Code)
pyfetcher-mcp

# Run as HTTP server (LangChain / remote agents)
pyfetcher-mcp --http 8000

# Or via Makefile
make mcp          # stdio
make mcp-http     # HTTP on port 8000

Available Tools (16)

Tool	What it does
`fetch_url`	Fetch any URL with browser headers, returns status + body + timing
`fetch_multiple`	Batch fetch with concurrency control
`scrape_css`	Extract content via CSS selectors
`scrape_links`	Harvest links with internal/external classification
`scrape_text`	Extract readable text (strips scripts, nav, etc.)
`scrape_metadata`	Title, description, Open Graph, favicons
`scrape_forms`	Parse forms with fields and default values
`scrape_table`	Extract HTML table data as rows
`check_robots`	Check robots.txt rules for any path
`parse_sitemap`	Parse XML sitemaps
`generate_headers`	Preview full browser header sets
`list_profiles`	Show all 11 browser profiles
`random_user_agent`	Generate random realistic UAs
`extract_article`	Article text + markdown via trafilatura
`convert_html`	HTML -> markdown or plaintext
`download_file`	Download with checksum verification
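
The internal/external classification reported by `scrape_links` can be approximated with `urllib.parse`. This is a sketch under the assumption that "internal" means "same host as the base URL"; the helper `classify_links` is hypothetical, and the tool's actual rules may differ (for example in how subdomains are treated).

```python
from urllib.parse import urljoin, urlparse


def classify_links(hrefs, base_url):
    """Split hrefs into (internal, external) absolute http(s) URLs."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        bucket = internal if parsed.netloc.lower() == base_host else external
        bucket.append(absolute)
    return internal, external
```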

Resources & Prompts

Resources expose data for context: pyfetcher://profiles, pyfetcher://backends, pyfetcher://version.

Prompts provide templates: web_research, site_audit, scrape_guide, compare_pages.

Use with LangChain

from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

client = MultiServerMCPClient({
    "pyfetcher": {"transport": "http", "url": "http://localhost:8000/mcp"}
})
# Inside an async function:
tools = await client.get_tools()  # 16 LangChain tools ready to use

# Build an agent (`model` is any LangChain chat model you have configured)
agent = create_react_agent(model, tools)

Use with Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "pyfetcher": {
      "command": "pyfetcher-mcp",
      "args": []
    }
  }
}

Transport Backends

Backend	Sync	Async	Stream	TLS Fingerprint	CF Bypass	Install
httpx	Y	Y	Y	-	-	(core)
aiohttp	-	Y	Y	-	-	(core)
curl_cffi	Y	Y	Y	Y	-	`[curl]`
cloudscraper	Y	-	-	-	Y	`[cloudscraper]`

Development

git clone https://github.com/pr1m8/pyfetcher.git
cd pyfetcher
make install-all              # pdm install with all deps
make test                     # 358 tests
make check                    # format + lint + test
make infra-up && make migrate # start Postgres + MinIO

Makefile Targets

make help          Show all targets
make install-all   Install everything
make test          Run 358 tests
make test-cov      Tests with coverage report
make fmt           Format with trunk
make lint          Lint with trunk
make check         Format + lint + test
make infra-up      Start Postgres + MinIO
make infra-down    Stop infrastructure
make migrate       Run Alembic migrations
make pipeline      Run crawl->scrape->download
make build         Build wheel + sdist
make publish       Publish to PyPI
make docs          Build Sphinx docs
make clean         Remove build artifacts

Documentation

pr1m8.github.io/pyfetcher

License

MIT

Release history

0.3.1 (Apr 1, 2026)
0.3.0 (Apr 1, 2026), this version
0.2.0 (Apr 1, 2026)

Download files

Source Distribution

fetchkit-0.3.0.tar.gz (124.6 kB)

Uploaded Apr 1, 2026 Source

Built Distribution

fetchkit-0.3.0-py3-none-any.whl (122.5 kB)

Uploaded Apr 1, 2026 Python 3

File details

Details for the file fetchkit-0.3.0.tar.gz.

File metadata

Download URL: fetchkit-0.3.0.tar.gz
Upload date: Apr 1, 2026
Size: 124.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0

File hashes

Hashes for fetchkit-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`bf01dbddb672b0451b70d28a2bb7fd856581540fe77d84d54b344c731e871ba7`
MD5	`4c79ae8d8d0cfcbda2a30385152bf92b`
BLAKE2b-256	`c3d74a247ce6d25e95e467a828ba581a10a552c3f98473a21a2f8d6f02c7e63e`

File details

Details for the file fetchkit-0.3.0-py3-none-any.whl.

File metadata

Download URL: fetchkit-0.3.0-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 122.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0

File hashes

Hashes for fetchkit-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a92c59e0e3ccbd298e90646fc74c154ce643e8ea347032b1756903f56e51e90b`
MD5	`051de55bd75c2907d620b8a944a3f397`
BLAKE2b-256	`ea02508cc340d349f1821d726bbf79133a432186b689927044fcff34574d1b50`
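
The digests listed above can be used to check a downloaded file before installing it. Below is a minimal sketch using `hashlib`; the helper names `sha256_of` and `verify` are hypothetical, and the commented usage assumes the wheel was saved in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the wheel is in the current directory):
# verify(
#     Path("fetchkit-0.3.0-py3-none-any.whl"),
#     "a92c59e0e3ccbd298e90646fc74c154ce643e8ea347032b1756903f56e51e90b",
# )
```

`pip` can enforce the same check during installation through its hash-checking mode, with the digest pinned in a requirements file.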

fetchkit 0.3.0
