
Advanced web fetching, scraping, and content acquisition toolkit with crawl-scrape-download pipeline


fetchkit

Advanced web fetching, scraping, and content acquisition toolkit for Python.

From simple HTTP requests to full crawl-scrape-download pipelines backed by Postgres and MinIO.



Installation | Quick Start | CLI | Pipeline | Documentation | Examples

Highlights

pip install fetchkit                   # Core: fetch, scrape, headers
pip install 'fetchkit[pipeline]'       # + Postgres job queue + MinIO storage
pip install 'fetchkit[full]'           # Everything including yt-dlp, Playwright, etc.

Fetch with realistic browser headers

from pyfetcher import fetch

response = fetch("https://example.com")
# Sends Chrome-like headers with Client Hints,
# Sec-Fetch-*, UA rotation automatically

Scrape anything

from pyfetcher.scrape import (
    extract_links, extract_text,
    extract_readable_text,
)

links = extract_links(html, base_url=url)
titles = extract_text(html, "h1")
article = extract_readable_text(html)
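To make the idea behind link harvesting concrete, here is a minimal standard-library sketch of resolving links against a base URL and tagging each one as internal or external. It is not fetchkit's implementation: `classify_links` and `_LinkCollector` are hypothetical names, and treating "same hostname" as "internal" is an assumption.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_links(html: str, base_url: str) -> List[Tuple[str, bool]]:
    """Return (absolute_url, is_internal) pairs for http(s) links only."""
    collector = _LinkCollector()
    collector.feed(html)
    base_host = urlsplit(base_url).hostname
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        links.append((absolute, parts.hostname == base_host))
    return links
```

Filtering the result on the boolean flag gives the equivalent of a same-domain-only harvest.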

4 HTTP backends

from pyfetcher import FetchRequest, fetch

# TLS fingerprinting (bypass bot detection)
fetch(FetchRequest(url=url, backend="curl_cffi"))

# Cloudflare bypass
fetch(FetchRequest(url=url, backend="cloudscraper"))

Download media with yt-dlp

from pyfetcher.downloaders.ytdlp import YtdlpDownloader

dl = YtdlpDownloader()
info = await dl.extract_info(video_url)
results = await dl.download(video_url,
    output_dir="./media")

Features

Core Library

| Feature | Description |
| --- | --- |
| Browser Headers | 11 profiles (Chrome/Firefox/Safari/Edge) across 5 platforms. Consistent UA + Client Hints + Sec-Fetch-*. Market-share-weighted rotation. |
| 4 Backends | httpx (default, HTTP/2), aiohttp (async), curl_cffi (TLS fingerprint), cloudscraper (CF bypass) |
| Rate Limiting | Per-domain + global token bucket with configurable burst |
| Retry | Exponential backoff via Tenacity with configurable status codes |
| Scraping | CSS selectors, link harvesting, form parsing, table extraction |
| Metadata | HTML meta, Open Graph, JSON-LD, microdata, RDFa, Dublin Core |
| CLI | pyfetcher fetch, scrape, headers, user-agent, robots, download |
| TUI | Interactive Textual terminal UI for building and inspecting requests |
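The retry feature listed above is described as exponential backoff with configurable status codes. The sketch below illustrates that general technique with the standard library only; it does not use Tenacity and is not fetchkit's code. The names `backoff_delays`, `RetryableStatus`, and `call_with_retry`, and the default status codes, are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 30.0) -> List[float]:
    """Delay before each retry: base * factor**n, capped at max_delay."""
    return [min(base * factor ** n, max_delay) for n in range(attempts)]


class RetryableStatus(Exception):
    """Raised by the wrapped function to report an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"status {status}")
        self.status = status


def call_with_retry(func: Callable[[], T], retries: int = 3,
                    retry_on: Iterable[int] = (429, 500, 502, 503, 504),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying when it raises RetryableStatus with a listed code."""
    retryable = set(retry_on)
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            return func()
        except RetryableStatus as exc:
            # Re-raise on non-retryable codes or once retries are exhausted.
            if exc.status not in retryable or attempt == retries:
                raise
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Production retry logic usually also adds random jitter to the delays.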

Infrastructure (optional extras)

| Feature | Extra | Description |
| --- | --- | --- |
| Pipeline | [pipeline] | Event-driven Crawl -> Scrape -> Download via Postgres LISTEN/NOTIFY |
| Database | [db] | SQLAlchemy 2.0 async + Alembic. Jobs, pages, media, hosts, feeds, URL dedup |
| Object Store | [store] | MinIO/S3 via aioboto3. Upload, download, presigned URLs |
| Downloaders | [downloaders] | yt-dlp (progress hooks, info_dict) + gallery-dl (170+ sites) |
| Extractors | [extractors] | trafilatura + readability-lxml fallback, html2text, markdownify |
| Media | [media] | Audio (mutagen), video (pymediainfo), image (exifread), PDF (pypdf) |
| Browser | [browser] | Playwright + stealth for JS-heavy sites |
| Feeds | [feeds] | RSS/Atom monitoring with adaptive polling |
| Crawler | [pipeline] | URL frontier, spider + router, dedup, politeness, sitemap discovery |
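The crawler row mentions a URL frontier with dedup. As a rough illustration of that idea, the sketch below normalizes URLs and keeps a seen-set in memory. It is an assumption-laden example, not fetchkit's frontier (which the table says is backed by the database); `normalize_url` and `Frontier` are hypothetical names and the normalization rules are one reasonable choice among many.

```python
from collections import deque
from typing import Deque, Optional, Set
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Canonical form used as the dedup key (userinfo is dropped)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, path, query, ""))  # fragment removed


class Frontier:
    """FIFO queue of URLs that silently rejects already-seen entries."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._queue: Deque[str] = deque()

    def add(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def pop(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None
```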

Installation

pip install fetchkit

All optional extras:

pip install 'fetchkit[tui]'            # Textual TUI
pip install 'fetchkit[curl]'           # curl_cffi TLS fingerprinting
pip install 'fetchkit[cloudscraper]'   # Cloudflare bypass
pip install 'fetchkit[db]'             # Postgres + SQLAlchemy + Alembic
pip install 'fetchkit[store]'          # MinIO/S3 object storage
pip install 'fetchkit[pipeline]'       # db + store (full pipeline)
pip install 'fetchkit[downloaders]'    # yt-dlp + gallery-dl
pip install 'fetchkit[extractors]'     # trafilatura, readability, html2text
pip install 'fetchkit[media]'          # Audio/video/image/PDF metadata
pip install 'fetchkit[browser]'        # Playwright + stealth
pip install 'fetchkit[feeds]'          # RSS/Atom feed parsing
pip install 'fetchkit[mcp]'            # MCP server for AI agents
pip install 'fetchkit[full]'           # Everything

Quick Start

Fetch

from pyfetcher import fetch, afetch, FetchRequest
import asyncio

# Sync
response = fetch("https://example.com")
print(response.status_code, response.ok)

# Async
response = asyncio.run(afetch("https://example.com"))

Browser Profiles & Headers

from pyfetcher.headers.browser import BrowserHeaderProvider
from pyfetcher.headers.rotating import RotatingHeaderProvider
from pyfetcher.headers.ua import random_user_agent
from pyfetcher.fetch.service import FetchService

# Fixed profile (Chrome on Windows)
service = FetchService(header_provider=BrowserHeaderProvider("chrome_win"))

# Rotating profiles weighted by real-world market share
service = FetchService(header_provider=RotatingHeaderProvider())

# Just need a user-agent string?
ua = random_user_agent(browser="firefox", platform="macOS")
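Weighted rotation can be pictured as a weighted random draw over profile names. The following sketch shows the technique with `random.choices`; the weights are made-up placeholders rather than real market-share figures, and apart from `chrome_win` the profile names are hypothetical.

```python
import random
from typing import Optional

# Placeholder weights for illustration only. Only "chrome_win" is a
# profile name taken from this README; the others are assumed.
PROFILE_WEIGHTS = {
    "chrome_win": 50.0,
    "safari_mac": 20.0,
    "firefox_win": 10.0,
}


def pick_profile(rng: Optional[random.Random] = None) -> str:
    """Draw one profile name with probability proportional to its weight."""
    rng = rng or random.Random()
    names = list(PROFILE_WEIGHTS)
    weights = list(PROFILE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` makes the sequence reproducible, which is handy in tests.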

Scraping

from pyfetcher.scrape import (
    extract_links, extract_text, extract_table,
    extract_forms, extract_readable_text,
)
from pyfetcher.scrape.robots import parse_robots_txt, is_allowed

# CSS selectors
titles = extract_text(html, "h1.title")
rows = extract_table(html, "table.data")

# Links with internal/external classification
links = extract_links(html, base_url=url, same_domain_only=True)

# Forms with field extraction
forms = extract_forms(html, base_url=url)
print(forms[0].action, forms[0].to_dict())

# Robots.txt
rules = parse_robots_txt(robots_content)
allowed = is_allowed(rules, "/admin", user_agent="MyBot")
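For comparison, the same kind of robots.txt check can be done with Python's standard `urllib.robotparser`. This is shown only to illustrate what such a check evaluates; it is not how `parse_robots_txt` or `is_allowed` are implemented, and the sample rules are invented.

```python
from urllib.robotparser import RobotFileParser

# Invented sample rules for the demonstration.
robots_content = """\
User-agent: *
Disallow: /admin
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_content.splitlines())

# Rules are matched by path prefix, in file order, for the given user agent.
admin_ok = parser.can_fetch("MyBot", "https://example.com/admin")
blog_ok = parser.can_fetch("MyBot", "https://example.com/blog/post")
```

With these rules `admin_ok` is false and `blog_ok` is true, because `/admin` hits the `Disallow` line before the catch-all `Allow`.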

Rate-Limited Fetching

from pyfetcher.fetch.service import FetchService
from pyfetcher.ratelimit.limiter import DomainRateLimiter, RateLimitPolicy

limiter = DomainRateLimiter(
    default_policy=RateLimitPolicy(requests_per_second=2.0, burst=5),
    domain_policies={
        "api.example.com": RateLimitPolicy(requests_per_second=0.5),
    },
)
service = FetchService(rate_limiter=limiter)
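A token bucket is the mechanism named in the feature list for rate limiting. The sketch below shows a per-domain variant in plain Python so the `requests_per_second` and `burst` parameters are easier to reason about. It is not the `DomainRateLimiter` source: `TokenBucket` and `PerDomainLimiter` are hypothetical names, the global bucket is omitted, and the non-blocking `allow` method is an assumed interface.

```python
import time
from typing import Callable, Dict, Optional, Tuple
from urllib.parse import urlsplit


class TokenBucket:
    """Holds up to `burst` tokens and refills at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int, clock: Callable[[], float]) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerDomainLimiter:
    """One bucket per hostname, with optional (rate, burst) overrides."""

    def __init__(self, rate: float, burst: int,
                 overrides: Optional[Dict[str, Tuple[float, int]]] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.default = (rate, burst)
        self.overrides = dict(overrides or {})
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, url: str) -> bool:
        host = urlsplit(url).hostname or ""
        bucket = self.buckets.get(host)
        if bucket is None:
            rate, burst = self.overrides.get(host, self.default)
            bucket = self.buckets[host] = TokenBucket(rate, burst, self.clock)
        return bucket.try_acquire()
```

With `rate=2.0` and `burst=5`, five requests to one host pass immediately, the sixth is refused, and one more is allowed after half a second.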

Content Extraction

from pyfetcher.extractors.content import extract_article_text
from pyfetcher.extractors.convert import html_to_markdown, html_to_plaintext

# Article text (trafilatura with readability-lxml fallback)
article = extract_article_text(html, url="https://example.com/post")

# HTML -> Markdown
md = html_to_markdown(html)
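As a point of reference for what plain-text conversion involves, here is a small standard-library sketch that drops script and style content and decodes entities. It is far simpler than trafilatura or html2text and is not fetchkit's `html_to_plaintext`; `strip_html_to_text` is a hypothetical name.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring non-content elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode &amp; etc.
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def strip_html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)
```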

yt-dlp & gallery-dl

from pyfetcher.downloaders.ytdlp import YtdlpDownloader
from pyfetcher.downloaders.gallerydl import GalleryDlDownloader

# yt-dlp with progress tracking
# (run inside an async function; top-level await is not valid in a script)
yt = YtdlpDownloader()
video_url = "https://youtube.com/watch?v=dQw4w9WgXcQ"
info = await yt.extract_info(video_url)
results = await yt.download(video_url, output_dir="./videos",
    progress_callback=lambda p: print(f"{p.status}: {p.percent}"))

# gallery-dl for image galleries (170+ supported sites)
gdl = GalleryDlDownloader()
results = await gdl.download("https://imgur.com/gallery/...", output_dir="./images")

CLI

# Fetch with any backend
pyfetcher fetch https://example.com
pyfetcher fetch https://example.com -o json -b curl_cffi

# Preview generated headers
pyfetcher headers --profile chrome_win
pyfetcher headers --browser firefox -o json
pyfetcher headers --list

# Scrape content
pyfetcher scrape https://example.com --css "h1"
pyfetcher scrape https://example.com --links -o json
pyfetcher scrape https://example.com --text
pyfetcher scrape https://example.com --meta

# Random user-agents
pyfetcher user-agent --browser chrome --count 5
pyfetcher user-agent --mobile

# Check robots.txt
pyfetcher robots https://example.com -p /admin

# Download files
pyfetcher download https://example.com/file.pdf ./file.pdf

Pipeline

The event-driven pipeline connects three stages via Postgres LISTEN/NOTIFY:

Seeds / RSS / Sitemap
       |
  [Crawl Stage]  ──NOTIFY──>  [Scrape Stage]  ──NOTIFY──>  [Download Stage]
       |                             |                             |
       v                             v                             v
  pages table                 pages (enriched)              media_assets
  + new crawl jobs            + download jobs               + MinIO objects
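The hand-off between stages in the diagram can be modeled in a few lines. In this sketch `asyncio.Queue` objects stand in for the Postgres job tables and LISTEN/NOTIFY wake-ups, so it only illustrates the flow of a job through crawl, scrape, and download. The `stage` and `run_pipeline` names are hypothetical and no real fetching happens.

```python
import asyncio
from typing import List, Optional, Tuple


async def stage(name: str, inbox: asyncio.Queue,
                outbox: Optional[asyncio.Queue],
                log: List[Tuple[str, str]]) -> None:
    """Consume jobs from inbox, record them, and pass them downstream."""
    while True:
        job = await inbox.get()
        if job is None:  # sentinel: stop and tell the next stage to stop
            if outbox is not None:
                await outbox.put(None)
            return
        log.append((name, job))  # a real worker would do its stage's work here
        if outbox is not None:
            await outbox.put(job)  # plays the role of NOTIFY


async def run_pipeline(seeds: List[str]) -> List[Tuple[str, str]]:
    crawl_q: asyncio.Queue = asyncio.Queue()
    scrape_q: asyncio.Queue = asyncio.Queue()
    download_q: asyncio.Queue = asyncio.Queue()
    log: List[Tuple[str, str]] = []
    workers = [
        asyncio.create_task(stage("crawl", crawl_q, scrape_q, log)),
        asyncio.create_task(stage("scrape", scrape_q, download_q, log)),
        asyncio.create_task(stage("download", download_q, None, log)),
    ]
    for url in seeds:
        await crawl_q.put(url)
    await crawl_q.put(None)
    await asyncio.gather(*workers)
    return log
```

Calling `asyncio.run(run_pipeline([...]))` returns the list of `(stage, job)` pairs in the order they were processed. A database-backed queue adds durability and multi-process workers, which in-memory queues cannot provide.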

Setup

make infra-up     # Start Postgres + MinIO
make migrate      # Run Alembic migrations
make pipeline     # Start all workers

Programmatic

from pyfetcher.pipeline.runner import PipelineRunner
from pyfetcher.config import PyfetcherConfig

runner = PipelineRunner(PyfetcherConfig(
    crawl_concurrency=10,
    scrape_concurrency=20,
    download_concurrency=5,
))
await runner.start()

Custom Spiders

from pyfetcher.crawler.spider import Spider, SpiderResult

spider = Spider(name="my-spider")

@spider.router.add(r"/blog/\d{4}/")
async def handle_post(url, response):
    return SpiderResult(
        discovered_urls=[...],
        items=[{"title": "...", "content": "..."}],
    )
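The decorator in the spider example registers a handler under a regular expression. One way such a router could work is sketched below. This is a guess at the general pattern, not fetchkit's `Spider.router`: the `Router` class, matching against the URL path with `re.search`, and first-match-wins ordering are all assumptions.

```python
import re
from typing import Callable, List, Optional, Pattern, Tuple
from urllib.parse import urlsplit

Handler = Callable[[str], str]


class Router:
    """Maps URL path patterns to handler functions, first match wins."""

    def __init__(self) -> None:
        self._routes: List[Tuple[Pattern[str], Handler]] = []

    def add(self, pattern: str) -> Callable[[Handler], Handler]:
        """Decorator that registers the function for the given regex."""
        def decorator(func: Handler) -> Handler:
            self._routes.append((re.compile(pattern), func))
            return func
        return decorator

    def resolve(self, url: str) -> Optional[Handler]:
        path = urlsplit(url).path
        for regex, handler in self._routes:
            if regex.search(path):
                return handler
        return None


router = Router()


@router.add(r"/blog/\d{4}/")
def handle_post(url: str) -> str:
    return f"post:{url}"
```

`router.resolve(url)` returns `handle_post` for paths such as `/blog/2024/hello` and `None` when no pattern matches.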

MCP Server (AI Agent Integration)

fetchkit ships with an MCP server that exposes its capabilities to AI agents (Claude, LangChain, LangGraph, and any MCP-compatible client). LLMs can then fetch, scrape, extract, and download without custom integration code.

Why MCP?

Traditional scraping requires writing code for every site. With fetchkit's MCP server, an AI agent can:

  • Autonomously research topics by fetching pages, extracting content, and following links
  • Audit websites by checking metadata, robots.txt, sitemaps, and page structure
  • Extract structured data from any page using CSS selectors, table parsing, or article extraction
  • Download media with progress tracking and checksum verification
  • Generate realistic requests using browser profiles that pass bot detection

All 16 tools return structured Pydantic models so the LLM gets clean, typed data -- not raw HTML.

Quick Start

pip install 'fetchkit[mcp]'

# Run as stdio server (Claude Desktop / Claude Code)
pyfetcher-mcp

# Run as HTTP server (LangChain / remote agents)
pyfetcher-mcp --http 8000

# Or via Makefile
make mcp          # stdio
make mcp-http     # HTTP on port 8000

Available Tools (16)

| Tool | What it does |
| --- | --- |
| fetch_url | Fetch any URL with browser headers, returns status + body + timing |
| fetch_multiple | Batch fetch with concurrency control |
| scrape_css | Extract content via CSS selectors |
| scrape_links | Harvest links with internal/external classification |
| scrape_text | Extract readable text (strips scripts, nav, etc.) |
| scrape_metadata | Title, description, Open Graph, favicons |
| scrape_forms | Parse forms with fields and default values |
| scrape_table | Extract HTML table data as rows |
| check_robots | Check robots.txt rules for any path |
| parse_sitemap | Parse XML sitemaps |
| generate_headers | Preview full browser header sets |
| list_profiles | Show all 11 browser profiles |
| random_user_agent | Generate random realistic UAs |
| extract_article | Article text + markdown via trafilatura |
| convert_html | HTML -> markdown or plaintext |
| download_file | Download with checksum verification |
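Checksum verification, as mentioned for `download_file`, generally means hashing the received bytes and comparing the digest with an expected value. The sketch below shows that step with `hashlib`. The function names are hypothetical, and the choice of SHA-256 is an assumption about the tool.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large files need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_stream(stream: BinaryIO, expected_sha256: str) -> bool:
    """True when the stream's SHA-256 matches the expected hex digest."""
    return sha256_of_stream(stream) == expected_sha256.strip().lower()


# Demonstration with in-memory bytes standing in for a downloaded file.
payload = b"hello world"
expected = hashlib.sha256(payload).hexdigest()
ok = verify_stream(io.BytesIO(payload), expected)
tampered_ok = verify_stream(io.BytesIO(payload + b"!"), expected)
```

The same function works on a file opened with `open(path, "rb")`.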

Resources & Prompts

Resources expose data for context: pyfetcher://profiles, pyfetcher://backends, pyfetcher://version.

Prompts provide templates: web_research, site_audit, scrape_guide, compare_pages.

Use with LangChain

from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient({
    "pyfetcher": {"transport": "http", "url": "http://localhost:8000/mcp"}
})
tools = await client.get_tools()  # 16 LangChain tools ready to use

# Build an agent
from langgraph.prebuilt import create_react_agent
agent = create_react_agent(model, tools)

Use with Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "pyfetcher": {
      "command": "pyfetcher-mcp",
      "args": []
    }
  }
}

Transport Backends

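| Backend | Sync | Async | Stream | TLS Fingerprint | CF Bypass | Install |
| --- | --- | --- | --- | --- | --- | --- |
| httpx | Y | Y | Y | - | - | (core) |
| aiohttp | - | Y | Y | - | - | (core) |
| curl_cffi | Y | Y | Y | Y | - | [curl] |
| cloudscraper | Y | - | - | - | Y | [cloudscraper] |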

Development

git clone https://github.com/pr1m8/pyfetcher.git
cd pyfetcher
make install-all              # pdm install with all deps
make test                     # 358 tests
make check                    # format + lint + test
make infra-up && make migrate # start Postgres + MinIO

Makefile Targets

make help          Show all targets
make install-all   Install everything
make test          Run 358 tests
make test-cov      Tests with coverage report
make fmt           Format with trunk
make lint          Lint with trunk
make check         Format + lint + test
make infra-up      Start Postgres + MinIO
make infra-down    Stop infrastructure
make migrate       Run Alembic migrations
make pipeline      Run crawl->scrape->download
make build         Build wheel + sdist
make publish       Publish to PyPI
make docs          Build Sphinx docs
make clean         Remove build artifacts

Documentation

License

MIT



Download files


Source Distribution

fetchkit-0.3.0.tar.gz (124.6 kB view details)

Uploaded Source

Built Distribution


fetchkit-0.3.0-py3-none-any.whl (122.5 kB view details)

Uploaded Python 3

File details

Details for the file fetchkit-0.3.0.tar.gz.

File metadata

  • Download URL: fetchkit-0.3.0.tar.gz
  • Upload date:
  • Size: 124.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0

File hashes

Hashes for fetchkit-0.3.0.tar.gz
Algorithm Hash digest
SHA256 bf01dbddb672b0451b70d28a2bb7fd856581540fe77d84d54b344c731e871ba7
MD5 4c79ae8d8d0cfcbda2a30385152bf92b
BLAKE2b-256 c3d74a247ce6d25e95e467a828ba581a10a552c3f98473a21a2f8d6f02c7e63e


File details

Details for the file fetchkit-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: fetchkit-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 122.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0

File hashes

Hashes for fetchkit-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a92c59e0e3ccbd298e90646fc74c154ce643e8ea347032b1756903f56e51e90b
MD5 051de55bd75c2907d620b8a944a3f397
BLAKE2b-256 ea02508cc340d349f1821d726bbf79133a432186b689927044fcff34574d1b50

