pyfetcher

CI | Docs | Python 3.13+ | License: MIT | Ruff | PDM

Advanced web fetching, scraping, and content acquisition toolkit for Python. From simple HTTP requests to full crawl-scrape-download pipelines backed by Postgres and MinIO.

Features

Core

  • Realistic browser headers -- 11 browser profiles (Chrome, Firefox, Safari, Edge) across platforms with consistent UA, Client Hints, and Sec-Fetch-* headers. Market-share-weighted rotation.
  • 4 HTTP backends -- httpx, aiohttp, curl_cffi (TLS fingerprinting), cloudscraper (Cloudflare bypass).
  • Rate limiting -- Per-domain and global token bucket rate limiting.
  • Retry -- Configurable exponential backoff with retryable status codes via Tenacity.
  • Scraping -- CSS selectors, link harvesting, form parsing, robots.txt, sitemap parsing, content extraction.
  • Metadata -- HTML meta, Open Graph, JSON-LD, microdata, RDFa, Dublin Core.
  • CLI & TUI -- pyfetcher CLI with 6 commands + interactive Textual TUI.
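
The retry feature listed above pairs exponential backoff with a set of retryable status codes. As a rough illustration of that idea only, here is a standard-library sketch; it is not pyfetcher's Tenacity-based implementation. The names `RETRYABLE_STATUS`, `backoff_delays`, and `retry_with_backoff`, the default delays, and the injectable `sleep` parameter are all assumptions made for this example.

```python
import time
from typing import Callable

# Assumption: a commonly used set of retryable codes; pyfetcher's defaults may differ.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def backoff_delays(attempts: int, base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * factor**n, never more than `cap`."""
    return [min(cap, base * factor**n) for n in range(attempts)]


def retry_with_backoff(
    send: Callable[[], int],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    status = send()
    for delay in backoff_delays(max_retries):
        if status not in RETRYABLE_STATUS:
            break
        sleep(delay)  # wait longer after each failed attempt
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production setup would more likely add jitter to the delays as well.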

Infrastructure (optional)

  • Event-driven pipeline -- Crawl -> Scrape -> Download stages via Postgres LISTEN/NOTIFY.
  • Database -- SQLAlchemy 2.0 async + Alembic migrations. Models for jobs, pages, media, hosts, feeds, URL dedup.
  • Object storage -- MinIO/S3 via aioboto3 with presigned URLs and key generation.
  • Downloaders -- Deep yt-dlp integration (progress hooks, info_dict), gallery-dl (job API), direct HTTP streaming.
  • Extractors -- trafilatura + readability-lxml fallback, html2text, markdownify, media metadata (audio/video/image/PDF).
  • Crawler -- URL frontier with dedup, spider + router, politeness enforcement, RSS/Atom feed monitoring.
  • Docker Compose -- Postgres 17 + MinIO with health checks, .env config, Alembic migrations.
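
The crawler bullet above mentions a URL frontier with dedup. A minimal in-memory sketch of that concept might look like the following; pyfetcher's own frontier is described as database-backed, so treat this purely as an illustration. `normalize_url` and `Frontier` are hypothetical names, and the normalization rules (drop the fragment, lowercase scheme and host, default the path to `/`) are assumptions.

```python
from collections import deque
from typing import Optional
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings dedupe together."""
    url, _fragment = urldefrag(url)  # fragments are never sent to the server
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class Frontier:
    """FIFO queue of URLs that admits each normalized URL at most once."""

    def __init__(self) -> None:
        self._queue: deque = deque()
        self._seen: set = set()

    def add(self, url: str) -> bool:
        """Queue the URL and return True, or return False if already seen."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the frontier is empty."""
        return self._queue.popleft() if self._queue else None
```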

Installation

pip install pyfetcher

Optional dependency groups:

pip install 'pyfetcher[tui]'          # Textual TUI
pip install 'pyfetcher[metadata]'      # extruct structured data
pip install 'pyfetcher[curl]'          # curl_cffi TLS fingerprinting
pip install 'pyfetcher[cloudscraper]'  # Cloudflare bypass
pip install 'pyfetcher[db]'            # Postgres + SQLAlchemy + Alembic
pip install 'pyfetcher[store]'         # MinIO/S3 object storage
pip install 'pyfetcher[pipeline]'      # db + store (full pipeline)
pip install 'pyfetcher[downloaders]'   # yt-dlp + gallery-dl
pip install 'pyfetcher[extractors]'    # trafilatura, readability, html2text
pip install 'pyfetcher[media]'         # mutagen, pymediainfo, exifread, pypdf
pip install 'pyfetcher[browser]'       # Playwright + stealth
pip install 'pyfetcher[feeds]'         # feedparser + dateparser
pip install 'pyfetcher[full]'          # Everything

Quick Start

Fetch a URL

from pyfetcher import fetch

response = fetch("https://example.com")
print(response.status_code, response.ok)

Async Fetch

import asyncio
from pyfetcher import afetch

async def main():
    response = await afetch("https://example.com")
    print(response.status_code)

asyncio.run(main())

Browser Profiles

from pyfetcher.headers.browser import BrowserHeaderProvider
from pyfetcher.headers.rotating import RotatingHeaderProvider
from pyfetcher.fetch.service import FetchService

# Fixed profile
service = FetchService(header_provider=BrowserHeaderProvider("chrome_win"))

# Rotating profiles (weighted by market share)
service = FetchService(header_provider=RotatingHeaderProvider())
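
To make "weighted by market share" concrete, the sketch below shows one plausible way to pick a profile name with probability proportional to a weight, using only the standard library. It is not `RotatingHeaderProvider`'s implementation. Apart from `chrome_win`, the profile names are hypothetical, and the weights are invented for illustration.

```python
import random

# Hypothetical names and weights for illustration only; not pyfetcher's profile table.
PROFILE_WEIGHTS = {
    "chrome_win": 0.6,
    "safari_mac": 0.2,
    "firefox_win": 0.1,
    "edge_win": 0.1,
}


def pick_profile(rng: random.Random) -> str:
    """Pick one profile name with probability proportional to its weight."""
    names = list(PROFILE_WEIGHTS)
    weights = list(PROFILE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

Calling `pick_profile(random.Random())` once per request would, over many requests, yield a mix of profiles close to the configured weights.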

Scraping

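from pyfetcher.scrape import extract_links, extract_text, extract_readable_text

# Any HTML string works here, e.g. the body of a fetched page.
html = '<html><body><h1>Hello</h1><a href="/about">About</a></body></html>'

links = extract_links(html, base_url="https://example.com")
headings = extract_text(html, "h1")  # "h1" is a CSS selector
content = extract_readable_text(html)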
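
For readers curious what link harvesting with a base URL involves, here is a standard-library sketch of the general technique: collect `href` values from anchor tags and resolve them against the base. It is not how `extract_links` is implemented, and `LinkCollector` and `collect_links` are hypothetical names used only in this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without an href
            self.links.append(urljoin(self.base_url, href))


def collect_links(html: str, base_url: str) -> list:
    """Return every anchor target in `html`, resolved against `base_url`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```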

Rate-Limited Fetching

from pyfetcher.fetch.service import FetchService
from pyfetcher.ratelimit.limiter import DomainRateLimiter, RateLimitPolicy

limiter = DomainRateLimiter(
    default_policy=RateLimitPolicy(requests_per_second=2.0, burst=5),
    domain_policies={"api.example.com": RateLimitPolicy(requests_per_second=0.5)},
)
service = FetchService(rate_limiter=limiter)
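
The `requests_per_second` and `burst` settings above map naturally onto a token bucket: tokens refill at a steady rate up to a maximum, and each request spends one. The following is a minimal sketch of that algorithm under stated assumptions, not `DomainRateLimiter`'s actual code. `TokenBucket`, `try_acquire`, and the injectable `clock` are hypothetical names.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `burst` requests, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Spend one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-domain limiter could then keep one such bucket per hostname, falling back to a default bucket for hosts without their own policy.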

Content Extraction

from pyfetcher.extractors.content import extract_article_text
from pyfetcher.extractors.convert import html_to_markdown

# html: an HTML document string, e.g. the body of a fetched page
text = extract_article_text(html, url="https://example.com/article")
markdown = html_to_markdown(html)

yt-dlp Integration

import asyncio
from pyfetcher.downloaders.ytdlp import YtdlpDownloader

async def main():
    dl = YtdlpDownloader()
    info = await dl.extract_info("https://youtube.com/watch?v=...")
    results = await dl.download("https://youtube.com/watch?v=...", output_dir="./downloads")

asyncio.run(main())

Pipeline (Crawl -> Scrape -> Download)

import asyncio
from pyfetcher.pipeline.runner import PipelineRunner
from pyfetcher.config import PyfetcherConfig

async def main():
    runner = PipelineRunner(PyfetcherConfig())
    await runner.start()  # Runs all 3 stages with Postgres job queue

asyncio.run(main())
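
To illustrate the stage-to-stage handoff without a database, the sketch below wires three workers together with `asyncio.Queue` objects standing in for the Postgres-backed job queue. This is an assumption-laden toy and not `PipelineRunner`'s design: `stage` and `run_pipeline` are hypothetical names, and each stage only records that it saw the item.

```python
import asyncio


async def stage(name, inbox, outbox, log):
    """Consume items from `inbox`, record the work, and hand them to `outbox`."""
    while True:
        item = await inbox.get()
        if item is None:  # sentinel: pass shutdown downstream, then stop
            if outbox is not None:
                await outbox.put(None)
            return
        log.append((name, item))
        if outbox is not None:
            await outbox.put(item)


async def run_pipeline(urls):
    """Push URLs through crawl -> scrape -> download and return the work log."""
    crawl_q, scrape_q, download_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    log = []
    workers = [
        asyncio.create_task(stage("crawl", crawl_q, scrape_q, log)),
        asyncio.create_task(stage("scrape", scrape_q, download_q, log)),
        asyncio.create_task(stage("download", download_q, None, log)),
    ]
    for url in urls:
        await crawl_q.put(url)
    await crawl_q.put(None)  # no more work
    await asyncio.gather(*workers)
    return log
```

In an event-driven setup, the queue handoff would be replaced by a durable job row plus a notification, so that workers in separate processes can pick up each stage's output.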

CLI

pyfetcher fetch https://example.com
pyfetcher fetch https://example.com -o json -b curl_cffi
pyfetcher headers --profile chrome_win
pyfetcher headers --list
pyfetcher scrape https://example.com --css "h1"
pyfetcher scrape https://example.com --links -o json
pyfetcher scrape https://example.com --text
pyfetcher user-agent --browser chrome --count 5
pyfetcher robots https://example.com -p /admin
pyfetcher download https://example.com/file.pdf ./file.pdf
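
The `robots` command presumably checks whether a path is allowed by a site's robots.txt (this reading of `-p` is an assumption). The same kind of check can be sketched with Python's standard `urllib.robotparser`, shown here against an inline robots.txt so it runs offline. `ROBOTS_TXT` and `is_allowed` are hypothetical names for this example, and this is not pyfetcher's own parser.

```python
from urllib.robotparser import RobotFileParser

# Example rules for illustration; a real check would download the site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```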

Infrastructure

Start Postgres + MinIO, apply the database migrations, then launch the pipeline workers:

make infra-up          # docker compose up
make migrate           # run Alembic migrations
make pipeline          # start crawl->scrape->download workers

See make help for all available targets.

Development

git clone https://github.com/pr1m8/pyfetcher.git
cd pyfetcher
make install-all       # pdm install -G dev -G full
make test              # 358 tests
make check             # format + lint + test

Documentation

Full documentation at pyfetcher.readthedocs.io.

License

MIT
