Advanced web fetching, scraping, and content acquisition toolkit with crawl-scrape-download pipeline

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.13
Topic
- Internet :: WWW/HTTP
- Software Development :: Libraries :: Python Modules
Typing
- Typed

Project description

pyfetcher

Advanced web fetching, scraping, and content acquisition toolkit for Python. From simple HTTP requests to full crawl-scrape-download pipelines backed by Postgres and MinIO.

Features

Core

Realistic browser headers -- 11 browser profiles (Chrome, Firefox, Safari, Edge) across platforms with consistent UA, Client Hints, and Sec-Fetch-* headers. Market-share-weighted rotation.
4 HTTP backends -- httpx, aiohttp, curl_cffi (TLS fingerprinting), cloudscraper (Cloudflare bypass).
Rate limiting -- Per-domain and global token bucket rate limiting.
Retry -- Configurable exponential backoff with retryable status codes via Tenacity.
Scraping -- CSS selectors, link harvesting, form parsing, robots.txt, sitemap parsing, content extraction.
Metadata -- HTML meta, Open Graph, JSON-LD, microdata, RDFa, Dublin Core.
CLI & TUI -- pyfetcher CLI with 6 commands + interactive Textual TUI.
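
The retry behaviour described above can be pictured with a small standard-library sketch. It is not pyfetcher's implementation (the README says that is built on Tenacity); `RetryableError`, `backoff_delays`, and `call_with_retry` are hypothetical names used only to show how exponential backoff spaces out attempts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical error a caller raises for a transient failure (e.g. HTTP 503)."""


def backoff_delays(attempts: int, base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0) -> list[float]:
    """Delays before each retry: base, base*factor, ... never above cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_retry(func: Callable[[], T], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on RetryableError with exponential backoff."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; production code would usually add random jitter to the delays as well.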

Infrastructure (optional)

Event-driven pipeline -- Crawl -> Scrape -> Download stages via Postgres LISTEN/NOTIFY.
Database -- SQLAlchemy 2.0 async + Alembic migrations. Models for jobs, pages, media, hosts, feeds, URL dedup.
Object storage -- MinIO/S3 via aioboto3 with presigned URLs and key generation.
Downloaders -- Deep yt-dlp integration (progress hooks, info_dict), gallery-dl (job API), direct HTTP streaming.
Extractors -- trafilatura + readability-lxml fallback, html2text, markdownify, media metadata (audio/video/image/PDF).
Crawler -- URL frontier with dedup, spider + router, politeness enforcement, RSS/Atom feed monitoring.
Docker Compose -- Postgres 17 + MinIO with health checks, .env config, Alembic migrations.

Installation

pip install pyfetcher

Optional dependency groups:

pip install 'pyfetcher[tui]'          # Textual TUI
pip install 'pyfetcher[metadata]'      # extruct structured data
pip install 'pyfetcher[curl]'          # curl_cffi TLS fingerprinting
pip install 'pyfetcher[cloudscraper]'  # Cloudflare bypass
pip install 'pyfetcher[db]'            # Postgres + SQLAlchemy + Alembic
pip install 'pyfetcher[store]'         # MinIO/S3 object storage
pip install 'pyfetcher[pipeline]'      # db + store (full pipeline)
pip install 'pyfetcher[downloaders]'   # yt-dlp + gallery-dl
pip install 'pyfetcher[extractors]'    # trafilatura, readability, html2text
pip install 'pyfetcher[media]'         # mutagen, pymediainfo, exifread, pypdf
pip install 'pyfetcher[browser]'       # Playwright + stealth
pip install 'pyfetcher[feeds]'         # feedparser + dateparser
pip install 'pyfetcher[full]'          # Everything

Quick Start

Fetch a URL

from pyfetcher import fetch, FetchRequest

response = fetch("https://example.com")
print(response.status_code, response.ok)

Async Fetch

import asyncio
from pyfetcher import afetch

async def main():
    response = await afetch("https://example.com")
    print(response.status_code)

asyncio.run(main())

Browser Profiles

from pyfetcher.headers.browser import BrowserHeaderProvider
from pyfetcher.headers.rotating import RotatingHeaderProvider
from pyfetcher.fetch.service import FetchService

# Fixed profile
service = FetchService(header_provider=BrowserHeaderProvider("chrome_win"))

# Rotating profiles (weighted by market share)
service = FetchService(header_provider=RotatingHeaderProvider())
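
How a market-share-weighted rotation might choose a profile can be sketched with `random.choices`. This is an illustration only, not the internals of `RotatingHeaderProvider`: apart from `chrome_win`, the profile names and all weights below are made up.

```python
import random

# Hypothetical profile names and weights, for illustration only;
# only "chrome_win" appears in the README examples.
PROFILE_WEIGHTS = {
    "chrome_win": 0.50,
    "safari_mac": 0.20,
    "edge_win": 0.15,
    "firefox_win": 0.15,
}


def pick_profile(rng: random.Random) -> str:
    """Draw one profile name with probability proportional to its weight."""
    names = list(PROFILE_WEIGHTS)
    weights = list(PROFILE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

Calling `pick_profile(random.Random())` once per request yields a mix of profiles that, over many requests, roughly follows the configured weights.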

Scraping

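from pyfetcher.scrape import extract_links, extract_text, extract_readable_text

# Sample document; in practice this is the body of a fetched page
html = '<html><body><h1>Hello</h1><a href="/about">About</a></body></html>'

links = extract_links(html, base_url="https://example.com")
headings = extract_text(html, "h1")  # second argument is a CSS selector
content = extract_readable_text(html)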
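
For readers curious what link harvesting involves, the following standard-library sketch collects `<a href>` targets and resolves them against a base URL. It is not how `extract_links` is implemented; `LinkCollector` and `collect_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without a usable href
            self.links.append(urljoin(self.base_url, href))


def collect_links(html: str, base_url: str) -> list[str]:
    """Return all link targets in document order, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Relative links such as `/about` become absolute, while links that are already absolute pass through unchanged.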

Rate-Limited Fetching

from pyfetcher.fetch.service import FetchService
from pyfetcher.ratelimit.limiter import DomainRateLimiter, RateLimitPolicy

limiter = DomainRateLimiter(
    default_policy=RateLimitPolicy(requests_per_second=2.0, burst=5),
    domain_policies={"api.example.com": RateLimitPolicy(requests_per_second=0.5)},
)
service = FetchService(rate_limiter=limiter)
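
The `requests_per_second` and `burst` parameters map naturally onto a token bucket. The sketch below shows the general technique under that assumption; `TokenBucket` is a hypothetical class, not pyfetcher's `DomainRateLimiter`, and it only answers "may I send now?" rather than waiting.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-domain limiter can then be a dictionary mapping each hostname to its own bucket, with a default bucket configuration for hosts that have no explicit policy.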

Content Extraction

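from pyfetcher.extractors.content import extract_article_text
from pyfetcher.extractors.convert import html_to_markdown

# Sample document; in practice this is the body of a fetched page
html = "<html><body><article><h1>Title</h1><p>Body text.</p></article></body></html>"

text = extract_article_text(html, url="https://example.com/article")
markdown = html_to_markdown(html)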

yt-dlp Integration

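import asyncio
from pyfetcher.downloaders.ytdlp import YtdlpDownloader

async def main():
    dl = YtdlpDownloader()
    # Both calls are coroutines, so they must run inside an event loop
    info = await dl.extract_info("https://youtube.com/watch?v=...")
    results = await dl.download("https://youtube.com/watch?v=...", output_dir="./downloads")

asyncio.run(main())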

Pipeline (Crawl -> Scrape -> Download)

import asyncio
from pyfetcher.pipeline.runner import PipelineRunner
from pyfetcher.config import PyfetcherConfig

async def main():
    runner = PipelineRunner(PyfetcherConfig())
    await runner.start()  # Runs all 3 stages with Postgres job queue

asyncio.run(main())
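
The staged hand-off can be pictured with an in-memory stand-in. The sketch below chains three workers with `asyncio.Queue`; it deliberately leaves out Postgres and LISTEN/NOTIFY, and `stage` and `run_pipeline` are hypothetical names rather than anything from `pyfetcher.pipeline`.

```python
import asyncio


async def stage(name, inbox, outbox, results):
    """Consume items from inbox, tag them, and hand them to the next stage."""
    while True:
        item = await inbox.get()
        if item is None:  # sentinel: pass it downstream and stop
            if outbox is not None:
                await outbox.put(None)
            return
        tagged = f"{item}|{name}"
        if outbox is not None:
            await outbox.put(tagged)
        else:
            results.append(tagged)  # last stage records the finished item


async def run_pipeline(urls):
    """Push URLs through crawl -> scrape -> download and return the results."""
    crawl_q, scrape_q, download_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    results = []
    workers = [
        asyncio.create_task(stage("crawl", crawl_q, scrape_q, results)),
        asyncio.create_task(stage("scrape", scrape_q, download_q, results)),
        asyncio.create_task(stage("download", download_q, None, results)),
    ]
    for url in urls:
        await crawl_q.put(url)
    await crawl_q.put(None)  # signal that no more work is coming
    await asyncio.gather(*workers)
    return results
```

Running `asyncio.run(run_pipeline(["https://example.com/a"]))` shows each item passing through all three stages in order. A database-backed queue adds what this sketch lacks: persistence across restarts and workers in separate processes.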

CLI

pyfetcher fetch https://example.com
pyfetcher fetch https://example.com -o json -b curl_cffi
pyfetcher headers --profile chrome_win
pyfetcher headers --list
pyfetcher scrape https://example.com --css "h1"
pyfetcher scrape https://example.com --links -o json
pyfetcher scrape https://example.com --text
pyfetcher user-agent --browser chrome --count 5
pyfetcher robots https://example.com -p /admin
pyfetcher download https://example.com/file.pdf ./file.pdf
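
A robots.txt path check of the kind the `robots` command appears to perform can be reproduced with Python's standard library. The sketch below uses `urllib.robotparser` on an inline sample file; it is not pyfetcher's code, and `ROBOTS_TXT` and `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, made up for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)
```

With the sample rules, any path under `/admin` is refused for every user agent while the site root stays fetchable.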

Infrastructure

Start Postgres + MinIO:

make infra-up          # docker compose up
make migrate           # run Alembic migrations
make pipeline          # start crawl->scrape->download workers

See make help for all available targets.

Development

git clone https://github.com/pr1m8/pyfetcher.git
cd pyfetcher
make install-all       # pdm install -G dev -G full
make test              # 358 tests
make check             # format + lint + test

Documentation

Full documentation at pyfetcher.readthedocs.io.

License

MIT

Release history

0.3.1 (Apr 1, 2026)
0.3.0 (Apr 1, 2026)
0.2.0 (Apr 1, 2026), this version

Download files

Source Distribution

fetchkit-0.2.0.tar.gz (101.6 kB)

Uploaded Apr 1, 2026 Source

Built Distribution

fetchkit-0.2.0-py3-none-any.whl (110.5 kB)

Uploaded Apr 1, 2026 Python 3

File details

Details for the file fetchkit-0.2.0.tar.gz.

File metadata

Download URL: fetchkit-0.2.0.tar.gz
Upload date: Apr 1, 2026
Size: 101.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0

File hashes

Hashes for fetchkit-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`9808646d0801e9d9b5f3d307c342b0b04ddb454d556f8a7010f45aaa94c0badc`
MD5	`e34cd2710d3dd77b8e90889fdb5602c3`
BLAKE2b-256	`47fde060d024731ef65daedfdbafafb147827dc867f3e92b6982a95c91f507ef`

File details

Details for the file fetchkit-0.2.0-py3-none-any.whl.

File metadata

Download URL: fetchkit-0.2.0-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 110.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: pdm/2.26.7 CPython/3.13.6 Darwin/23.6.0

File hashes

Hashes for fetchkit-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b9c82d69b81bd275841f9c701e816b0c3add6685872554ee7be386e26d64da1`
MD5	`97ea8a0e29ca8b70806da8c4ef98bd9a`
BLAKE2b-256	`f1a6e5a5e5270f2592073f9bfdcb32989c78d7f9ccc82dc884c0725eeb39c960`
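
A downloaded file can be checked against the SHA256 digests listed above with a few lines of standard-library Python. This is a generic sketch, not part of the package; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches(Path("fetchkit-0.2.0.tar.gz"), "9808646d0801e9d9b5f3d307c342b0b04ddb454d556f8a7010f45aaa94c0badc")` should return True for an intact copy of the source distribution.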

fetchkit 0.2.0
