pyfetcher
Advanced web fetching, scraping, and content acquisition toolkit for Python. From simple HTTP requests to full crawl-scrape-download pipelines backed by Postgres and MinIO.
Features
Core
- Realistic browser headers -- 11 browser profiles (Chrome, Firefox, Safari, Edge) across platforms with consistent UA, Client Hints, and Sec-Fetch-* headers. Market-share-weighted rotation.
- 4 HTTP backends -- httpx, aiohttp, curl_cffi (TLS fingerprinting), cloudscraper (Cloudflare bypass).
- Rate limiting -- Per-domain and global token bucket rate limiting.
- Retry -- Configurable exponential backoff with retryable status codes via Tenacity.
- Scraping -- CSS selectors, link harvesting, form parsing, robots.txt, sitemap parsing, content extraction.
- Metadata -- HTML meta, Open Graph, JSON-LD, microdata, RDFa, Dublin Core.
- CLI & TUI -- pyfetcher CLI with 6 commands + interactive Textual TUI.
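The retry bullet above mentions exponential backoff over retryable status codes. The following is a minimal standard-library sketch of that idea, not pyfetcher's Tenacity-based implementation. The names `RETRYABLE_STATUS`, `RetryableError`, `backoff_delay`, and `call_with_retry`, and all default values, are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical set of status codes worth retrying; the library's defaults may differ.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class RetryableError(Exception):
    """Raised by the wrapped function when the attempt should be retried."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay after failed attempt number `attempt` (1-based): base * 2**(attempt - 1), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it returns, re-raising once max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts:
                raise
            sleep(backoff_delay(attempt, base, cap))
    raise ValueError("max_attempts must be at least 1")
```

A caller would raise `RetryableError` inside `func` when a response status is in `RETRYABLE_STATUS`. The `sleep` parameter is injectable so the schedule can be checked without waiting.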
Infrastructure (optional)
- Event-driven pipeline -- Crawl -> Scrape -> Download stages via Postgres LISTEN/NOTIFY.
- Database -- SQLAlchemy 2.0 async + Alembic migrations. Models for jobs, pages, media, hosts, feeds, URL dedup.
- Object storage -- MinIO/S3 via aioboto3 with presigned URLs and key generation.
- Downloaders -- Deep yt-dlp integration (progress hooks, info_dict), gallery-dl (job API), direct HTTP streaming.
- Extractors -- trafilatura + readability-lxml fallback, html2text, markdownify, media metadata (audio/video/image/PDF).
- Crawler -- URL frontier with dedup, spider + router, politeness enforcement, RSS/Atom feed monitoring.
- Docker Compose -- Postgres 17 + MinIO with health checks, .env config, Alembic migrations.
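The crawler bullet above describes a URL frontier with dedup. One way such a frontier could work is sketched below with an in-memory queue and a seen-set. The names `normalize_url` and `UrlFrontier` are hypothetical, and the real crawler presumably persists its dedup state in Postgres rather than in a Python set.

```python
from collections import deque
from typing import Deque, Optional, Set
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the fragment, lowercase scheme and host, and default an empty path to '/'."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class UrlFrontier:
    """FIFO queue that never enqueues the same normalized URL twice."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._seen: Set[str] = set()

    def add(self, url: str) -> bool:
        """Enqueue url; return False if an equivalent URL was already seen."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the frontier is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)
```

The normalization here is deliberately small. A production frontier would likely also handle default ports, trailing slashes, and query-parameter ordering.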
Installation
pip install pyfetcher
Optional dependency groups:
pip install 'pyfetcher[tui]' # Textual TUI
pip install 'pyfetcher[metadata]' # extruct structured data
pip install 'pyfetcher[curl]' # curl_cffi TLS fingerprinting
pip install 'pyfetcher[cloudscraper]' # Cloudflare bypass
pip install 'pyfetcher[db]' # Postgres + SQLAlchemy + Alembic
pip install 'pyfetcher[store]' # MinIO/S3 object storage
pip install 'pyfetcher[pipeline]' # db + store (full pipeline)
pip install 'pyfetcher[downloaders]' # yt-dlp + gallery-dl
pip install 'pyfetcher[extractors]' # trafilatura, readability, html2text
pip install 'pyfetcher[media]' # mutagen, pymediainfo, exifread, pypdf
pip install 'pyfetcher[browser]' # Playwright + stealth
pip install 'pyfetcher[feeds]' # feedparser + dateparser
pip install 'pyfetcher[full]' # Everything
Quick Start
Fetch a URL
from pyfetcher import fetch, FetchRequest
response = fetch("https://example.com")
print(response.status_code, response.ok)
Async Fetch
import asyncio
from pyfetcher import afetch
async def main():
    response = await afetch("https://example.com")
    print(response.status_code)

asyncio.run(main())
Browser Profiles
from pyfetcher.headers.browser import BrowserHeaderProvider
from pyfetcher.headers.rotating import RotatingHeaderProvider
from pyfetcher.fetch.service import FetchService
# Fixed profile
service = FetchService(header_provider=BrowserHeaderProvider("chrome_win"))
# Rotating profiles (weighted by market share)
service = FetchService(header_provider=RotatingHeaderProvider())
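Market-share-weighted rotation can be pictured as a weighted random choice over profile names. The sketch below shows only that selection step. The weights and every profile name other than `chrome_win` are made up for illustration, and `pick_profile` is a hypothetical helper, not part of `RotatingHeaderProvider`.

```python
import random
from typing import Dict

# Hypothetical profile names and weights, for illustration only;
# these are not pyfetcher's profile table or real market-share figures.
PROFILE_WEIGHTS: Dict[str, float] = {
    "chrome_win": 0.50,
    "safari_mac": 0.25,
    "edge_win": 0.15,
    "firefox_linux": 0.10,
}


def pick_profile(rng: random.Random, weights: Dict[str, float] = PROFILE_WEIGHTS) -> str:
    """Pick one profile name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random()
    print([pick_profile(rng) for _ in range(5)])
```

Passing an explicit `random.Random` instance keeps the selection reproducible in tests when a seed is supplied.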
Scraping
from pyfetcher.scrape import extract_links, extract_text, extract_readable_text
links = extract_links(html, base_url="https://example.com")
headings = extract_text(html, "h1")
content = extract_readable_text(html)
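The snippet above assumes `html` already holds a page's HTML as a string. To show roughly what link harvesting with a base URL involves, here is a standard-library sketch. `LinkCollector` and `collect_links` are hypothetical names, and this is not how `extract_links` is implemented.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html: str, base_url: str) -> List[str]:
    """Return every anchor href in document order, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `collect_links('<a href="/about">About</a>', "https://example.com/docs/")` resolves the relative link to `https://example.com/about`.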
Rate-Limited Fetching
from pyfetcher.fetch.service import FetchService
from pyfetcher.ratelimit.limiter import DomainRateLimiter, RateLimitPolicy
limiter = DomainRateLimiter(
    default_policy=RateLimitPolicy(requests_per_second=2.0, burst=5),
    domain_policies={"api.example.com": RateLimitPolicy(requests_per_second=0.5)},
)
service = FetchService(rate_limiter=limiter)
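The `requests_per_second` and `burst` parameters map naturally onto a token bucket: tokens refill at a steady rate up to a maximum, and each request spends one. The class below is a minimal sketch of that algorithm under those assumptions. `TokenBucket` is a hypothetical name, and `DomainRateLimiter` may be implemented differently.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(
        self,
        rate: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self._clock = clock
        self._tokens = float(burst)  # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Spend one token if available; return whether the request may proceed."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token is available (0.0 if one is ready now)."""
        self._refill()
        return max(0.0, (1.0 - self._tokens) / self.rate)
```

With `rate=2.0` and `burst=5`, five requests pass immediately, the sixth is refused, and one more token becomes available after half a second. A per-domain limiter could keep one such bucket per hostname.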
Content Extraction
from pyfetcher.extractors.content import extract_article_text
from pyfetcher.extractors.convert import html_to_markdown
text = extract_article_text(html, url="https://example.com/article")
markdown = html_to_markdown(html)
yt-dlp Integration
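import asyncio
from pyfetcher.downloaders.ytdlp import YtdlpDownloader

async def main():
    dl = YtdlpDownloader()
    # Both calls are coroutines, so they must run inside an event loop.
    info = await dl.extract_info("https://youtube.com/watch?v=...")
    results = await dl.download("https://youtube.com/watch?v=...", output_dir="./downloads")

asyncio.run(main())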
Pipeline (Crawl -> Scrape -> Download)
import asyncio
from pyfetcher.pipeline.runner import PipelineRunner
from pyfetcher.config import PyfetcherConfig

async def main():
    runner = PipelineRunner(PyfetcherConfig())
    await runner.start()  # Runs all 3 stages with Postgres job queue

asyncio.run(main())
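To make the stage hand-off concrete, the sketch below chains three workers with in-memory `asyncio.Queue` objects. It is a stand-in for illustration only: the real pipeline is described as using a Postgres job queue with LISTEN/NOTIFY, and the function names `stage` and `run_pipeline` are hypothetical.

```python
import asyncio
from typing import List, Optional, Tuple


async def stage(
    name: str,
    inbox: asyncio.Queue,
    outbox: Optional[asyncio.Queue],
    log: List[Tuple[str, str]],
) -> None:
    """Consume jobs from inbox, record them, and forward them to the next stage."""
    while True:
        job = await inbox.get()
        if job is None:  # sentinel: shut down and pass the signal downstream
            if outbox is not None:
                await outbox.put(None)
            return
        log.append((name, job))  # a real stage would crawl, scrape, or download here
        if outbox is not None:
            await outbox.put(job)


async def run_pipeline(urls: List[str]) -> List[Tuple[str, str]]:
    """Push every URL through crawl -> scrape -> download and return the event log."""
    crawl_q: asyncio.Queue = asyncio.Queue()
    scrape_q: asyncio.Queue = asyncio.Queue()
    download_q: asyncio.Queue = asyncio.Queue()
    log: List[Tuple[str, str]] = []
    workers = [
        asyncio.create_task(stage("crawl", crawl_q, scrape_q, log)),
        asyncio.create_task(stage("scrape", scrape_q, download_q, log)),
        asyncio.create_task(stage("download", download_q, None, log)),
    ]
    for url in urls:
        await crawl_q.put(url)
    await crawl_q.put(None)
    await asyncio.gather(*workers)
    return log
```

Each job appears in the log once per stage, always in crawl, scrape, download order. Swapping the queues for database-backed job tables would let the stages run in separate processes.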
CLI
pyfetcher fetch https://example.com
pyfetcher fetch https://example.com -o json -b curl_cffi
pyfetcher headers --profile chrome_win
pyfetcher headers --list
pyfetcher scrape https://example.com --css "h1"
pyfetcher scrape https://example.com --links -o json
pyfetcher scrape https://example.com --text
pyfetcher user-agent --browser chrome --count 5
pyfetcher robots https://example.com -p /admin
pyfetcher download https://example.com/file.pdf ./file.pdf
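The robots command above presumably checks whether a path is allowed by a site's robots.txt. The same kind of check can be sketched with the standard library's `urllib.robotparser`. The sample rules, the `pyfetcher` user-agent string, and the helper `is_allowed` are assumptions for illustration and do not reflect the CLI's internals.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration; a real check would download the site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "pyfetcher", "https://example.com/admin"))
    print(is_allowed(ROBOTS_TXT, "pyfetcher", "https://example.com/blog"))
```

Python's parser applies the first matching rule in file order, so the `Disallow: /admin` line is listed before the catch-all `Allow: /`.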
Infrastructure
Start Postgres + MinIO:
make infra-up # docker compose up
make migrate # run Alembic migrations
make pipeline # start crawl->scrape->download workers
See make help for all available targets.
Development
git clone https://github.com/pr1m8/pyfetcher.git
cd pyfetcher
make install-all # pdm install -G dev -G full
make test # 358 tests
make check # format + lint + test
Documentation
Full documentation at pyfetcher.readthedocs.io.
License
MIT