
Fetch everything, for agents. Universal data acquisition with smart routing.


Maestro

maestro-fetch

One interface. Any source. Agent-ready output.


Give it any URL -- web page, PDF, spreadsheet, cloud file, video, binary dataset -- and get back clean markdown or structured data. Smart routing picks the right adapter; pluggable browser backends handle anti-bot and authentication. No API key required.


Quickstart

For AI Agents

# Claude Code -- install as a skill (Vercel skills ecosystem)
npx skills add maestro-ai-stack/maestro-fetch -y -g

# Claude Code -- install as a plugin (marketplace)
/plugin marketplace add maestro-ai-stack/maestro-fetch
/plugin install maestro-fetch@maestro-fetch

Works with: Claude Code | Cursor | Codex | Gemini CLI | OpenCode | Trae -- and any other agent that speaks MCP or can run CLI tools.

For Developers

# Recommended (global command, no venv needed)
uv tool install maestro-fetch

# Or with all extras (PDF, media, browser, LLM, social)
uv tool install "maestro-fetch[all]"

# Classic pip
pip install maestro-fetch
mfetch "https://example.com"

Try it now:

$ mfetch "https://api.worldbank.org/v2/country/CN/indicator/NY.GDP.MKTP.CD?format=json&per_page=5"

## GDP (current US$) - China

| Year | GDP (USD)            |
|------|----------------------|
| 2024 | $17,794,782,410,032  |
| 2023 | $17,662,434,751,902  |
| 2022 | $17,963,170,547,847  |
| 2021 | $17,734,062,645,371  |
| 2020 | $14,687,674,437,370  |
$ mfetch "https://arxiv.org/pdf/2301.07041"

## Dissociating language and thought in large language models ...
(full paper text as clean markdown)

If you find this useful, consider giving it a star -- it helps others discover the project.


Why maestro-fetch?

AI agents need data from the web. Most rely on built-in tools like WebFetch (Claude Code), curl, or requests. Here's why mfetch is better:

mfetch vs built-in agent tools

| Dimension | mfetch | WebFetch (Claude Code built-in) |
|-----------|--------|---------------------------------|
| Speed | httpx direct — no LLM overhead | HTTP GET + small model processing (extra round-trip) |
| Token cost | Raw content → main model. Single pass. | Small model summarizes → main model reads summary. Double pass. |
| Content quality | Full raw markdown, tables as DataFrames, PDFs via Docling | Summarized by small model — large pages truncated, details lost |
| Recall rate | 4-tier browser fallback (Extension → CDP → httpx → Playwright), anti-bot bypass, login session reuse | Plain HTTP GET only — no JS rendering, no auth, WAF blocks fail |

mfetch vs other fetch tools

| Feature | mfetch | Firecrawl | Jina Reader | crawl4ai |
|---------|--------|-----------|-------------|----------|
| Source types | 7 adapters + community sources | Web only | Web only | Web only |
| PDF / Excel / CSV | Native (Docling + openpyxl) | Separate tool | No | No |
| Video transcription | yt-dlp + Whisper | No | No | No |
| Cloud storage | Google Drive, Dropbox, Baidu Pan | No | No | No |
| Binary datasets | GeoTIFF, NetCDF, Parquet, HDF5, Stata, ... | No | No | No |
| Browser backends | 4 pluggable (Extension, CDP, httpx, Playwright) | Hosted only | Hosted only | Playwright only |
| Auth / login reuse | CDP reuses Chrome sessions, cookie import | No | No | No |
| Hosting | Local, no API key required | SaaS ($) | SaaS ($) | Local |
| Community adapters | Extensible (economics, climate, social, ...) | No | No | No |
| Cache | SQLite + content-addressed + TTL + LRU | No | No | No |
| Batch operations | Concurrent with configurable parallelism | API-based | No | No |
| Interactive sessions | session start/click/fill/screenshot/eval | No | No | No |

maestro-fetch treats "fetch" as a universal problem -- not just web scraping. Give it any URI and it figures out the rest: route to the right adapter, pick a browser backend if needed, parse the content, return markdown or structured data.
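The "route to the right adapter" step can be pictured as first-match dispatch over URL patterns. The sketch below is illustrative only -- the pattern list and adapter names follow the priority order described later in the Architecture section, not maestro-fetch's actual source:

```python
import re

# Hypothetical sketch of pattern-based routing; patterns are illustrative,
# ordered by the documented adapter priority (BaiduPan > Cloud > Binary > Doc > Web).
ADAPTER_PATTERNS = [
    ("baidu_pan", re.compile(r"^https?://pan\.baidu\.com/")),
    ("cloud",     re.compile(r"^https?://((drive|docs)\.google\.com|www\.dropbox\.com)/")),
    ("binary",    re.compile(r"\.(zip|parquet|tif|nc|hdf5|shp|feather)($|\?)")),
    ("doc",       re.compile(r"\.(pdf|xlsx|xls|ods|csv)($|\?)")),
]

def route(url: str) -> str:
    """Return the first adapter whose pattern matches; everything else is web."""
    for adapter, pattern in ADAPTER_PATTERNS:
        if pattern.search(url):
            return adapter
    return "web"
```

Anything that falls through lands in the web adapter, which then runs its own backend fallback chain.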


Benchmarks

Tested on macOS (Apple Silicon), Python 3.11, uv 0.11.2. March 2026.

Installation

| Method | Time | Notes |
|--------|------|-------|
| uv tool install "maestro-fetch[all]" | ~8s (200 packages) | Global command, no venv management |
| pip install "maestro-fetch[all]" | ~45s | Requires manual venv setup |

Fetch speed (single URL, public static page)

| Tool | Pipeline | Latency |
|------|----------|---------|
| mfetch (httpx) | HTTP GET → html2text → raw markdown | ~200ms |
| mfetch (Extension/CDP) | Chrome tab → extract → markdown | ~500ms |
| WebFetch | HTTP GET → html2text → small LLM call → summary | ~2-5s |
| curl + manual parse | HTTP GET → raw HTML (no processing) | ~150ms |

Token efficiency

| Tool | Flow | Effective token cost |
|------|------|----------------------|
| mfetch | Raw content → main model (Opus/Sonnet) processes it | 1x |
| WebFetch | Small model processes content (hidden tokens) → summary → main model | ~2x (double pass) |

Content fidelity

| Scenario | mfetch | WebFetch |
|----------|--------|----------|
| 10 KB HTML page | 100% content preserved | ~90% (minor summarization) |
| 100 KB HTML page | 100% content preserved | ~60% (significant truncation) |
| PDF with tables | Tables as DataFrames, full text | Not supported |
| JS-rendered SPA | Full render via Extension/CDP | Fails (no JS engine) |
| Login-required page | CDP reuses Chrome session | Fails (no auth) |

Supported Sources

| Adapter | Source types | Examples |
|---------|--------------|----------|
| web | HTML pages, APIs, SPAs | Any URL; falls back through Extension → CDP → httpx → Playwright |
| doc | Documents and spreadsheets | .pdf, .xlsx, .xls, .ods, .csv |
| binary | Archives, geospatial, data science | .zip, .parquet, .tif, .nc, .hdf5, .shp, .feather |
| cloud | Cloud storage | Google Drive, Google Docs/Sheets, Dropbox |
| media | Video and audio | YouTube, Vimeo (transcription via yt-dlp + Whisper) |
| baidu_pan | Baidu Pan | pan.baidu.com links via OAuth + PCS API |
| browser | Authenticated / JS-heavy pages | Playwright interactive sessions |
| source | Community adapters | World Bank, FRED, NOAA, academic datasets, ... |

CLI Usage

Fetch any URL

mfetch "https://example.com"                       # auto-detect, markdown output
mfetch "https://example.com/report.pdf"            # PDF -> markdown
mfetch "https://example.com" --output json         # JSON output
mfetch "https://example.com" --timeout 120         # custom timeout
mfetch "https://example.com" --batch urls.txt      # batch from file

Community source adapters

mfetch source update                               # pull latest adapters
mfetch source list                                 # show all adapters
mfetch source list --category economics            # filter by category
mfetch source info worldbank/gdp                   # show args and examples
mfetch source run worldbank/gdp CN                 # fetch World Bank GDP for China

Interactive browser sessions

mfetch session start "https://login-required.com"
mfetch session fill "#email" "user@example.com"
mfetch session click "#submit"
mfetch session snapshot                            # current page as markdown
mfetch session screenshot                          # save screenshot
mfetch session end

Cache management

mfetch cache list                                  # show cached entries
mfetch cache clear                                 # clear all
mfetch cache clear --older-than 7d                 # evict old entries

Configuration

mfetch config init                                 # generate ~/.maestro-fetch/config.toml
mfetch config show                                 # display current config

Python SDK

from maestro_fetch import fetch, batch_fetch

# Auto-detect and fetch
result = await fetch("https://example.com/data")
result.content       # markdown text
result.source_type   # "web" | "doc" | "cloud" | "media" | "binary"
result.tables        # list[pd.DataFrame] (if tabular data found)
result.metadata      # provenance dict
result.raw_path      # Path to cached raw file

# Batch with concurrency
results = await batch_fetch(urls, concurrency=10)

# LLM structured extraction (requires ANTHROPIC_API_KEY or OPENAI_API_KEY)
result = await fetch(
    "https://worldbank.org/report.pdf",
    schema={"country": str, "gdp": float},
    provider="anthropic",
)
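Because `fetch` and `batch_fetch` are coroutines, a plain script drives them with `asyncio.run`. The concurrency limit in `batch_fetch` is the classic semaphore-bounded gather pattern; the sketch below uses a stand-in coroutine instead of the real `fetch()` so it runs without network access -- it demonstrates the pattern, not the library's internals:

```python
import asyncio

async def fake_fetch(url: str) -> str:
    """Stand-in for the real fetch(); no network needed."""
    await asyncio.sleep(0)
    return f"content of {url}"

async def bounded_batch(urls: list[str], concurrency: int = 10) -> list[str]:
    sem = asyncio.Semaphore(concurrency)

    async def one(url: str) -> str:
        async with sem:              # at most `concurrency` fetches in flight
            return await fake_fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(u) for u in urls))

results = asyncio.run(bounded_batch([f"https://example.com/{i}" for i in range(25)]))
```

Swapping `fake_fetch` for `maestro_fetch.fetch` gives the same shape as `batch_fetch(urls, concurrency=10)` in the SDK snippet above.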

Installation

Recommended: uv (global command, no venv)

uv tool install maestro-fetch                # core only
uv tool install "maestro-fetch[all]"         # everything (PDF, media, browser, LLM, social)

pip

pip install maestro-fetch                    # core
pip install maestro-fetch[pdf]               # PDF + Excel (Docling, openpyxl)
pip install maestro-fetch[media]             # YouTube/audio (yt-dlp, Whisper)
pip install maestro-fetch[browser]           # Interactive sessions (Playwright)
pip install maestro-fetch[anthropic]         # Claude LLM extraction
pip install maestro-fetch[openai]            # GPT LLM extraction
pip install maestro-fetch[social]            # Twitter/Reddit API adapters
pip install maestro-fetch[all]               # Everything

Development setup

git clone https://github.com/maestro-ai-stack/maestro-fetch.git
cd maestro-fetch
uv sync --extra dev                          # or: python3.11 -m venv .venv && pip install -e ".[dev]"
pytest tests/ -v

Works With

maestro-fetch integrates as a tool or skill in these AI agent environments:

  • Claude Code -- via skills ecosystem or plugin marketplace
  • Cursor -- as a CLI tool in agent mode
  • OpenAI Codex -- as a shell tool
  • Gemini CLI -- as an MCP tool
  • OpenCode / Trae -- via CLI or MCP bridge

See the maestro-fetch skill definition for integration details.


Architecture

CLI / SDK / MCP
       ↓
   Router (URL type detection via regex)
       ↓
   Adapter dispatch (priority: BaiduPan > Cloud > Binary > Doc > Web)
       ↓
   Web adapter fallback chain:
       Extension (real Chrome + opencli daemon, full auth)
           ↓ fail/unavailable
       CDP (Chrome DevTools Protocol, session reuse)
           ↓ fail/unavailable
       httpx (plain async GET, fastest for static pages)
           ↓ fail/WAF detected
       Playwright (headless Chromium, anti-bot stealth)
       ↓
   Optional: LLM extraction (--schema)
       ↓
   Cache (SQLite + content-addressed files, TTL)
       ↓
   FetchResult → markdown | json | csv | parquet

Router decision chain: (1) match community source adapter (@meta) → dispatch to source; (2) match built-in adapter by URL pattern → dispatch directly; (3) web fallback chain for everything else.
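The web adapter's 4-tier chain reduces to "try backends in priority order, return the first success". The backends below are stand-ins so the sketch is self-contained; the failure modes in the comments mirror the diagram above:

```python
class BackendUnavailable(Exception):
    """Raised when a tier cannot serve the request (daemon missing, WAF, etc.)."""

# Stand-in backends: the first two simulate unavailable tiers,
# httpx succeeds for this static-page scenario.
def extension(url): raise BackendUnavailable("no opencli daemon running")
def cdp(url): raise BackendUnavailable("Chrome not listening on a debug port")
def httpx_get(url): return f"<html>static content of {url}</html>"
def playwright(url): return f"<html>rendered content of {url}</html>"

def fetch_with_fallback(url, chain=(extension, cdp, httpx_get, playwright)):
    errors = []
    for backend in chain:
        try:
            return backend(url)
        except BackendUnavailable as exc:   # fall through to the next tier
            errors.append(f"{backend.__name__}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

In this scenario the chain degrades gracefully to httpx; a WAF-blocked page would instead raise from the httpx tier and land on Playwright.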


Configuration

Config lives at ~/.maestro-fetch/config.toml. Generate with mfetch config init.

[cache]
max_size = "5GB"
default_ttl = 86400

[backends]
priority = ["extension", "cdp", "playwright"]

[backends.extension]
enabled = true
port = 19825

[backends.cdp]
endpoint = "http://127.0.0.1:9222"

Storage: ~/.maestro-fetch/ contains config.toml, cache.db, cache/, sources/, custom/, auth/.


Roadmap

0.3.x — Polish

  • Streaming output — yield chunks as they arrive for long pages and large PDFs
  • MCP server — expose mfetch as an MCP tool for any agent (FastMCP)
  • Retry with backoff — configurable retry policy per adapter
  • mfetch pipe — stdin/stdout piping for Unix composability

0.4.x — Power

  • Parallel batch with progress — tqdm progress bar, per-URL status reporting
  • Diff mode — mfetch diff <url> compares cached vs live content, shows delta
  • Schema library — pre-built extraction schemas for common pages (arXiv, PubMed, SEC filings, ...)
  • Proxy rotation — SOCKS5/HTTP proxy support for high-volume scraping

1.0 — Fetch Anything

Any URI scheme → mfetch <uri> → clean structured output.

  • Database — mfetch postgres://... / mfetch bigquery://... → DataFrame
  • Cloud objects — mfetch s3://bucket/key / mfetch gs://... / mfetch az://...
  • FTP/SFTP — mfetch sftp://host/path
  • Email — mfetch imap://... → extract attachments and body
  • Torrent — mfetch magnet:?xt=...
  • IPFS — mfetch ipfs://Qm...
  • Real-time feeds — mfetch ws://... / mfetch mqtt://...
  • Plugin marketplace — mfetch plugin install <name>
  • Watch mode — mfetch watch <url> --interval 5m with change detection

Contributing

Core improvements -- open issues and PRs on this repo.

New source adapters -- add a Python file to src/maestro_fetch/sources/community/. Each adapter is a single file with an @meta header and an async def run(ctx, ...) function.
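A hypothetical adapter skeleton is shown below. The exact `@meta` header fields and the `ctx` interface are not documented in this README, so both are illustrative placeholders -- check the existing files in `src/maestro_fetch/sources/community/` for the real contract:

```python
import asyncio

# Hypothetical community adapter skeleton. The @meta fields and the ctx
# argument are placeholders, not the documented schema.
# @meta
#   name: example/hello
#   category: demo

async def run(ctx, name: str = "world") -> str:
    # A real adapter would use ctx to fetch and parse remote data;
    # this one just returns markdown directly.
    return f"## Hello\n\n{name}"

# Adapters are invoked via `mfetch source run <name> <args>`; here we
# call run() directly for illustration, with a None ctx.
output = asyncio.run(run(None, "maestro-fetch"))
```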


License

MIT


Built by Maestro — Singapore AI product studio.



Download files

Download the file for your platform.

Source Distribution

maestro_fetch-0.2.5.tar.gz (410.2 kB)


Built Distribution


maestro_fetch-0.2.5-py3-none-any.whl (100.8 kB)


File details

Details for the file maestro_fetch-0.2.5.tar.gz.

File metadata

  • Download URL: maestro_fetch-0.2.5.tar.gz
  • Size: 410.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for maestro_fetch-0.2.5.tar.gz
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 4d7f2241355fc61c061e0af93eb2a6c94e1611adde7f8e2e5c80528812be5c91 |
| MD5 | 9623b8df5e3c52a1a78cea5d81c1b90d |
| BLAKE2b-256 | 86b5d7d8429685c8d182944a095f9a7bfc17519c6723f9311b0aabaab24b9f91 |


File details

Details for the file maestro_fetch-0.2.5-py3-none-any.whl.

File metadata

  • Download URL: maestro_fetch-0.2.5-py3-none-any.whl
  • Size: 100.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for maestro_fetch-0.2.5-py3-none-any.whl
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 820ccdc5f4f150a19bfef71eb2ab26538b8718762c55c6bd0f98cce7449cf058 |
| MD5 | 574dba189a99f5537298855cd18a8eca |
| BLAKE2b-256 | a48aeeead4bcff3a2778f991e650aab3d5498c7b0804ba63ecd835084360edc7 |
