
Research-data acquisition MCP — find and fetch datasets across archives, omics registries, and literature


🔎 data-aggregator-mcp

One MCP server to find and fetch research data across archives, omics registries, and literature — behind a single normalized model.


search runs one query across Zenodo, DataCite (Dryad / Figshare / Dataverse / OSF / Mendeley), NCBI omics (GEO / SRA / BioProject), and literature (PubMed / OpenAIRE), and returns results that are deduplicated, normalized, and cross-linked. resolve takes any hit to its file manifest, citation, and the data it points at. fetch downloads it to disk with checksum verification.

mcp-name: io.github.musharna/data-aggregator-mcp

[Demo image: data-aggregator-mcp over stdio, showing initialize, tools/list (search, resolve, fetch, list_sources), and a live list_sources call with the four wired sources]

✨ Why this

Most data MCPs wrap a single source. This one unifies them behind four tools and one DataResource model, so an agent searches once and gets back comparable records:

  • Multi-domain, one model — generalist archives + raw omics + literature, deduplicated by DOI (the fetchable record wins over bare metadata).
  • Taxonomy synonym expansion: organism="Orobanche aegyptiaca" also matches Phelipanche aegyptiaca (NCBI Taxonomy), so a species rename doesn't cost you results.
  • Paper → data bridge — resolve a paper and get links to the GEO / SRA / BioProject / DataCite records it produced.
  • Verified fetch — streams to disk with md5 verification where the source exposes a checksum, optional archive unpacking, and a fail-loud integrity sniff that rejects an HTML paywall page served as a "PDF".
  • Citations, access & full text — render a citation in any CSL style, get normalized access/license, and pull open-access full text — all in one resolve.
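
The DOI-based deduplication described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the `Record` class and its field names are hypothetical stand-ins for DataResource, and "fetchable" is approximated as "lists at least one file".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a DataResource; field names are assumptions."""
    id: str
    doi: Optional[str] = None
    files: list = field(default_factory=list)


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common resolver prefixes (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def dedupe_by_doi(records):
    """Keep one record per DOI, preferring a record that lists files."""
    best = {}
    no_doi = []
    for rec in records:
        if not rec.doi:
            no_doi.append(rec)  # nothing to merge on, keep as-is
            continue
        key = normalize_doi(rec.doi)
        current = best.get(key)
        # Replace only when the newcomer is fetchable and the incumbent is not.
        if current is None or (rec.files and not current.files):
            best[key] = rec
    return list(best.values()) + no_doi


hits = [
    Record("datacite:10.5281/zenodo.1", doi="10.5281/zenodo.1"),
    Record("zenodo:1", doi="https://doi.org/10.5281/ZENODO.1", files=["a.csv"]),
    Record("geo:GSE1"),
]
deduped = dedupe_by_doi(hits)
print([r.id for r in deduped])  # ['zenodo:1', 'geo:GSE1']
```

Normalizing the DOI before comparing matters because the same record can arrive as a bare DOI from one source and as a resolver URL from another.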

⚡ Quickstart

Run with no install:

uvx data-aggregator-mcp

Register with Claude Code:

claude mcp add data-aggregator -- uvx data-aggregator-mcp

A typical agent flow:

search("drought stress RNA-seq", organism="Sorghum bicolor")
  → [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ]   # deduped, taxa-normalized

resolve("sra:SRX079566")
  → DataResource{ files: [ENA FASTQ urls…], access: "open", taxa: [...] }

fetch("sra:SRX079566", dest="./data")
  → ["./data/SRX079566_1.fastq.gz", …]                   # md5-verified

Other ways to run (pip, python -m, raw client config):

pip install data-aggregator-mcp
data-aggregator-mcp        # or: python -m data_aggregator_mcp

Add to a client's MCP config (e.g. Claude Desktop claude_desktop_config.json):

{
  "mcpServers": {
    "data-aggregator": {
      "command": "uvx",
      "args": ["data-aggregator-mcp"],
      "env": { "NCBI_API_KEY": "your-optional-key" }
    }
  }
}
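
For clients that talk to the server directly over stdio, each step of the agent flow above is a JSON-RPC 2.0 `tools/call` request, sent as one JSON object per line. The sketch below only builds those messages; the tool and argument names are taken from the Tools section, while the `tool_call` helper is hypothetical and the MCP initialization handshake that must precede these calls is omitted.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique per session


def tool_call(name, **arguments):
    """Serialize one MCP `tools/call` request as a single JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


requests = [
    tool_call("search", query="drought stress RNA-seq", organism="Sorghum bicolor"),
    tool_call("resolve", id="sra:SRX079566"),
    tool_call("fetch", id="sra:SRX079566", dest="./data"),
]
for line in requests:
    print(line)
```

In practice an MCP client library handles this framing, so the sketch is mainly useful for debugging what goes over the wire.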

🗂️ Sources

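| Source | Discover | Fetch | Checksum |
|---|---|---|---|
| Zenodo | ✅ | ✅ | md5 |
| DataCite → Figshare | ✅ | ✅ | md5 |
| DataCite → Dataverse | ✅ | ✅ | md5 |
| DataCite → OSF | ✅ | ✅ | md5 |
| DataCite → Dryad | ✅ | manifest only¹ | sha-256 (listed) |
| DataCite → Mendeley & others | ✅ | discovery only | |
| NCBI SRA | ✅ | ✅ (ENA FASTQ) | md5 |
| NCBI GEO | ✅ | ✅ (suppl/) | none² |
| NCBI BioProject | ✅ | → SRA links | |
| PubMed / OpenAIRE | ✅ | ✅ (OA full text) | none² |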

¹ Dryad downloads are token / bot-challenge gated, so fetch fails loud; resolve still lists the files.

² No upstream checksum, so fetch verifies content-type instead (rejects an HTML page served in place of a binary).
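
The content-type check for sources without a checksum can be approximated by inspecting the first bytes of the payload. The following is a minimal sketch under stated assumptions: the function names, the marker list, and the `ValueError` are illustrative choices, and the package's real heuristic may differ.

```python
def looks_like_html(head: bytes) -> bool:
    """Return True if the first bytes of a payload look like an HTML document."""
    # Skip a UTF-8 byte-order mark and leading whitespace before probing.
    probe = head[:512].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return probe.startswith((b"<!doctype html", b"<html", b"<head", b"<body"))


def check_declared_binary(head: bytes, declared_type: str) -> None:
    """Raise if a payload declared as non-HTML is actually an HTML page."""
    media_type = declared_type.split(";")[0].strip().lower()
    is_html_type = media_type in ("text/html", "application/xhtml+xml")
    if not is_html_type and looks_like_html(head):
        raise ValueError(f"expected {declared_type}, got an HTML page")


check_declared_binary(b"%PDF-1.7\n...", "application/pdf")  # passes silently
try:
    check_declared_binary(b"<!DOCTYPE html><title>Sign in</title>", "application/pdf")
except ValueError as exc:
    print(exc)
```

Only the head of the stream is needed, so a check like this can run before the rest of the file is written to disk.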

🛠️ Tools

search(query, size?, sources?, organism?)

Fan out across all wired sources in parallel and return compact DataResource records, deduped by DOI. Per-source failures land in errors{} — never silently dropped.
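
One common way to get this behavior is `asyncio.gather` with `return_exceptions=True`, which lets one source fail without cancelling the others. The sketch below assumes each source is an async callable returning a list of hits; the `fan_out` helper and the two fake sources are hypothetical, and the `errors` key simply mirrors the errors{} field named above.

```python
import asyncio


async def fan_out(query, sources):
    """Query every source concurrently; collect failures instead of raising."""
    names = list(sources)
    outcomes = await asyncio.gather(
        *(sources[name](query) for name in names),
        return_exceptions=True,  # a failing source yields its exception as a value
    )
    results, errors = [], {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = f"{type(outcome).__name__}: {outcome}"
        else:
            results.extend(outcome)
    return {"results": results, "errors": errors}


# Hypothetical source coroutines standing in for real HTTP clients.
async def zenodo(query):
    return [f"zenodo:hit-for-{query}"]


async def geo(query):
    raise TimeoutError("upstream timed out")


report = asyncio.run(fan_out("sorghum", {"zenodo": zenodo, "geo": geo}))
print(report)
```

The caller sees both the partial results and a named reason for each source that failed.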

  • organism — expand the query with NCBI-Taxonomy synonyms; the expansion is echoed in taxon_expansion, and results carry normalized taxa[] ({taxid, name}) plus a described_in link to plant-genomics-mcp for plant taxa.
  • sources — restrict the fan-out, e.g. ["omics"].
  • size — max results (1–50).
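
To illustrate the organism option, here is a small sketch of synonym expansion. The synonym table is hard-coded from the example earlier in this README and stands in for a live NCBI Taxonomy lookup; the boolean query syntax and the shape of the returned dictionary are assumptions, since each upstream source has its own query language.

```python
from typing import Optional

# Hard-coded stand-in for an NCBI Taxonomy lookup (assumption for illustration).
SYNONYMS = {
    "orobanche aegyptiaca": ["Orobanche aegyptiaca", "Phelipanche aegyptiaca"],
}


def expand_query(query: str, organism: Optional[str] = None) -> dict:
    """AND the query with an OR-group of every known name for the organism."""
    if not organism:
        return {"query": query, "taxon_expansion": []}
    # Fall back to the name as given when no synonyms are known.
    names = SYNONYMS.get(organism.strip().lower(), [organism])
    clause = " OR ".join(f'"{name}"' for name in names)
    return {"query": f"({query}) AND ({clause})", "taxon_expansion": names}


expanded = expand_query("seed germination", organism="Orobanche aegyptiaca")
print(expanded["query"])
# (seed germination) AND ("Orobanche aegyptiaca" OR "Phelipanche aegyptiaca")
```

Echoing the expansion back alongside the query lets the caller see exactly which names were searched.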

resolve(id)

Full record + files manifest. Routes by id shape — zenodo:7654321, a bare DOI, datacite:10.5061/dryad.x, an omics id (sra:SRX079566, geo:GSE332789, bioproject:PRJNA1468572), or a literature id (pubmed:34320281, openaire:<id>). Attaches, where available:

  • files[] — ENA FASTQ manifest (SRA), GEO suppl/, or the host repo's native manifest (Figshare / Dataverse / OSF / Dryad).
  • links[]: paper → data links. pubmed: → sra: / geo: / bioproject: (NCBI elink); openaire: → datacite: (ScholeXplorer Scholix).
  • access / license — normalized status (open / embargoed / restricted / closed / unknown) and license where the source exposes it.
  • identifiers — normalized {pmid, pmcid, doi}, plus an open-access full-text FileEntry (EuropePMC XML, or an Unpaywall PDF fallback) for papers.
  • citation — pass cite=<format>: bibtex, ris, csl-json, or any CSL style name (apa, mla, vancouver, …). DOI records use content negotiation; others render CSL-JSON from metadata. Off by default; failures degrade quietly.
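
Routing by id shape can be sketched as a prefix split with a fallback for bare DOIs. The prefixes below are the ones listed above; the `route` function, the DOI regular expression, and the returned tuple are illustrative assumptions rather than the server's actual dispatch code.

```python
import re

KNOWN_PREFIXES = {"zenodo", "datacite", "sra", "geo", "bioproject", "pubmed", "openaire"}
# A bare DOI starts with "10.", a registrant code, a slash, then a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def route(identifier: str):
    """Split an id into (source, local_id); a bare DOI routes to 'doi'."""
    identifier = identifier.strip()
    if DOI_RE.match(identifier):
        return "doi", identifier
    prefix, sep, rest = identifier.partition(":")
    if sep and rest and prefix.lower() in KNOWN_PREFIXES:
        return prefix.lower(), rest
    raise ValueError(f"unrecognized id shape: {identifier!r}")


for example in ("zenodo:7654321", "10.5061/dryad.x", "sra:SRX079566", "pubmed:34320281"):
    print(example, "->", route(example))
```

Checking the DOI pattern first keeps a DOI whose suffix happens to contain a colon from being misread as a prefixed id.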

fetch(id, dest?, files?, max_bytes?, force?, extract?)

Download files to disk and return their paths. Streams under a max_bytes guard (force to override) with md5 verification wherever a checksum exists.
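
A streaming download with a size guard and md5 check might look like the sketch below. It reads from any binary file-like object so it runs without network access; the function name, the `.part` temporary file, and the use of `ValueError` are assumptions made for illustration, not the package's API.

```python
import hashlib
import io
import os
import tempfile


def stream_to_disk(src, dest_path, expected_md5=None, max_bytes=None, chunk_size=1 << 16):
    """Copy a binary stream to dest_path in chunks, hashing as it goes."""
    digest = hashlib.md5()
    written = 0
    tmp_path = dest_path + ".part"
    try:
        with open(tmp_path, "wb") as out:
            while chunk := src.read(chunk_size):
                written += len(chunk)
                if max_bytes is not None and written > max_bytes:
                    raise ValueError(f"download exceeds max_bytes={max_bytes}")
                digest.update(chunk)
                out.write(chunk)
        if expected_md5 and digest.hexdigest() != expected_md5.lower():
            raise ValueError("md5 mismatch")
        os.replace(tmp_path, dest_path)  # publish only after verification
    except BaseException:
        # Never leave a partial or unverified file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path


payload = b"ACGT" * 1000
with tempfile.TemporaryDirectory() as tmp:
    path = stream_to_disk(
        io.BytesIO(payload),
        os.path.join(tmp, "reads.fastq"),
        expected_md5=hashlib.md5(payload).hexdigest(),
        max_bytes=10_000,
    )
    print(os.path.getsize(path))  # 4000
```

Writing to a temporary name and renaming at the end means a file at the final path has always passed both checks.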

  • files — restrict to a subset of the resolved manifest.
  • extract — unpack downloaded zip / tar archives in place, guarded against path traversal and runaway extracted size. Off by default.
  • Unverified fetches (GEO suppl/, literature full text) get a content-type sniff that fails loud if a declared binary is actually an HTML page.
  • Fetchable: Zenodo, SRA, GEO, DataCite-hosted Figshare / Dataverse / OSF, and literature open-access full text. Dryad and other DataCite repos are discovery-only and raise FetchNotSupportedError.
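
The extraction guards mentioned for extract can be sketched for zip archives as follows. This is an illustrative approach, not the package's code: the function name and the default size limit are assumptions, and the size check trusts the sizes declared in the archive, so a stricter version would count bytes while writing.

```python
import os
import tempfile
import zipfile


def safe_extract_zip(archive_path, dest_dir, max_total_bytes=1 << 30):
    """Extract a zip, refusing path traversal and oversized payloads."""
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive_path) as zf:
        members = zf.infolist()
        total = sum(m.file_size for m in members)  # declared uncompressed sizes
        if total > max_total_bytes:
            raise ValueError(f"archive expands to {total} bytes, over the limit")
        # Validate every member before writing anything.
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.filename))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.filename!r}")
        return [zf.extract(member, dest_root) for member in members]


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "data.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("tables/counts.tsv", "gene\tcount\n")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    extracted = [os.path.relpath(p, out_dir) for p in safe_extract_zip(archive, out_dir)]
    print(extracted)
```

Python's `zipfile` already strips `..` components when extracting, so the explicit check here exists to fail loud instead of silently relocating a suspicious member.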

list_sources()

Wired sources with their capabilities — layer, kinds, supported filters, fetchability, id examples, auth, and rate limits.

⚙️ Configuration

Both optional, set via environment variables:

  • NCBI_API_KEY — raises the NCBI E-utilities rate limit (3 → 10 req/s) used by the omics, literature, and taxonomy lookups.
  • UNPAYWALL_EMAIL — enables the Unpaywall fallback leg of literature full-text retrieval (the EuropePMC leg works without it).
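
As a rough illustration of how these two variables could shape behavior, the sketch below reads them and derives a minimum spacing between NCBI requests from the 3 and 10 req/s limits stated above. The `load_config` function and its dictionary keys are hypothetical; the server's internal configuration may be organized differently.

```python
import os


def load_config(env=os.environ):
    """Read the two optional settings and derive the NCBI request pacing."""
    api_key = env.get("NCBI_API_KEY") or None
    email = env.get("UNPAYWALL_EMAIL") or None
    rate = 10 if api_key else 3  # requests per second, per the limits above
    return {
        "ncbi_api_key": api_key,
        "ncbi_requests_per_second": rate,
        "ncbi_min_interval_s": 1 / rate,
        "unpaywall_enabled": email is not None,
    }


print(load_config({}))
print(load_config({"NCBI_API_KEY": "your-optional-key", "UNPAYWALL_EMAIL": "you@example.org"}))
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.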

🧪 Develop

uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q   # real-API probes

The README demo (examples/assets/demo.svg) is recorded network-free from examples/_demo_stdio.py — see the header of that file to re-record.

License

MIT — see LICENSE.

Download files

Source distribution: data_aggregator_mcp-0.11.0.tar.gz (136.4 kB)

Built distribution: data_aggregator_mcp-0.11.0-py3-none-any.whl (49.4 kB, Python 3)
