Research-data acquisition MCP — find and fetch datasets across archives, omics registries, and literature

🔎 data-aggregator-mcp

One MCP server to find and fetch research data across archives, omics registries, and literature — behind a single normalized model.

`search` one query across Zenodo, DataCite (Dryad / Figshare / Dataverse / OSF / Mendeley), NCBI omics (GEO / SRA / BioProject), and literature (PubMed / OpenAIRE), with results deduplicated, normalized, and cross-linked. `resolve` any hit to its file manifest, citation, and the data it points at. `fetch` it to disk with checksum verification.

mcp-name: io.github.musharna/data-aggregator-mcp

Demo: a data-aggregator-mcp stdio session running initialize, tools/list (search, resolve, fetch, list_sources), and a live list_sources call showing the four wired sources.

✨ Why this

Most data MCPs wrap a single source. This one unifies them behind four tools and one DataResource model, so an agent searches once and gets back comparable records:

- Multi-domain, one model: generalist archives + raw omics + literature, deduplicated by DOI (the fetchable record wins over bare metadata).
- Taxonomy synonym expansion: organism="Orobanche aegyptiaca" also matches Phelipanche aegyptiaca (NCBI Taxonomy), so a species rename doesn't cost you results.
- Paper → data bridge: resolve a paper and get links to the GEO / SRA / BioProject / DataCite records it produced.
- Verified fetch: streams to disk with md5 verification where the source exposes a checksum, optional archive unpacking, and a fail-loud integrity sniff that rejects an HTML paywall page served as a "PDF".
- Citations, access & full text: render a citation in any CSL style, get normalized access/license, and pull open-access full text, all in one resolve.
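
As a rough illustration of the "fetchable record wins" rule, here is a minimal sketch of DOI-based deduplication. The `Record` class and `dedupe_by_doi` function are hypothetical stand-ins written for this example; they are not the package's DataResource model or its actual merge logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a search hit (not the package's DataResource)."""
    id: str
    doi: str | None = None
    files: list[str] = field(default_factory=list)


def dedupe_by_doi(records: list[Record]) -> list[Record]:
    """Keep one record per DOI, preferring a record that lists fetchable files."""
    best: dict[str, Record] = {}
    without_doi: list[Record] = []
    for rec in records:
        if not rec.doi:
            without_doi.append(rec)  # nothing to merge on, so keep as-is
            continue
        key = rec.doi.lower()  # DOIs are case-insensitive
        current = best.get(key)
        if current is None or (rec.files and not current.files):
            best[key] = rec
    return list(best.values()) + without_doi


hits = [
    Record("datacite:10.5281/zenodo.1", doi="10.5281/ZENODO.1"),
    Record("zenodo:1", doi="10.5281/zenodo.1", files=["data.csv"]),
    Record("geo:GSE1"),
]
unique = dedupe_by_doi(hits)
print([rec.id for rec in unique])
```

The bare DataCite metadata record and the Zenodo record share a DOI, so only the one with a file list survives; the GEO hit has no DOI and passes through untouched.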

⚡ Quickstart

Run with no install:

uvx data-aggregator-mcp

Or register it with Claude Code:

claude mcp add data-aggregator -- uvx data-aggregator-mcp

A typical agent flow:

search("drought stress RNA-seq", organism="Sorghum bicolor")
  → [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ]   # deduped, taxa-normalized

resolve("sra:SRX079566")
  → DataResource{ files: [ENA FASTQ urls…], access: "open", taxa: [...] }

fetch("sra:SRX079566", dest="./data")
  → ["./data/SRX079566_1.fastq.gz", …]                   # md5-verified
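
For clients that speak MCP directly, the flow above corresponds to three `tools/call` requests. The sketch below only builds the JSON-RPC 2.0 messages, one per line as the stdio transport expects; it does not start the server or perform the `initialize` handshake that must come first. The `tool_call` helper is a name invented for this example, and the argument names are taken from the tool signatures in this README.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tool_call(name, **arguments):
    """Serialize one MCP tools/call request as a single JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


requests = [
    tool_call("search", query="drought stress RNA-seq", organism="Sorghum bicolor"),
    tool_call("resolve", id="sra:SRX079566"),
    tool_call("fetch", id="sra:SRX079566", dest="./data"),
]
for line in requests:
    print(line)
```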

Other ways to run (pip, python -m, raw client config)

pip install data-aggregator-mcp
data-aggregator-mcp        # or: python -m data_aggregator_mcp

Add to a client's MCP config (e.g. Claude Desktop claude_desktop_config.json):

{
  "mcpServers": {
    "data-aggregator": {
      "command": "uvx",
      "args": ["data-aggregator-mcp"],
      "env": { "NCBI_API_KEY": "your-optional-key" }
    }
  }
}
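
If the client config already lists other servers, the entry has to be merged rather than pasted over the file. A minimal sketch of that merge follows; `with_data_aggregator` is a hypothetical helper for illustration, and reading or writing the actual config file is left out.

```python
from __future__ import annotations

import json


def with_data_aggregator(config: dict, ncbi_api_key: str | None = None) -> dict:
    """Return a copy of an MCP client config with the data-aggregator entry added."""
    merged = json.loads(json.dumps(config))  # simple deep copy of JSON-shaped data
    entry = {"command": "uvx", "args": ["data-aggregator-mcp"]}
    if ncbi_api_key:
        entry["env"] = {"NCBI_API_KEY": ncbi_api_key}  # optional, see Configuration
    merged.setdefault("mcpServers", {})["data-aggregator"] = entry
    return merged


# "other-server" is a made-up existing entry used only to show the merge.
existing = {"mcpServers": {"other-server": {"command": "other"}}}
updated = with_data_aggregator(existing)
print(json.dumps(updated, indent=2))
```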

🗂️ Sources

Source	Discover	Fetch	Checksum
Zenodo	✅	✅	md5
DataCite → Figshare	✅	✅	md5
DataCite → Dataverse	✅	✅	md5
DataCite → OSF	✅	✅	md5
DataCite → Dryad	✅	manifest only¹	sha-256 (listed)
DataCite → Mendeley & others	✅	—	—
NCBI SRA	✅	✅ (ENA FASTQ)	md5
NCBI GEO	✅	✅ (`suppl/`)	none²
NCBI BioProject	✅	→ SRA links	—
PubMed / OpenAIRE	✅	✅ (OA full text)	none²

¹ Dryad downloads are token / bot-challenge gated, so fetch fails loud; resolve still lists the files.
² No upstream checksum; fetch verifies content-type instead (rejects an HTML page served in place of a binary).
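
The content-type check in footnote ² can be pictured as a sniff of the first bytes of the download. This is a sketch of the general technique under simple assumptions (a short list of HTML markers, a hypothetical `IntegrityError`), not the package's actual implementation.

```python
class IntegrityError(Exception):
    """Hypothetical error type for this sketch."""


HTML_MARKERS = (b"<!doctype html", b"<html")
HTML_TYPES = ("text/html", "application/xhtml+xml")


def looks_like_html(head):
    """True if the leading bytes of a download look like an HTML document."""
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte-order mark
        head = head[3:]
    return head.lstrip()[:64].lower().startswith(HTML_MARKERS)


def sniff(head, declared_type):
    """Fail loud when a file declared as non-HTML actually starts like HTML."""
    if declared_type not in HTML_TYPES and looks_like_html(head):
        raise IntegrityError(f"expected {declared_type}, got an HTML page")


sniff(b"%PDF-1.7\n", "application/pdf")  # a real PDF header passes silently
try:
    sniff(b"\n<!DOCTYPE html><html><body>Please log in</body></html>", "application/pdf")
except IntegrityError as exc:
    print("rejected:", exc)
```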

🛠️ Tools

`search(query, size?, sources?, organism?)`

Fan out across all wired sources in parallel and return compact DataResource records, deduped by DOI. Per-source failures land in errors{} — never silently dropped.
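
One common way to get this behavior is `asyncio.gather(..., return_exceptions=True)`, which lets a failing source report an error without cancelling the others. The sketch below uses stub sources and a hypothetical `fan_out` function; the response shape is an assumption beyond the `errors` key named above.

```python
import asyncio


async def fan_out(query, sources):
    """Query every source concurrently; collect failures instead of raising."""
    names = list(sources)
    outcomes = await asyncio.gather(
        *(sources[name](query) for name in names),
        return_exceptions=True,
    )
    results, errors = [], {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = f"{type(outcome).__name__}: {outcome}"
        else:
            results.extend(outcome)
    return {"results": results, "errors": errors}


# Stub sources for illustration only; real ones would call remote APIs.
async def zenodo_stub(query):
    return [{"id": "zenodo:1", "title": query}]


async def omics_stub(query):
    raise TimeoutError("upstream timed out")


response = asyncio.run(
    fan_out("drought stress", {"zenodo": zenodo_stub, "omics": omics_stub})
)
print(response)
```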

- organism: expand the query with NCBI-Taxonomy synonyms; the expansion is echoed in taxon_expansion, and results carry normalized taxa[] ({taxid, name}) plus a described_in link to plant-genomics-mcp for plant taxa.
- sources: restrict the fan-out, e.g. ["omics"].
- size: max results (1–50).
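
To make the organism expansion concrete, here is a sketch of how a query might be widened once the synonyms are known. The boolean syntax and the `expand_query` helper are assumptions for illustration; each upstream source has its own query language, and the real synonym list would come from an NCBI Taxonomy lookup rather than a hard-coded list.

```python
def expand_query(query, names):
    """OR together the accepted name and its synonyms, then AND with the query."""
    unique = list(dict.fromkeys(names))  # drop duplicates, keep order
    if not unique:
        return query
    clause = " OR ".join(f'"{name}"' for name in unique)
    return f"({query}) AND ({clause})"


# Synonym pair taken from the example earlier in this README.
synonyms = ["Orobanche aegyptiaca", "Phelipanche aegyptiaca"]
expanded = expand_query("seed germination", synonyms)
print(expanded)
```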

`resolve(id)`

Returns the full record plus its files manifest. Routes by id shape: zenodo:7654321, a bare DOI, datacite:10.5061/dryad.x, an omics id (sra:SRX079566, geo:GSE332789, bioproject:PRJNA1468572), or a literature id (pubmed:34320281, openaire:<id>). Attaches, where available:

- files[]: ENA FASTQ manifest (SRA), GEO suppl/, or the host repo's native manifest (Figshare / Dataverse / OSF / Dryad).
- links[]: paper → data links, i.e. pubmed: → sra: / geo: / bioproject: (NCBI elink) and openaire: → datacite: (ScholeXplorer Scholix).
- access / license: normalized status (open / embargoed / restricted / closed / unknown) and license where the source exposes it.
- identifiers: normalized {pmid, pmcid, doi}, plus an open-access full-text FileEntry (EuropePMC XML, or an Unpaywall PDF fallback) for papers.
- citation: pass cite=<format>, where the format is bibtex, ris, csl-json, or any CSL style name (apa, mla, vancouver, …). DOI records use content negotiation; others render CSL-JSON from metadata. Off by default; failures degrade quietly.
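
Routing by id shape can be sketched as a prefix check with a fallback DOI pattern. The `route` function, the `"doi"` label for bare DOIs, and the DOI regular expression are assumptions made for this example; only the prefixes and sample ids come from this README.

```python
from __future__ import annotations

import re

PREFIXES = ("zenodo", "datacite", "sra", "geo", "bioproject", "pubmed", "openaire")
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")  # loose match for a bare DOI


def route(identifier: str) -> tuple[str, str]:
    """Split an id into (source, local id); a bare DOI is labeled 'doi'."""
    prefix, sep, rest = identifier.partition(":")
    if sep and rest and prefix.lower() in PREFIXES:
        return prefix.lower(), rest
    if DOI_PATTERN.match(identifier):
        return "doi", identifier
    raise ValueError(f"unrecognized id shape: {identifier!r}")


for example in ("zenodo:7654321", "10.5061/dryad.x", "sra:SRX079566", "pubmed:34320281"):
    print(example, "->", route(example))
```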

`fetch(id, dest?, files?, max_bytes?, force?, extract?)`

Download files to disk and return their paths. Streams under a max_bytes guard (force to override) with md5 verification wherever a checksum exists.
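
The streaming pattern described here (write in chunks, enforce a size cap, hash as you go) can be sketched with the standard library alone. `stream_to_disk` and the two error classes are hypothetical names for this example; the in-memory stream stands in for an HTTP response body.

```python
import hashlib
import io
import os
import tempfile


class TooLargeError(Exception):
    """Hypothetical error: the stream exceeded max_bytes."""


class ChecksumMismatchError(Exception):
    """Hypothetical error: the md5 of the written bytes did not match."""


def stream_to_disk(src, dest_path, expected_md5=None, max_bytes=None, chunk_size=65536):
    """Copy a binary stream to dest_path in chunks, hashing as it goes."""
    digest = hashlib.md5()
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            written += len(chunk)
            if max_bytes is not None and written > max_bytes:
                raise TooLargeError(f"download exceeds {max_bytes} bytes")
            digest.update(chunk)
            out.write(chunk)
    if expected_md5 is not None and digest.hexdigest() != expected_md5.lower():
        raise ChecksumMismatchError(f"md5 mismatch for {dest_path}")
    return written


payload = b"example bytes" * 1000
with tempfile.TemporaryDirectory() as tmp:
    size = stream_to_disk(
        io.BytesIO(payload),
        os.path.join(tmp, "file.bin"),
        expected_md5=hashlib.md5(payload).hexdigest(),
    )
print(size)
```

This sketch leaves a partial file behind when it raises; a fuller version would delete it before propagating the error.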

- files: restrict to a subset of the resolved manifest.
- extract: unpack downloaded zip / tar archives in place, guarded against path traversal and runaway extracted size. Off by default.
- Unverified fetches (GEO suppl/, literature full text) get a content-type sniff that fails loud if a declared binary is actually an HTML page.
- Fetchable: Zenodo, SRA, GEO, DataCite-hosted Figshare / Dataverse / OSF, and literature open-access full text. Dryad and other DataCite repos are discovery-only and raise FetchNotSupportedError.
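
The two extraction guards mentioned for extract (path traversal and runaway size) can be illustrated for zip archives as follows. This is a sketch of the general technique, with a hypothetical `safe_extract_zip` function and `UnsafeArchiveError`; it trusts the sizes declared in the archive headers, which a stricter implementation would re-check while writing.

```python
import os
import tempfile
import zipfile


class UnsafeArchiveError(Exception):
    """Hypothetical error type for this sketch."""


def safe_extract_zip(archive_path, dest_dir, max_total_bytes=1 << 30):
    """Extract a zip only if every member stays inside dest_dir and the total is bounded."""
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive_path) as zf:
        members = zf.infolist()
        total = sum(member.file_size for member in members)  # declared sizes
        if total > max_total_bytes:
            raise UnsafeArchiveError(f"archive expands to {total} bytes")
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.filename))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise UnsafeArchiveError(f"member escapes destination: {member.filename}")
        zf.extractall(dest_root)
        return [os.path.join(dest_root, member.filename) for member in members]


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "bundle.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("readme.txt", "hello")
    paths = safe_extract_zip(archive, os.path.join(tmp, "out"))
    names = [os.path.basename(path) for path in paths]
print(names)
```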

`list_sources()`

Wired sources with their capabilities — layer, kinds, supported filters, fetchability, id examples, auth, and rate limits.

⚙️ Configuration

Both optional, set via environment variables:

- NCBI_API_KEY: raises the NCBI E-utilities rate limit (3 → 10 req/s) used by the omics, literature, and taxonomy lookups.
- UNPAYWALL_EMAIL: enables the Unpaywall fallback leg of literature full-text retrieval (the EuropePMC leg works without it).
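
A minimal sketch of how these two variables could translate into runtime settings is shown below. The `load_settings` function and the keys of the returned dict are hypothetical; only the variable names and the 3 and 10 req/s figures come from the text above.

```python
import os


def load_settings(env=None):
    """Derive rate limit and Unpaywall availability from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("NCBI_API_KEY") or None
    email = env.get("UNPAYWALL_EMAIL") or None
    return {
        "ncbi_api_key": api_key,
        "ncbi_requests_per_second": 10 if api_key else 3,
        "unpaywall_enabled": email is not None,
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"NCBI_API_KEY": "your-optional-key"})
print(settings)
```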

🧪 Develop

uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q   # real-API probes

The README demo (examples/assets/demo.svg) is recorded network-free from examples/_demo_stdio.py — see the header of that file to re-record.

License

MIT — see LICENSE.

Project details

Maintainer: musharna

Version: 0.11.0 (released May 30, 2026)

Download files

Source Distribution

data_aggregator_mcp-0.11.0.tar.gz (136.4 kB), uploaded May 30, 2026, Source

Built Distribution

data_aggregator_mcp-0.11.0-py3-none-any.whl (49.4 kB), uploaded May 30, 2026, Python 3

File details

Details for the file data_aggregator_mcp-0.11.0.tar.gz.

File metadata

Download URL: data_aggregator_mcp-0.11.0.tar.gz
Upload date: May 30, 2026
Size: 136.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for data_aggregator_mcp-0.11.0.tar.gz
Algorithm	Hash digest
SHA256	`78087856afcc0f11c6e1966577d813ee557066e6462a10151d75800c11cb110e`
MD5	`526437873e6f2a29f83814701243eb35`
BLAKE2b-256	`d04a908b9d0dfd66c4cfc77638c3edb227313fdbc02e15f6eff4dbb739268aa5`

Provenance

The following attestation bundles were made for data_aggregator_mcp-0.11.0.tar.gz:

Publisher: publish.yml on musharna/data-aggregator-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: data_aggregator_mcp-0.11.0.tar.gz
- Subject digest: 78087856afcc0f11c6e1966577d813ee557066e6462a10151d75800c11cb110e
- Sigstore transparency entry: 1674233592
- Sigstore integration time: May 30, 2026
Source repository:
- Permalink: musharna/data-aggregator-mcp@84ee063bdc515fdd78c5dcc8045cd7822b8c3fa7
- Branch / Tag: refs/tags/v0.11.0
- Owner: https://github.com/musharna
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@84ee063bdc515fdd78c5dcc8045cd7822b8c3fa7
- Trigger Event: release

File details

Details for the file data_aggregator_mcp-0.11.0-py3-none-any.whl.

File metadata

Download URL: data_aggregator_mcp-0.11.0-py3-none-any.whl
Upload date: May 30, 2026
Size: 49.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for data_aggregator_mcp-0.11.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0bab6232fefc4aace468ed8372fdeb4bf347ba6f68ed2bd912d4146bf0d0970c`
MD5	`09538ee55ddcb72b3b13b2353b3b2903`
BLAKE2b-256	`cd499629c1ddc4909961b5c8ede6083a07d11089bad93d12e22cd5e480a6d4dd`

Provenance

The following attestation bundles were made for data_aggregator_mcp-0.11.0-py3-none-any.whl:

Publisher: publish.yml on musharna/data-aggregator-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: data_aggregator_mcp-0.11.0-py3-none-any.whl
- Subject digest: 0bab6232fefc4aace468ed8372fdeb4bf347ba6f68ed2bd912d4146bf0d0970c
- Sigstore transparency entry: 1674233602
- Sigstore integration time: May 30, 2026
Source repository:
- Permalink: musharna/data-aggregator-mcp@84ee063bdc515fdd78c5dcc8045cd7822b8c3fa7
- Branch / Tag: refs/tags/v0.11.0
- Owner: https://github.com/musharna
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@84ee063bdc515fdd78c5dcc8045cd7822b8c3fa7
- Trigger Event: release
