Web research MCP server: search, fetch, academic, twitter, and compound research tools

Maintainers

phimonic

mcp-research

MCP server for web research, academic papers, Twitter/X, YouTube, and file ingestion. It exposes eight tools to AI assistants over the MCP stdio transport, and includes a credential vault for institutional access, CAPTCHA detection, and token-efficient output.

Tools

Tool	Description
`web_search`	3-tier search cascade: Brave API → DuckDuckGo → HTML scraper
`fetch_url`	Fetch any URL → clean markdown, with SSRF protection and 24h cache
`research`	Compound pipeline: query rewrite → search → parallel fetch → summarize → synthesize
`youtube_essence`	YouTube video → transcript, summary, key points, chapters, quotes
`deep_ingest`	Extract text from files: PDF, DOCX, XLSX, PPTX, audio, video, images
`academic_lookup`	Resolve DOI / ArXiv / PubMed → metadata + full text via institutional access
`twitter_extract`	Extract tweets and threads from X.com/Twitter
`vault_status`	Show loaded credential profiles and dependency status (never exposes secrets)

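All tools are read-only with respect to the sources they access: they fetch and transform content and never modify it. The only local writes are the fetch cache and search logs described under Configuration.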

Install

pip install mcp-research

Or run directly with uvx (zero-install):

uvx mcp-research

Optional extras:

pip install 'mcp-research[twitter]'    # yt-dlp for Twitter extraction
pip install 'mcp-research[youtube]'    # yt-dlp + faster-whisper for YouTube
pip install 'mcp-research[academic]'   # PyPDF2 for academic PDFs
pip install 'mcp-research[ingest]'     # PDF, DOCX, XLSX, PPTX, audio support
pip install 'mcp-research[all]'        # everything

Check your setup:

mcp-research doctor

Usage with Claude Code

Add to your Claude Code MCP config (~/.claude/settings.json or project .mcp.json):

{
  "mcpServers": {
    "research": {
      "command": "uvx",
      "args": ["mcp-research"],
      "env": {
        "BRAVE_API_KEY": "BSA...",
        "OLLAMA_URL": "http://localhost:11434"
      }
    }
  }
}

Usage with Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "research": {
      "command": "uvx",
      "args": ["mcp-research"],
      "env": {
        "BRAVE_API_KEY": "BSA..."
      }
    }
  }
}

Configuration

All configuration is via environment variables — no config files needed (except the optional vault).

Variable	Default	Description
`BRAVE_API_KEY`	(empty)	Brave Search API key. Falls back to DuckDuckGo if unset.
`OLLAMA_URL`	`http://localhost:11434`	Ollama endpoint for summarization/synthesis. Set to an empty string to disable.
`OLLAMA_MODEL`	`qwen2.5:14b`	Model to use for summarization and synthesis.
`MCP_RESEARCH_CACHE_DIR`	`~/.mcp-research/cache/`	URL fetch cache directory.
`MCP_RESEARCH_CACHE_TTL`	`24`	Cache TTL in hours.
`MCP_RESEARCH_LOG_DIR`	`~/.mcp-research/logs/`	Search log directory (NDJSON).
`MCP_RESEARCH_MAX_RESULTS`	`10`	Default max search results.
`MCP_RESEARCH_VAULT_FILE`	`~/.mcp-research/vault.yaml`	Credential vault file path.
`MCP_RESEARCH_VAULT_HOT_RELOAD`	`true`	Auto-reload vault when file changes.
`MCP_RESEARCH_SESSION_TTL`	`1800`	Session idle timeout in seconds.
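
As a rough illustration of environment-only configuration, the sketch below reads a few of the variables from the table and applies the documented defaults. The `Settings` class and `load_settings` function are hypothetical names chosen for this example, not part of the package's API.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    brave_api_key: str
    ollama_url: str
    ollama_model: str
    cache_dir: Path
    cache_ttl_hours: int
    max_results: int


def load_settings(env=None):
    """Build Settings from an environment mapping, using the documented defaults."""
    env = os.environ if env is None else env
    cache_dir = env.get("MCP_RESEARCH_CACHE_DIR", "~/.mcp-research/cache/")
    return Settings(
        brave_api_key=env.get("BRAVE_API_KEY", ""),
        ollama_url=env.get("OLLAMA_URL", "http://localhost:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "qwen2.5:14b"),
        cache_dir=Path(os.path.expanduser(cache_dir)),
        cache_ttl_hours=int(env.get("MCP_RESEARCH_CACHE_TTL", "24")),
        max_results=int(env.get("MCP_RESEARCH_MAX_RESULTS", "10")),
    )


# An explicitly empty OLLAMA_URL is kept as empty, which the table describes
# as the way to disable summarization.
settings = load_settings({"OLLAMA_URL": ""})
print(repr(settings.ollama_url), settings.ollama_model)
```

Passing a plain dict instead of `os.environ` keeps the example easy to test without touching the real environment.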

Tool Details

`web_search`

web_search(query, max_results=5, summarize=False, auto_fetch_top=False)

Searches the web using a 3-tier cascade for maximum reliability:

Brave Search API — fast, high quality (requires BRAVE_API_KEY)
DuckDuckGo library — no API key needed, retries on rate limit
DuckDuckGo HTML scraper — last-resort fallback

Options:

summarize: Use Ollama to summarize results (requires running Ollama)
auto_fetch_top: Also fetch and return the full content of the top result
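
The three-tier cascade described above can be pictured as an ordered list of search functions, where the first tier that returns results wins. This is a minimal sketch of that fallback pattern only: `cascade_search` and the three stand-in tier functions are hypothetical and do not call Brave or DuckDuckGo.

```python
from typing import Callable, Iterable

SearchFn = Callable[[str, int], list]


def cascade_search(query: str, max_results: int, tiers: Iterable[tuple[str, SearchFn]]):
    """Try each tier in order; return (tier_name, results) from the first that yields results."""
    problems = []
    for name, search in tiers:
        try:
            results = search(query, max_results)
        except Exception as exc:  # a failing tier should not abort the cascade
            problems.append(f"{name}: {exc}")
            continue
        if results:
            return name, results[:max_results]
        problems.append(f"{name}: no results")
    raise RuntimeError("all search tiers failed: " + "; ".join(problems))


# Hypothetical stand-ins for the three tiers.
def brave(query, n):
    raise PermissionError("BRAVE_API_KEY not set")


def ddg_library(query, n):
    return []  # e.g. rate limited, nothing came back


def ddg_html(query, n):
    return [{"title": f"Result for {query}", "url": "https://example.com"}]


tier, hits = cascade_search(
    "mcp protocol", 5,
    [("brave", brave), ("ddg", ddg_library), ("ddg_html", ddg_html)],
)
print(tier, len(hits))
```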

`fetch_url`

fetch_url(url, summarize=False, max_chars=15000)

Fetches a URL and converts it to clean markdown:

SSRF protection: Blocks localhost, private IPs, non-HTTP schemes
Smart retry: Exponential backoff on 429/5xx, per-hop redirect validation
24h cache: SHA-256 keyed, configurable TTL
Content support: HTML → markdown, JSON → code block, binary → rejected
Smart truncation: Breaks at heading/paragraph boundaries, not mid-text
CAPTCHA detection: Flags Cloudflare, hCaptcha, reCAPTCHA, Akamai walls
Token-efficient: Default 15K chars (~4K tokens), adjustable via max_chars
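
To make the idea of truncating at heading or paragraph boundaries concrete, here is one possible approach. It is an assumption about how such truncation could work, not the package's actual algorithm; `truncate_markdown` and the `[truncated]` notice are invented for this sketch.

```python
def truncate_markdown(text: str, max_chars: int = 15000) -> str:
    """Keep at most max_chars of content, preferring to cut at a heading or paragraph break."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Prefer the start of the last heading, then the last blank line, then the last newline.
    for marker in ("\n#", "\n\n", "\n"):
        cut = window.rfind(marker)
        if cut > max_chars // 2:  # do not throw away most of the budget
            return window[:cut].rstrip() + "\n\n[truncated]"
    # No suitable boundary found: fall back to a hard cut.
    return window.rstrip() + "\n\n[truncated]"


doc = "# Title\n\n" + "a" * 40 + "\n\n" + "b" * 40 + "\n\n## Next\n\n" + "c" * 40
print(truncate_markdown(doc, max_chars=100))
```

In this example the cut lands just before the `## Next` heading instead of in the middle of it.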

`research`

research(query, depth="standard", context="")

Compound research pipeline:

Query rewrite — Ollama optimizes your question into search keywords
Web search — finds relevant pages (with zero-result retry expansion)
Parallel fetch — fetches top N pages concurrently
Summarize — Ollama summarizes each page
Synthesize — Ollama produces a final cited answer

Depth levels:

Depth	Pages	Synthesis
`quick`	2	No
`standard`	5	Yes
`deep`	10	Yes

All steps gracefully degrade without Ollama — you still get search results and page content.
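
The five steps and the depth table can be sketched as a single function that takes its dependencies as arguments. Everything here is illustrative: `run_research`, the prompt strings, and the `search`, `fetch`, and `llm` callables are assumptions, and passing `llm=None` stands in for running without Ollama.

```python
from concurrent.futures import ThreadPoolExecutor

# Pages fetched and whether synthesis runs, taken from the depth table above.
DEPTHS = {"quick": (2, False), "standard": (5, True), "deep": (10, True)}


def run_research(query, depth, search, fetch, llm=None):
    """Sketch of the pipeline; search, fetch, and llm are caller-supplied callables."""
    pages, synthesize = DEPTHS[depth]
    # Step 1: query rewrite (skipped when no LLM is available).
    keywords = llm(f"Rewrite as search keywords: {query}") if llm else query
    # Step 2: search, keeping only as many URLs as the depth allows.
    urls = search(keywords)[:pages]
    # Step 3: fetch concurrently; pool.map preserves the order of urls.
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        contents = list(pool.map(fetch, urls))
    # Step 4: per-page summaries (None when degraded).
    sources = []
    for url, content in zip(urls, contents):
        summary = llm(f"Summarize: {content}") if llm else None
        sources.append({"url": url, "content": content, "summary": summary})
    # Step 5: synthesis only for depths that request it.
    answer = None
    if llm and synthesize:
        answer = llm("Synthesize a cited answer from: " + " | ".join(s["summary"] for s in sources))
    return {"query": keywords, "sources": sources, "answer": answer}


def fake_search(q):
    return [f"https://example.com/{i}" for i in range(8)]


def fake_fetch(url):
    return f"body of {url}"


result = run_research("what is MCP?", "standard", fake_search, fake_fetch)
print(len(result["sources"]), result["answer"])
```

Without an `llm`, the result still carries the search hits and page content, which mirrors the degraded behaviour described above.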

`youtube_essence`

youtube_essence(url, mode="standard")

Extracts structured content from YouTube videos:

Transcript: Auto-subtitles or Whisper transcription (local, private)
Summary: AI summary via Ollama
Key points: Bullet-point takeaways
Chapters: Timestamped segments
Quotes: Notable quotations (deep mode)

Modes: quick (TL;DR), standard (+ chapters), deep (+ quotes)

Requires yt-dlp. Optional: faster-whisper for transcribing videos that have no subtitles, ffmpeg for media extraction.

`deep_ingest`

deep_ingest(path, include_types="", max_files=200, summarize=False)

Extracts text from files in a directory or single file:

Text files: .txt, .md, .json, .csv, source code, etc.
PDF: Via PyPDF2 (optional dependency)
Office: .docx, .xlsx, .pptx (optional dependencies)
Audio/Video: Whisper transcription (optional)
Images: OCR via Ollama vision model (optional)

Type filter: text, pdf, audio, video, image, office
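
A type filter like this can be sketched as a lookup from file extension to one of the six type names. The text, PDF, and office extensions come from the list above; the audio, video, and image extensions, the comma-separated format of `include_types`, and the helper names are assumptions made for this example.

```python
from pathlib import Path

TYPE_BY_EXTENSION = {
    ".txt": "text", ".md": "text", ".json": "text", ".csv": "text",
    ".pdf": "pdf",
    ".docx": "office", ".xlsx": "office", ".pptx": "office",
    # Assumed extensions, for illustration only:
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".mkv": "video",
    ".png": "image", ".jpg": "image",
}


def classify(path):
    """Return the type name for a path, or None if the extension is unknown."""
    return TYPE_BY_EXTENSION.get(Path(path).suffix.lower())


def select_files(paths, include_types="", max_files=200):
    """Filter paths by type (assumed comma-separated) and cap the count at max_files."""
    wanted = {t.strip() for t in include_types.split(",") if t.strip()}
    selected = []
    for path in paths:
        kind = classify(path)
        if kind is None or (wanted and kind not in wanted):
            continue
        selected.append((str(path), kind))
        if len(selected) >= max_files:
            break
    return selected


print(select_files(["a.pdf", "b.md", "c.PPTX", "d.bin"], include_types="pdf,office"))
```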

`academic_lookup`

academic_lookup(identifier, fetch_fulltext=True)

Resolves academic papers from multiple identifier types:

DOI: 10.xxxx/... → Crossref metadata + publisher redirect
ArXiv: 2301.12345 → abstract + PDF
PubMed: PMID → E-utilities metadata → DOI chain
URL: Publisher page detection

Full text access via credential vault:

EZproxy rewriting (prefix and suffix modes)
Bearer token, API key, basic auth, cookie jar
Automatic publisher detection (IEEE, Springer, Elsevier, ACM, Wiley, Nature, JSTOR, etc.)
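
Telling the identifier types apart is mostly pattern matching. The sketch below shows one way to do it with regular expressions; the patterns are simplified assumptions (for example, old-style arXiv IDs such as `hep-th/9901001` are not handled) and `classify_identifier` is a hypothetical helper, not the package's resolver.

```python
import re

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV_RE = re.compile(r"^(?:arxiv:)?(\d{4}\.\d{4,5})(?:v\d+)?$", re.IGNORECASE)
PMID_RE = re.compile(r"^(?:pmid:)?\s*(\d{1,8})$", re.IGNORECASE)
DOI_URL_PREFIX = re.compile(r"^https?://(?:dx\.)?doi\.org/", re.IGNORECASE)


def classify_identifier(identifier):
    """Return (kind, normalized_value), where kind is doi, arxiv, pubmed, url, or unknown."""
    value = identifier.strip()
    # Treat doi.org resolver links as plain DOIs.
    doi_candidate = DOI_URL_PREFIX.sub("", value)
    if DOI_RE.match(doi_candidate):
        return ("doi", doi_candidate)
    match = ARXIV_RE.match(value)
    if match:
        return ("arxiv", match.group(1))
    match = PMID_RE.match(value)
    if match:
        return ("pubmed", match.group(1))
    if value.lower().startswith(("http://", "https://")):
        return ("url", value)
    return ("unknown", value)


for example in ["10.1109/5.771073", "arXiv:2301.12345v2", "PMID: 12345678", "https://example.org/paper"]:
    print(classify_identifier(example))
```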

`twitter_extract`

twitter_extract(url, include_thread=False)

Extracts tweets and threads from X.com/Twitter using a strategy cascade:

yt-dlp (primary) — works with cookie jar for authenticated access
Twitter API v2 — if bearer token configured in vault
HTML fetch — cookie-based last resort

Returns: text, author, timestamp, metrics (likes, retweets, replies), media URLs.

`vault_status`

vault_status()

Shows loaded credential profiles, match patterns, and auth types — never exposes secrets. Also checks availability of optional dependencies.

Credential Vault

Create ~/.mcp-research/vault.yaml to configure authentication for protected sources:

version: 1
profiles:
  # University EZproxy for IEEE
  ieee-university:
    match: "*.ieee.org/**"
    ezproxy:
      base_url: "https://ezproxy.myuniversity.edu/login?url="
      mode: prefix

  # Springer via API key
  springer:
    match: "*.springer.com/**"
    auth:
      type: api_key
      header: "X-ApiKey"
      value: "${SPRINGER_API_KEY}"

  # X.com via browser cookies
  twitter:
    match: "*.x.com/**"
    auth:
      type: cookie_jar
      path: "${HOME}/.mcp-research/cookies/twitter.txt"

${VAR} references are resolved from environment variables, so secrets do not need to be stored in the vault file in plain text
First matching profile wins (order matters)
Auth types: bearer, basic, api_key, cookie_jar, headers
EZproxy modes: prefix (prepend base URL) or suffix (domain rewriting)
Hot-reload: vault file changes are picked up automatically
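
The three vault behaviours listed above (environment interpolation, first-match-wins glob matching, and EZproxy prefix rewriting) can be sketched with the standard library. The exact glob semantics of the real matcher are not specified here, so this example assumes the pattern is matched against host plus path using `fnmatch`; all function names are hypothetical, and a plain dict stands in for the parsed YAML.

```python
import fnmatch
import os
import re

_VAR_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value, env=None):
    """Replace ${VAR} references with values from env; fail loudly if one is missing."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR_RE.sub(replace, value)


def find_profile(url, profiles):
    """Return (name, profile) for the first profile whose glob matches host + path."""
    target = re.sub(r"^https?://", "", url)
    for name, profile in profiles.items():  # dicts preserve insertion order
        if fnmatch.fnmatchcase(target, profile["match"]):
            return name, profile
    return None


def apply_ezproxy_prefix(url, base_url):
    """Prefix mode: prepend the proxy login URL to the target URL."""
    return base_url + url


profiles = {
    "ieee-university": {
        "match": "*.ieee.org/**",
        "ezproxy": {"base_url": "https://ezproxy.myuniversity.edu/login?url=", "mode": "prefix"},
    },
    "springer": {"match": "*.springer.com/**"},
}

name, profile = find_profile("https://ieeexplore.ieee.org/document/1", profiles)
print(name, apply_ezproxy_prefix("https://ieeexplore.ieee.org/document/1", profile["ezproxy"]["base_url"]))
```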

Token Efficiency

All tools produce compact output by default to avoid wasting AI context window tokens:

Tool	Default output	Override
`fetch_url`	~15K chars (~4K tokens)	`max_chars` parameter
`research`	~500 tokens per source	Prefers summaries over raw content
`academic_lookup`	~10K chars full text	Truncates with notice
`deep_ingest`	15 files, 300 char excerpts	`max_files` parameter
`youtube_essence`	3K char transcript excerpt	Full transcript in result object
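
The character and token figures in the table are consistent with the common rule of thumb of roughly four characters per token for English text. The snippet below shows that arithmetic only; the ratio is a heuristic assumption, and real tokenizers will give different counts.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text; real tokenizers vary


def estimate_tokens(char_count: int) -> int:
    """Estimate the token count for a given number of characters, rounding up."""
    return -(-char_count // CHARS_PER_TOKEN)


def chars_for_budget(token_budget: int) -> int:
    """Estimate how many characters fit in a token budget."""
    return token_budget * CHARS_PER_TOKEN


print(estimate_tokens(15000))   # the fetch_url default of 15K chars
print(chars_for_budget(500))    # the per-source budget quoted for research
```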

Safety & Robustness

SSRF protection: Blocks localhost, private IPs, link-local, non-HTTP schemes on every hop
CAPTCHA detection: Identifies Cloudflare, hCaptcha, reCAPTCHA, Akamai, DDoS-Guard walls
Input validation: Size limits, URL validation, safe redirect following
No eval/exec: No dynamic code execution
Vault security: Secrets resolved from env vars, repr() redacts all auth values
Cache isolation: Owner-only directory permissions (0o700)
Graceful degradation: Missing optional deps don't crash — features degrade with clear messages
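
For readers unfamiliar with SSRF checks, the sketch below shows the general shape of one: reject non-HTTP schemes, reject localhost, then resolve the host and reject any private, loopback, or link-local address. It is an illustration under assumptions, not the package's code; `check_url` and `default_resolve` are hypothetical names, and a real fetcher would need to repeat the check on every redirect hop.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_address(address: str) -> bool:
    """True for addresses that should never be reached from a fetch tool."""
    ip = ipaddress.ip_address(address)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def default_resolve(host):
    """Resolve a host name to a sorted list of IP address strings."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def check_url(url, resolve=None):
    """Return None if the URL may be fetched, otherwise a short reason string."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return f"scheme {parts.scheme!r} is not allowed"
    host = parts.hostname
    if not host:
        return "URL has no host"
    if host == "localhost" or host.endswith(".localhost"):
        return "localhost is blocked"
    try:
        addresses = (resolve or default_resolve)(host)
    except OSError as exc:
        return f"cannot resolve {host}: {exc}"
    for address in addresses:
        if is_blocked_address(address):
            return f"{host} resolves to blocked address {address}"
    return None


# A stub resolver keeps the example offline; IP literals resolve to themselves.
literal = lambda host: [host]
print(check_url("http://169.254.169.254/latest/meta-data", resolve=literal))
print(check_url("file:///etc/passwd"))
```

Checking the resolved addresses, and not only the host name, is what catches public-looking names that point at internal hosts.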

CLI

mcp-research serve                          # Run MCP stdio server (default)
mcp-research search "query"                 # Search the web
mcp-research fetch https://example.com      # Fetch URL to markdown
mcp-research youtube https://youtu.be/...   # Extract YouTube video
mcp-research ingest ./docs/                 # Extract text from files
mcp-research academic "10.1109/..."         # Resolve academic paper
mcp-research tweet https://x.com/.../123    # Extract tweet
mcp-research vault                          # Show vault profiles
mcp-research doctor                         # Check dependencies

Development

git clone https://github.com/MABAAM/Maibaamcrawler.git
cd Maibaamcrawler
pip install -e ".[all]"
pytest tests/ -v
python -m mcp_research

Changelog

v0.3.0

Credential vault: YAML config at ~/.mcp-research/vault.yaml with env var interpolation, glob URL matching, EZproxy rewriting, hot-reload
Session pooling: Per-domain sessions with vault auth injection, cookie jar support, idle eviction
CAPTCHA detection: Identifies Cloudflare, hCaptcha, reCAPTCHA, Akamai, DDoS-Guard, generic bot walls
Academic lookup: DOI/ArXiv/PubMed resolution, Crossref metadata, institutional full text access via vault
Twitter/X extraction: yt-dlp, API v2, and cookie-based access with thread support
Token efficiency: Default output caps (~4K tokens for fetch, ~500 per research source) to preserve AI context
Doctor command: mcp-research doctor checks all dependencies and configuration
Windows encoding fix: UTF-8 stdout/stderr wrapper prevents cp1252 crashes

v0.2.0

YouTube essence: Transcript extraction, AI summary, key points, chapters, quotes
Deep ingest: PDF, DOCX, XLSX, PPTX, audio, video, image text extraction
Ollama integration: Query rewriting, summarization, synthesis, vision OCR
Search logging: NDJSON event log for all operations
Brave Search: Primary search tier with API key support

v0.1.0

Initial release: 3 tools (web_search, fetch_url, research), SSRF protection, caching

License

MIT

Release history

0.3.0 (this version): Apr 9, 2026
0.1.0: Apr 8, 2026

Download files

Source Distribution

mcp_research-0.3.0.tar.gz (355.1 kB)

Uploaded Apr 9, 2026 Source

Built Distribution

mcp_research-0.3.0-py3-none-any.whl (48.9 kB)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file mcp_research-0.3.0.tar.gz.

File metadata

Download URL: mcp_research-0.3.0.tar.gz
Upload date: Apr 9, 2026
Size: 355.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcp_research-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`9c0bbc79448b74a57b537e22af109234a8302f61a19d8c343320c72ffa871d5b`
MD5	`09254a5201e07a8276a6743b834a7f39`
BLAKE2b-256	`4725914a63d05339b6c8365ff6b2683b196bab37dfdb8b0ec1c6872410a71020`

Provenance

The following attestation bundles were made for mcp_research-0.3.0.tar.gz:

Publisher: publish.yml on MABAAM/Maibaamcrawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcp_research-0.3.0.tar.gz
- Subject digest: 9c0bbc79448b74a57b537e22af109234a8302f61a19d8c343320c72ffa871d5b
- Sigstore transparency entry: 1262560722
- Sigstore integration time: Apr 9, 2026
Source repository:
- Permalink: MABAAM/Maibaamcrawler@5042df0391edff09dd928912ca1bb952fce54801
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/MABAAM
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5042df0391edff09dd928912ca1bb952fce54801
- Trigger Event: release

File details

Details for the file mcp_research-0.3.0-py3-none-any.whl.

File metadata

Download URL: mcp_research-0.3.0-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 48.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcp_research-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`40fd2a77e94ce52a1a028ff4133580bd4a5a12b07242f0d9b6f25e49c3345f42`
MD5	`6ab2c5f4441fa6a9fa21e824cc44fd4f`
BLAKE2b-256	`3872a7b9f0dfe7929087b2f8574d1f49d10facb5d6728edaad05df5c69b1230d`

Provenance

The following attestation bundles were made for mcp_research-0.3.0-py3-none-any.whl:

Publisher: publish.yml on MABAAM/Maibaamcrawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcp_research-0.3.0-py3-none-any.whl
- Subject digest: 40fd2a77e94ce52a1a028ff4133580bd4a5a12b07242f0d9b6f25e49c3345f42
- Sigstore transparency entry: 1262560733
- Sigstore integration time: Apr 9, 2026
Source repository:
- Permalink: MABAAM/Maibaamcrawler@5042df0391edff09dd928912ca1bb952fce54801
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/MABAAM
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5042df0391edff09dd928912ca1bb952fce54801
- Trigger Event: release
