An MCP server built on crawl4ai for reliable webpage extraction

crawl4ai-mcp

A minimal MCP server for agent-friendly web extraction and search.

Two tools: fetch real pages with Playwright + Crawl4AI, or search across 7 engines with automatic fallback.

Quick entry

Audience	Read this
Human developer	README.zh-CN.md / README.md
Living in the AI era, delegating your remaining sanity to an agent	README_AGENT.md

At a glance

Item	Reality in this repo
MCP tools	2 tools: `fetch_urls` + `search_web`
Single-page fetch	`urls: ["https://example.com"]`
Web search	`search_web(query="...", engine="auto")` — 7 engines, auto fallback
Search engines	DuckDuckGo · Bing · Google · Yandex · Sogou · 360Search · Baidu
Output	`title + content + links + blocked + llm_used/llm_error`
Non-LLM mode	First-class, default, usable without any model
LLM mode	Off by default. Enabled only with `use_llm=true` + optional `llm_instruction`
Fallback	Missing/failed LLM call automatically falls back to non-LLM result
Anti-bot realism	proxy / cookies / persistent profile / randomized browser behavior
License	AGPL-3.0-or-later

How it works

Fetch flow:

flowchart LR
    A[URL list] --> B[Playwright + Crawl4AI]
    B --> C{Fast path enough?}
    C -- Yes --> D[Markdown / HTML]
    C -- No --> E[Stronger fallback]
    E --> D
    D --> F{use_llm?}
    F -- No --> G[Return result]
    F -- Yes --> H[OpenAI-compatible cleanup]
    H --> I{LLM success?}
    I -- Yes --> J[Return enhanced result]
    I -- No --> G
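
The first half of the fetch flow (fast path, then a stronger path when the content is too thin) can be sketched as follows. This is a minimal illustration, not the server's actual code: `fast`, `strong`, and the `min_chars` threshold are hypothetical stand-ins for the two extraction strategies and for whatever "too thin" means in practice.

```python
from typing import Callable, Tuple


def extract_with_fallback(
    url: str,
    fast: Callable[[str], str],
    strong: Callable[[str], str],
    min_chars: int = 200,  # hypothetical "too thin" threshold
) -> Tuple[str, str]:
    """Return (content, path_used), escalating only when the fast path is thin."""
    content = fast(url)
    if len(content.strip()) >= min_chars:
        return content, "fast"
    # Fast path produced too little text: retry with the heavier strategy.
    return strong(url), "strong"
```

Keeping the escalation decision in one place makes it easy to tune the threshold per site without touching either extraction path.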

Search flow:

flowchart LR
    A[query + engine] --> B{engine=auto?}
    B -- Yes --> C[Detect language]
    C --> D[Build engine plan]
    B -- No --> E[Use specified engine]
    D --> F[Try engines in order]
    E --> F
    F --> G{Results?}
    G -- Yes --> H[Aggregate + deduplicate]
    G -- No, next engine --> F
    H --> I[Return results]
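
As a rough sketch of the search flow above, the loop below tries engines in order, stops at the first one that returns results, and deduplicates by URL. The engine callables, the CJK-based language check, and the reordered plan for Chinese queries are assumptions made for illustration; only the field names and the default order come from this README.

```python
from typing import Callable, Dict, List, Optional

Result = Dict[str, str]  # {"title": ..., "url": ..., "snippet": ...}
Engine = Callable[[str], List[Result]]


def build_engine_plan(query: str) -> List[str]:
    """Hypothetical plan: put Baidu first when the query contains CJK characters."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in query)
    if has_cjk:
        return ["baidu", "bing", "duckduckgo", "google"]
    return ["duckduckgo", "bing", "google", "baidu"]


def search_with_fallback(
    query: str,
    engines: Dict[str, Engine],
    engine: str = "auto",
    max_results: int = 10,
) -> dict:
    plan = build_engine_plan(query) if engine == "auto" else [engine]
    tried: List[str] = []
    for name in plan:
        try:
            raw = engines[name](query)
        except Exception:
            raw = []  # a missing or failing engine counts as "no results"
        if raw:
            seen, results = set(), []
            for item in raw:
                if item["url"] not in seen:  # deduplicate by URL
                    seen.add(item["url"])
                    results.append(item)
            results = results[:max_results]
            return {
                "engine": name,
                "query": query,
                "results": results,
                "total": len(results),
                "fallback_engines_tried": tried,
            }
        tried.append(name)
    empty: Optional[str] = None
    return {"engine": empty, "query": query, "results": [], "total": 0,
            "fallback_engines_tried": tried}
```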

Why this project exists

Most generic “web fetch” tools either fail on JS-heavy pages or return too much boilerplate. This project focuses on four things:

Non-LLM quality first: usable even with zero model config
Minimal MCP surface: easier for agents, easier to maintain
Pragmatic anti-bot workflow: proxy / cookies / persistent profile are first-class
Golden regression review: full markdown outputs can be saved and inspected page by page

Core capabilities

Non-LLM mode

Capability	Actual behavior
Rendering	Real browser rendering via Playwright
Extraction	Crawl4AI markdown/html extraction
Fallback	Fast path → stronger path when content is too thin
Cleanup	Remove obvious noise, compress blanks, strip data-image placeholders
Site tuning	Medium / Claude Docs / GitHub and other mainstream sites
Block detection	`blocked=true` for likely verification/interstitial output
Batch control	Bounded concurrency via `concurrency`
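
Bounded concurrency of the kind the `concurrency` parameter describes is commonly built on an `asyncio.Semaphore`. The sketch below shows that general pattern only; `fetch_one` is a hypothetical per-URL coroutine, and the `error` key is an illustrative addition that is not part of the documented return shape.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[dict]],
    concurrency: int = 3,
) -> List[dict]:
    """Fetch every URL with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max(1, concurrency))

    async def guarded(url: str) -> dict:
        async with semaphore:  # blocks while `concurrency` fetches are running
            try:
                return await fetch_one(url)
            except Exception as exc:
                # One failing page should not abort the whole batch.
                return {"url": url, "error": str(exc)}

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(guarded(u) for u in urls))
```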

Optional LLM mode

Input	Meaning
`use_llm=true`	Turn on post-cleanup with an OpenAI-compatible model
`llm_instruction`	Tell the model what to keep / remove

Important reality check:

With `llm_instruction`, the prompt is constraint-heavy and biased toward preserving original lines.
Without `llm_instruction`, the model does a more generic “clean readable markdown” pass.
If the LLM call fails for any reason, the tool returns the original non-LLM extraction plus `llm_used=false` and `llm_error`.
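
The fallback behavior described above can be sketched as a thin wrapper around the non-LLM result. This is an illustration under assumptions: `clean` is a hypothetical callable standing in for the OpenAI-compatible request, and treating empty model output as a failure is a choice made for this sketch.

```python
from typing import Callable, Optional


def apply_optional_llm(
    result: dict,
    use_llm: bool,
    clean: Callable[[str, Optional[str]], str],
    llm_instruction: Optional[str] = None,
) -> dict:
    """Return a copy of `result`, LLM-cleaned when possible, untouched otherwise."""
    out = dict(result, llm_used=False, llm_error=None)
    if not use_llm:
        return out
    try:
        cleaned = clean(result["content"], llm_instruction)
        if not cleaned.strip():
            raise ValueError("empty LLM output")
    except Exception as exc:
        # Any failure keeps the original extraction and records the reason.
        out["llm_error"] = f"{type(exc).__name__}: {exc}"
        return out
    out["content"] = cleaned
    out["llm_used"] = True
    return out
```

Because the non-LLM result is computed first and only replaced on success, a broken model configuration can never make the output worse than the default mode.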

MCP Tools

`fetch_urls`

{
  "urls": ["https://a.com", "https://b.com"],
  "format": "markdown",
  "max_chars": 200000,
  "concurrency": 3,
  "use_llm": false,
  "llm_instruction": "keep only the tutorial body and in-body references"
}

Use a single-element list if you only need one page.

Return shape

Field	Meaning
`url`	Original URL
`final_url`	Final resolved URL after redirects
`title`	Extracted title
`content`	Markdown or HTML
`content_format`	`markdown` or `html`
`links`	Normalized extracted links
`blocked`	Likely anti-bot / verification / denied result
`llm_used`	Whether LLM enhancement was actually applied
`llm_error`	Reason the LLM step failed and fell back to the non-LLM result
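
On the caller side, the fields above are enough to sort a batch into usable pages, blocked pages, and pages where the LLM step degraded. The helper below is a hypothetical consumer-side sketch; it assumes each result is a plain dict carrying the documented fields.

```python
from typing import Iterable, List, Tuple


def triage(results: Iterable[dict]) -> Tuple[List[dict], List[str], List[str]]:
    """Split results into (usable results, blocked URLs, LLM-degraded URLs)."""
    usable: List[dict] = []
    blocked: List[str] = []
    degraded: List[str] = []
    for item in results:
        if item.get("blocked"):
            # Likely a verification or interstitial page: do not trust content.
            blocked.append(item["url"])
            continue
        usable.append(item)
        if item.get("llm_error"):
            # Content is still the non-LLM extraction, so it stays usable.
            degraded.append(item["url"])
    return usable, blocked, degraded
```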

`search_web`

{
  "query": "crawl4ai web scraping",
  "engine": "auto",
  "max_results": 10,
  "lang": ""
}

Parameter	Default	Description
`query`	(required)	Search query string
`engine`	`auto`	Engine to use: `auto`, `google`, `bing`, `duckduckgo`, `baidu`
`max_results`	`10`	Maximum number of results
`lang`	`""`	Language hint (e.g. `en`, `zh-CN`)

When `engine="auto"`, the server tries engines in fallback order: DuckDuckGo → Bing → Google → Baidu. The first engine that returns results wins.

Search return shape

Field	Meaning
`engine`	Which engine actually returned results
`query`	Original query
`results`	List of `{title, url, snippet}`
`total`	Number of results
`fallback_engines_tried`	Engines that failed before the successful one

Anti-bot realism

The server already includes randomized browser behavior in code:

Mechanism	Actual status
Random viewport	Yes
Random user agent mode	Yes, when explicit UA is not provided
Delay jitter	Yes
`override_navigator`	Yes
`simulate_user`	Yes, in stronger fallback mode
Proxy / cookies / persistent profile	Supported via env vars
Cloudflare bypass	Enhanced browser fingerprinting + configurable wait strategies

Note: For overseas websites (Medium, ProductHunt, etc.), using a proxy is recommended. The server supports HTTP, HTTPS, and SOCKS5 proxies via the `CRAWL4AI_MCP_PROXY` environment variable.

Proxy input formats

`CRAWL4AI_MCP_PROXY` accepts all of these:

Input	Interpreted as
`http://127.0.0.1:7890`	HTTP proxy
`https://127.0.0.1:7890`	HTTPS proxy
`socks5://127.0.0.1:7890`	SOCKS5 proxy
`socket5://127.0.0.1:7890`	Auto-normalized to `socks5://...`
`127.0.0.1:7890`	Auto-normalized to `http://127.0.0.1:7890`
`7890`	Auto-normalized to `http://127.0.0.1:7890`

In short, this project does not claim “perfect stealth”. What it does offer is human-like randomization and practical anti-bot knobs.
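
The proxy normalization rules from the table above can be expressed in a few lines. This is a sketch that reproduces the documented input/output pairs, not the server's own parser; the function name and the empty-string handling are assumptions.

```python
import re

_SCHEME = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.-]*)://(.*)$")


def normalize_proxy(value: str) -> str:
    """Normalize a proxy setting to `scheme://host:port` form."""
    raw = value.strip()
    if not raw:
        return ""  # assumption: an empty value means "no proxy"
    if raw.isdigit():
        # Port only: assume a local HTTP proxy.
        return f"http://127.0.0.1:{raw}"
    match = _SCHEME.match(raw)
    if match is None:
        # host:port without a scheme defaults to HTTP.
        return f"http://{raw}"
    scheme, rest = match.group(1).lower(), match.group(2)
    if scheme == "socket5":  # tolerate the common misspelling
        scheme = "socks5"
    return f"{scheme}://{rest}"
```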

Quickstart

Conda

conda env create -f environment.yml
conda activate crawl4ai-mcp
python -m playwright install
crawl4ai-mcp

venv

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e '.[dev]'
python -m playwright install
crawl4ai-mcp

MCP server config example

{
  "mcpServers": {
    "crawl4ai": {
      "command": "crawl4ai-mcp",
      "env": {
        "CRAWL4AI_MCP_HEADLESS": "true",
        "CRAWL4AI_MCP_PROXY": "127.0.0.1:7890",
        "CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS": "30000",
        "CRAWL4AI_MCP_WAIT_UNTIL": "load",

        "OPENAI_BASE_URL": "https://your-openai-compatible-host",
        "OPENAI_API_KEY": "your-api-key",
        "OPENAI_MODEL": "your-model-name"
      }
    }
  }
}

LLM-related env vars are optional, and `use_llm` is still off by default at call time. If any LLM env var is missing or invalid, or the model call fails, the server automatically falls back to non-LLM extraction.
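
Since the LLM variables are optional, a config for non-LLM use can simply leave them out. The helper below is a hypothetical convenience script for generating such a config; only the key names and example values are taken from the JSON above.

```python
import json
from typing import Dict, Optional

LLM_KEYS = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL")


def build_mcp_config(
    proxy: Optional[str] = None,
    llm: Optional[Dict[str, str]] = None,
) -> dict:
    """Build an mcpServers entry, adding proxy and LLM env vars only when given."""
    env = {
        "CRAWL4AI_MCP_HEADLESS": "true",
        "CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS": "30000",
        "CRAWL4AI_MCP_WAIT_UNTIL": "load",
    }
    if proxy:
        env["CRAWL4AI_MCP_PROXY"] = proxy
    if llm:
        # Copy only the three documented LLM variables.
        env.update({key: llm[key] for key in LLM_KEYS if key in llm})
    return {"mcpServers": {"crawl4ai": {"command": "crawl4ai-mcp", "env": env}}}


if __name__ == "__main__":
    print(json.dumps(build_mcp_config(proxy="127.0.0.1:7890"), indent=2))
```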

Runtime configuration

Env var	Purpose
`CRAWL4AI_MCP_HEADLESS`	Run browser headless
`CRAWL4AI_MCP_PROXY`	Upstream proxy, supports `http://`, `https://`, `socks5://`, `host:port`, and `port-only`
`CRAWL4AI_MCP_COOKIES_JSON`	Playwright storage state JSON
`CRAWL4AI_MCP_USE_PERSISTENT_CONTEXT`	Reuse browser profile
`CRAWL4AI_MCP_USER_DATA_DIR`	Profile directory
`CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS`	Maximum wait for a single navigation, in milliseconds; default `30000`
`CRAWL4AI_MCP_WAIT_UNTIL`	Page readiness strategy; default `load`
`OPENAI_BASE_URL`	OpenAI-compatible base URL
`OPENAI_API_KEY`	API key
`OPENAI_MODEL`	Model name
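
To make the table concrete, here is one way such variables could be read into a typed settings object. It is a sketch, not the server's loader: the `Settings` class, the accepted truthy strings, and the headless default of `true` are assumptions, while the `30000` and `load` defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    headless: bool
    proxy: Optional[str]
    navigation_timeout_ms: int
    wait_until: str
    llm_configured: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented env vars, applying the documented defaults."""
    llm_keys = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL")
    return Settings(
        # Assumption: headless defaults to true when the variable is unset.
        headless=env.get("CRAWL4AI_MCP_HEADLESS", "true").strip().lower() in _TRUTHY,
        proxy=env.get("CRAWL4AI_MCP_PROXY") or None,
        navigation_timeout_ms=int(env.get("CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS", "30000")),
        wait_until=env.get("CRAWL4AI_MCP_WAIT_UNTIL", "load"),
        # LLM mode needs all three variables; otherwise non-LLM mode applies.
        llm_configured=all(env.get(key, "").strip() for key in llm_keys),
    )
```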

Golden smoke regression

CRAWL4AI_MCP_SMOKE_DIR=./_golden_outputs .venv/bin/python -m crawl4ai_mcp.smoke_golden

This writes full markdown outputs to `_golden_outputs/` so you can inspect extraction quality page by page.
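
For a quick first pass before reading pages one by one, a small script can list each saved output with its size and flag suspiciously short ones. This is a hypothetical review aid: it assumes the outputs are `.md` files directly inside the directory, and the `min_chars` threshold is arbitrary.

```python
from pathlib import Path
from typing import List, Tuple


def summarize_golden(directory: str, min_chars: int = 500) -> List[Tuple[str, int, bool]]:
    """Return (file name, character count, looks_thin) for each markdown output."""
    rows = []
    for path in sorted(Path(directory).glob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append((path.name, len(text), len(text.strip()) < min_chars))
    return rows


if __name__ == "__main__":
    for name, size, thin in summarize_golden("./_golden_outputs"):
        print(f"{'THIN' if thin else 'ok':4}  {size:>8}  {name}")
```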

The golden set now includes the earlier baseline URLs plus ainew.me, openclaw, watcha, producthunt, mydrivers, caihongtu, openrouter, and mobile Douban. For sites outside mainland China, proxy-based verification is recommended.

Some overseas sites may still return Cloudflare or similar verification pages even when a proxy is configured. In those cases the server marks them with `blocked=true`. The recommended path is: better proxy quality, valid cookies, or a persistent browser profile after manual verification.

Prior art

Crawl4AI: https://github.com/unclecode/crawl4ai
mcp-crawl4ai-rag: https://github.com/coleam00/mcp-crawl4ai-rag
weidwonder/crawl4ai-mcp-server: https://github.com/weidwonder/crawl4ai-mcp-server
WaterCrawl: https://github.com/watercrawl/WaterCrawl
teracrawl: https://github.com/BrowserCash/teracrawl

License

This project is licensed under AGPL-3.0-or-later.

Project details

Release history (PyPI package: crawl4agent)

Version	Released
0.3.5	Jun 16, 2026
0.3.1	Mar 25, 2026
0.2.3	Mar 12, 2026
0.2.0 (this version)	Mar 11, 2026