Fast Python web crawler for AI & RAG ingestion – crawl, extract, and embed website content with one tool.
MarkCrawl by iD8
Turn any webpage or website into clean Markdown for LLM pipelines – in one command.
pip install markcrawl
markcrawl --base https://docs.example.com --out ./output --show-progress
MarkCrawl is a crawl-and-structure engine. It fetches one page or crawls an entire website, strips navigation/scripts/boilerplate, and writes clean Markdown files with a structured JSONL index. Every page includes a citation with the access date. No API keys needed.
Everything else (LLM extraction, Supabase upload, MCP server, LangChain tools) is optional and installed separately.
Want a hosted API instead of running locally? Join the waitlist – we're gauging interest.
LLM agents: Load docs/LLM_PROMPT.md as a system prompt to generate correct MarkCrawl commands automatically.
Quickstart (2 minutes)
pip install markcrawl
markcrawl --base https://quotes.toscrape.com --out ./demo --max-pages 5 --show-progress
Your ./demo folder now contains:
demo/
├── index__a4f3b2c1d0.md    ← clean Markdown of the page
├── page-2__b7e2d1f0a3.md
├── ...
└── pages.jsonl             ← structured index (one JSON line per page)
Each line in pages.jsonl:
{
"url": "https://quotes.toscrape.com/",
"title": "Quotes to Scrape",
"crawled_at": "2026-04-04T12:30:00Z",
"citation": "Quotes to Scrape. quotes.toscrape.com. Available at: https://quotes.toscrape.com/ [Accessed April 04, 2026].",
"tool": "markcrawl",
"text": "# Quotes to Scrape\n\n> \"The world as we have created it is a process of our thinking...\" – Albert Einstein\n\nTags: change, deep-thoughts, thinking, world..."
}
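Because each line is a standalone JSON object, the index streams easily into downstream tooling. A minimal Python sketch (field names taken from the example above; `load_pages` is an illustrative helper, not part of MarkCrawl):

```python
import json

def load_pages(index_file):
    """Yield one dict per crawled page from a pages.jsonl index."""
    with open(index_file, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)

# Each row carries the page text plus its citation metadata
sample = '{"url": "https://quotes.toscrape.com/", "title": "Quotes to Scrape", "tool": "markcrawl"}'
page = json.loads(sample)
print(page["title"])  # Quotes to Scrape
```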
Common Recipes
Scrape a single page:
markcrawl --base https://example.com/pricing --no-sitemap --max-pages 1
Scrape a single JS-rendered page (React, Vue, YouTube, etc.):
markcrawl --base "https://www.youtube.com/@channel/videos" \
--no-sitemap --max-pages 1 --render-js
# → outputs one .md file with video titles, view counts, and dates
For infinite-scroll pages like YouTube, this captures the first ~28 videos from the initial render.
Crawl a docs site:
markcrawl --base https://docs.example.com --max-pages 500 --concurrency 5 --show-progress
Crawl a subsection without sitemap wandering:
Large sites (YouTube, GitHub, etc.) have sitemaps with thousands of unrelated pages.
Use --no-sitemap to crawl only from your target URL:
markcrawl --base https://docs.example.com/guides \
--no-sitemap --max-pages 50 --show-progress
Competitive analysis (crawl 3 competitors, extract pricing):
markcrawl --base https://competitor-one.com/pricing --no-sitemap --max-pages 1 --out ./comp1
markcrawl --base https://competitor-two.com/pricing --no-sitemap --max-pages 1 --out ./comp2
markcrawl --base https://competitor-three.com/pricing --no-sitemap --max-pages 1 --out ./comp3
markcrawl-extract \
--jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
--fields pricing_tiers features free_trial --show-progress
# → extracted.jsonl with structured pricing data across all three
Docs site โ RAG chatbot (full pipeline: crawl, embed, query):
markcrawl --base https://docs.example.com --out ./docs --max-pages 500 --concurrency 5 --show-progress
markcrawl-upload --jsonl ./docs/pages.jsonl --show-progress
# → pages are chunked, embedded, and uploaded to Supabase/pgvector
# Wire your chatbot to query the vector table – see docs/SUPABASE.md
API docs โ code generation prompt:
markcrawl --base https://api.example.com/docs --out ./api-docs --max-pages 200 --show-progress
# Feed the output to an LLM:
# "Using the API documentation in ./api-docs/pages.jsonl, generate a
# typed Python client with methods for each endpoint."
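One way to assemble such a prompt programmatically from the index. This is an illustrative stdlib-only sketch; `build_prompt` and the character cap are assumptions, not a MarkCrawl API:

```python
import json

def build_prompt(index_file, instruction, max_chars=50_000):
    """Concatenate crawled page text under an instruction, capped by size."""
    parts = [instruction]
    total = len(instruction)
    with open(index_file, encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            snippet = f"## {page['title']}\n{page['text']}"
            if total + len(snippet) > max_chars:
                break  # stay within the model's context budget
            parts.append(snippet)
            total += len(snippet)
    return "\n\n".join(parts)
```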
Back up a blog before it shuts down:
markcrawl --base https://engineering.example.com/blog \
--no-sitemap --max-pages 1000 --concurrency 5 --out ./blog-archive --show-progress
# → every post saved as clean Markdown with citations and access dates
Skip junk pages (job listings, login walls, SEO spam):
markcrawl --base https://example.com \
--exclude-path "/job/*" --exclude-path "/careers/*" --exclude-path "/login" \
--max-pages 500 --out ./output --show-progress
Preview URLs before committing to a long crawl:
markcrawl --base https://example.com --dry-run
# → prints every URL that would be crawled (from sitemap), then exits
# Pipe to wc -l to get a count, or grep to check for junk patterns
markcrawl --base https://example.com --dry-run | wc -l
markcrawl --base https://example.com --dry-run | grep "/job/"
Only crawl specific sections (blog + pricing, ignore everything else):
markcrawl --base https://example.com \
--include-path "/blog/*" --include-path "/pricing" \
--max-pages 200 --out ./output --show-progress
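These patterns behave like ordinary path globs. A simplified illustration of the matching logic (not MarkCrawl's actual implementation):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def path_allowed(url, include=(), exclude=()):
    """Return True if the URL's path passes include/exclude glob filters."""
    path = urlparse(url).path or "/"
    if any(fnmatch(path, pat) for pat in exclude):
        return False  # excluded paths always lose
    if include and not any(fnmatch(path, pat) for pat in include):
        return False  # with include patterns, only matches pass
    return True

print(path_allowed("https://example.com/blog/post-1", include=["/blog/*", "/pricing"]))  # True
print(path_allowed("https://example.com/careers/x", exclude=["/careers/*"]))             # False
```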
Safe crawl of a job board (dry-run + exclude):
# Step 1: see what you'd get
markcrawl --base https://tealhq.com --dry-run | head -50
# Step 2: exclude the job listings, crawl just the content pages
markcrawl --base https://tealhq.com \
--exclude-path "/job/*" --exclude-path "/resume-examples/*" \
--max-pages 200 --out ./tealhq --show-progress
Choose an extraction backend:
# Default (BS4 + markdownify) – fastest, good for most sites
markcrawl --base https://docs.example.com --out ./output --show-progress
# Ensemble – runs default + trafilatura, picks best per page
markcrawl --base https://docs.example.com --out ./output --extractor ensemble --show-progress
# ReaderLM-v2 – ML-based extraction (requires: pip install markcrawl[ml])
markcrawl --base https://docs.example.com --out ./output --extractor readerlm --show-progress
Skip pages you've already crawled (cross-crawl dedup):
# First crawl
markcrawl --base https://docs.example.com --out ./docs --show-progress
# Later – only fetches new/changed pages
markcrawl --base https://docs.example.com --out ./docs --cross-dedup --show-progress
Crawl high-value pages first (link prioritization):
markcrawl --base https://docs.example.com --out ./docs \
--prioritize-links --max-pages 100 --show-progress
# Prioritizes content-rich pages (guides, docs) over low-value ones (legal, login)
Smart-sample a large site (e-commerce, job boards, real estate):
# Preview the pattern clusters first
markcrawl --base https://bigsite.com --dry-run --smart-sample --show-progress
# Crawl with sampling – 5 pages per templated cluster instead of thousands
markcrawl --base https://bigsite.com --out ./bigsite \
--smart-sample --sample-size 5 --sample-threshold 20 --show-progress
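The idea behind smart sampling can be shown in a few lines: collapse ID-like path segments into a template, then crawl only a handful of URLs per template. A simplified sketch (MarkCrawl's real clustering may differ):

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def url_template(url):
    """Collapse ID-like path segments so templated URLs cluster together."""
    segments = urlparse(url).path.strip("/").split("/")
    return "/" + "/".join(
        "{id}" if re.fullmatch(r"[\w-]*\d[\w-]*", seg) else seg
        for seg in segments
    )

urls = [
    "https://bigsite.com/job/1021",
    "https://bigsite.com/job/1022",
    "https://bigsite.com/job/1023",
    "https://bigsite.com/about",
]
clusters = defaultdict(list)
for u in urls:
    clusters[url_template(u)].append(u)
print(sorted(clusters))  # ['/about', '/job/{id}']
```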
Download images alongside content (photography blogs, product pages):
# Crawl a photography blog and save images from the content area
markcrawl --base https://photography-blog.example.com --out ./photos \
--download-images --max-pages 50 --show-progress
# Output:
# ./photos/assets/mountain-abc123.jpg
# ./photos/assets/sunset-def456.png
# ./photos/post-1__a1b2c3.md – Markdown with local image refs
# ./photos/pages.jsonl – index includes "images" array per page
# Adjust minimum image size to skip thumbnails (default: 5000 bytes)
markcrawl --base https://example.com/gallery --out ./gallery \
--download-images --min-image-size 20000 --show-progress
Resume an interrupted crawl:
markcrawl --base https://docs.example.com --out ./docs --resume --show-progress
How it compares to other crawlers
Different tools make different tradeoffs. This table summarizes the main differences:
| | MarkCrawl | FireCrawl | Crawl4AI | Scrapy |
|---|---|---|---|---|
| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 |
| Install | `pip install markcrawl` | SaaS or self-host | pip + Playwright | pip + framework |
| Output | Markdown + JSONL | Markdown + JSON | Markdown | Custom pipelines |
| JS rendering | Optional (`--render-js`) | Built-in | Built-in | Plugin |
| LLM extraction | Optional add-on | Via API | Built-in | None |
| Best for | Single-site crawl → Markdown | Hosted scraping API | AI-native crawling | Large-scale distributed |
Each tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep browser automation, and Scrapy handles massive distributed workloads. MarkCrawl focuses on simple local crawls that produce LLM-ready Markdown.
Benchmark results (7 tools, April 2026, v2 methodology)
Speed: markcrawl is fastest (12.1 pages/sec), scrapy+md second (9.5). Playwright-based tools (crawlee, playwright, crawl4ai) average 1.5–2.2 pages/sec.
Content signal: markcrawl leads at 99% (ratio of answer-bearing tokens to total output); almost no navigation, footer, or boilerplate makes it into your embeddings.
RAG quality: markcrawl scores 4.52/5 on LLM-judged answer quality (tied #2; the leader at 4.53 is within noise) and 0.698 MRR (3rd; leader crawlee at 0.733), with 2.1x fewer chunks than crawlee, keeping embedding costs low.
| Tool | Speed (p/s) | Content Signal | MRR | Answer (/5) | Annual cost (100K pages) |
|---|---|---|---|---|---|
| markcrawl | 12.1 | 99% | 0.698 | 4.52 | $4,505 |
| scrapy+md | 9.5 | 93% | 0.459 | 4.03 | $5,464 |
| colly+md | 4.2 | 67% | 0.677 | 4.53 | $7,213 |
| playwright | 2.2 | 64% | 0.727 | 4.42 | $7,320 |
| crawlee | 1.7 | 63% | 0.733 | 4.52 | $7,467 |
| crawl4ai | 1.5 | 83% | 0.694 | 4.43 | $6,960 |
Full benchmark data: docs/BENCHMARKS.md | Methodology: llm-crawler-benchmarks
RAG-optimized recipe (v0.6.0): With --i18n-filter --title-at-top and the opt-in chunker flags (auto_extract_title=True, prepend_first_paragraph=True, strip_markdown_links=True on chunk_markdown), markcrawl reaches 0.8148 MRR on the same 57-query benchmark: a +0.18 jump over the default config and +0.08 over the next best tool (crawlee at 0.733).
Installation
The core crawler is the only thing you need. Everything else is optional.
pip install markcrawl # Core crawler (free, no API keys)
Optional add-ons:
pip install markcrawl[extract] # + LLM extraction (OpenAI, Claude, Gemini, Grok)
pip install markcrawl[js] # + JavaScript rendering (Playwright)
pip install markcrawl[upload] # + Supabase upload with embeddings
pip install markcrawl[ml] # + ReaderLM-v2 extraction backend
pip install markcrawl[mcp] # + MCP server for AI agents
pip install markcrawl[langchain] # + LangChain tool wrappers
pip install markcrawl[all] # Everything
For Playwright, also run playwright install chromium after installing.
Install from source (for development)
git clone https://github.com/AIMLPM/markcrawl.git
cd markcrawl
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"
Crawling
markcrawl --base https://www.example.com --out ./output --show-progress
Add flags as needed:
markcrawl \
  --base https://www.example.com \
  --out ./output \
  --include-subdomains \
  --render-js \
  --concurrency 5 \
  --proxy http://proxy:8080 \
  --max-pages 200 \
  --format markdown \
  --show-progress
--include-subdomains crawls sub.example.com too; --render-js renders JavaScript (React, Vue, etc.); --concurrency 5 fetches five pages in parallel; --proxy routes requests through a proxy; --max-pages 200 stops after 200 pages; --format accepts markdown or text.
Resume an interrupted crawl:
markcrawl --base https://www.example.com --out ./output --resume --show-progress
Output
Each page becomes a .md file with a citation header:
# Getting Started
> URL: https://docs.example.com/getting-started
> Crawled: April 04, 2026
> Citation: Getting Started. docs.example.com. Available at: https://docs.example.com/getting-started [Accessed April 04, 2026].
Welcome to the platform. This guide walks you through installation...
Navigation, footer, cookie banners, and scripts are stripped. Only the main content remains.
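The header lines are plain Markdown blockquotes, so they are easy to parse back out when you need the provenance programmatically. An illustrative helper (regex based on the header shown above; not a MarkCrawl API):

```python
import re

HEADER_RE = re.compile(r"^> (URL|Crawled|Citation): (.+)$", re.MULTILINE)

def parse_header(markdown_text):
    """Pull the URL / Crawled / Citation lines out of a crawled .md file."""
    return {key.lower(): value for key, value in HEADER_RE.findall(markdown_text)}

sample = (
    "# Getting Started\n"
    "> URL: https://docs.example.com/getting-started\n"
    "> Crawled: April 04, 2026\n"
)
print(parse_header(sample)["crawled"])  # April 04, 2026
```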
All crawler CLI arguments
| Argument | Description |
|---|---|
| `--base` | Base site URL to crawl |
| `--out` | Output directory |
| `--format` | `markdown` or `text` (default: `markdown`) |
| `--show-progress` | Print progress and crawl events |
| `--render-js` | Render JavaScript with Playwright before extracting |
| `--concurrency` | Pages to fetch in parallel (default: 1) |
| `--proxy` | HTTP/HTTPS proxy URL |
| `--resume` | Resume from saved state |
| `--include-subdomains` | Include subdomains under the base domain |
| `--max-pages` | Max pages to save; 0 = unlimited (default: 500) |
| `--delay` | Minimum delay between requests in seconds (default: 0; adaptive throttle adjusts automatically) |
| `--timeout` | Per-request timeout in seconds (default: 15) |
| `--min-words` | Skip pages with fewer words (default: 20) |
| `--user-agent` | Override the default user agent |
| `--use-sitemap` / `--no-sitemap` | Enable/disable sitemap discovery. Use `--no-sitemap` to scrape a specific page or subsection; without it, large sites (YouTube, GitHub) may discover thousands of unrelated pages via their sitemap |
| `--exclude-path` | Glob pattern to exclude URL paths (e.g. `/job/*`). Can be repeated |
| `--include-path` | Glob pattern to include URL paths (e.g. `/blog/*`). Only matching paths are crawled. Can be repeated |
| `--dry-run` | Discover URLs (via sitemap/links) and print them without fetching content |
| `--smart-sample` | Auto-detect templated URL patterns and sample from large clusters instead of crawling every page |
| `--sample-size` | Pages to sample per templated cluster (default: 5; used with `--smart-sample`) |
| `--sample-threshold` | Clusters larger than this are sampled (default: 20; used with `--smart-sample`) |
| `--auto-resume` | Automatically resume if saved state exists, otherwise start fresh |
| `--cross-dedup` | Skip pages already seen in previous crawls to the same output directory |
| `--prioritize-links` | Score discovered links by predicted content yield and crawl high-value pages first |
| `--extractor` | Content extraction backend: `default`, `trafilatura`, `ensemble`, or `readerlm` |
| `--download-images` | Download images from the content area to `assets/` and use local paths in Markdown |
| `--min-image-size` | Minimum image file size in bytes to keep (default: 5000). Smaller images are skipped |
| `--i18n-filter` | Skip URLs under locale path segments (`/fr/`, `/de-DE/`, `/zh-Hans/`, ...); generic, no per-domain config |
| `--title-at-top` | Prepend `# {title}` to the `text` field of every JSONL row when not already present; part of the top-MRR RAG recipe |
Optional: structured extraction
If you need structured data (not just text), the extraction add-on uses an LLM to pull specific fields from each page.
pip install markcrawl[extract]
markcrawl-extract \
--jsonl ./output/pages.jsonl \
--fields company_name pricing features \
--show-progress
Auto-discover fields across multiple crawled sites:
markcrawl-extract \
--jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
--auto-fields \
--context "competitor pricing analysis" \
--show-progress
Supports OpenAI, Anthropic (Claude), Google Gemini, and xAI (Grok) via --provider.
Extraction details
Provider and model selection
markcrawl-extract --jsonl ... --fields pricing --provider openai # default
markcrawl-extract --jsonl ... --fields pricing --provider anthropic # Claude
markcrawl-extract --jsonl ... --fields pricing --provider gemini # Gemini
markcrawl-extract --jsonl ... --fields pricing --provider grok # Grok
markcrawl-extract --jsonl ... --fields pricing --model gpt-4o # override model
| Provider | API key env var | Default model |
|---|---|---|
| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-20250514` |
| Google Gemini | `GEMINI_API_KEY` | `gemini-2.0-flash` |
| xAI (Grok) | `XAI_API_KEY` | `grok-3-mini-fast` |
All extraction CLI arguments
| Argument | Description |
|---|---|
| `--jsonl` | Path(s) to `pages.jsonl`; pass multiple for cross-site analysis |
| `--fields` | Field names to extract (space-separated) |
| `--auto-fields` | Auto-discover fields by sampling pages |
| `--context` | Describe your goal for auto-discovery |
| `--sample-size` | Pages to sample for auto-discovery (default: 3) |
| `--provider` | `openai`, `anthropic`, `gemini`, or `grok` |
| `--model` | Override the default model |
| `--output` | Output path (default: `extracted.jsonl`) |
| `--delay` | Delay between LLM calls in seconds (default: 0.25) |
| `--show-progress` | Print progress |
Output format
Extracted rows include LLM attribution:
{
"url": "https://competitor.com/pricing",
"citation": "Pricing. competitor.com. Available at: ... [Accessed April 04, 2026].",
"pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
"extracted_by": "gpt-4o-mini (openai)",
"extraction_note": "Field values were extracted by an LLM and may be interpreted, not verbatim."
}
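Extracted rows are plain JSONL too, so flattening them into a spreadsheet takes a few stdlib lines (illustrative helper; field names from the example above):

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path, fields):
    """Flatten extracted.jsonl rows into a CSV for side-by-side review."""
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))  # missing fields become empty cells
```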
Optional: Supabase vector search (RAG)
Chunk pages, generate embeddings, and upload to Supabase with pgvector:
pip install markcrawl[upload]
markcrawl --base https://docs.example.com --out ./output --show-progress
markcrawl-upload --jsonl ./output/pages.jsonl --show-progress
Requires SUPABASE_URL, SUPABASE_KEY, and OPENAI_API_KEY. See docs/SUPABASE.md for table setup, query examples, and recommendations.
Optional: agent integrations
MarkCrawl includes integrations for AI agents. Each is an optional add-on.
MCP Server (Claude Desktop, Cursor, Windsurf)
pip install markcrawl[mcp]
Then register the server in your MCP client's configuration (for Claude Desktop, claude_desktop_config.json):
{
"mcpServers": {
"markcrawl": {
"command": "python",
"args": ["-m", "markcrawl.mcp_server"]
}
}
}
Tools: crawl_site, list_pages, read_page, search_pages, extract_data
LangChain Tool
pip install markcrawl[langchain]
from markcrawl.langchain import all_tools
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType

agent = initialize_agent(
    tools=all_tools,
    llm=ChatOpenAI(model="gpt-4o-mini"),
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
)
agent.run("Crawl docs.example.com and summarize their auth guide")
OpenClaw Skill (WhatsApp, Telegram, Slack)
npx clawhub install markcrawl-skill
LLM assistant prompt
Copy the system prompt from docs/LLM_PROMPT.md into any LLM to get an assistant that generates correct MarkCrawl commands.
When NOT to use MarkCrawl
- Sites behind login/auth – no cookie or session support
- Aggressive bot protection (Cloudflare, Akamai) – no anti-bot evasion
- Millions of pages – designed for hundreds to low thousands; use Scrapy for scale
- PDF content – HTML only (PDF support is on the roadmap)
- JavaScript SPAs – add markcrawl[js] and use --render-js for React/Vue/Angular
- Infinite-scroll pages – --render-js renders the initial page load but does not scroll; you'll get the first screenful of content (e.g., ~28 of 82 YouTube videos). For complete listings, combine with the platform's API or RSS feed (e.g., YouTube's /feeds/videos.xml?channel_id=...)
Architecture
MarkCrawl is a web crawler. The optional layers (extraction, upload, agents) are separate add-ons that work with the crawler's output.
CORE (free, no API keys)           OPTIONAL ADD-ONS
┌────────────────────────────┐
│ 1. Discover URLs           │   markcrawl[extract]   → LLM field extraction
│    (sitemap or links)      │   markcrawl[upload]    → Supabase/pgvector RAG
│ 2. Fetch & clean HTML      │   markcrawl[js]        → Playwright JS rendering
│ 3. Write Markdown + JSONL  │   markcrawl[mcp]       → MCP server for agents
│    + auto-citation         │   markcrawl[langchain] → LangChain tools
└────────────────────────────┘
For internals, see docs/ARCHITECTURE.md.
Extending MarkCrawl
from markcrawl import crawl
result = crawl("https://example.com", out_dir="./output")
print(f"Saved {result.pages_saved} pages")
# Process output in your own pipeline
import json
with open(result.index_file) as f:
for line in f:
page = json.loads(line)
your_db.insert(page) # Pinecone, Weaviate, Elasticsearch, etc.
# Use individual components
from markcrawl import chunk_text
from markcrawl.extract import LLMClient, extract_fields
See docs/ARCHITECTURE.md for the full module map and extensibility guide.
Cost
The core crawler is free. Two optional features have API costs:
| Feature | Cost | When |
|---|---|---|
| Structured extraction | ~$0.01-0.03 per page | markcrawl-extract |
| Supabase upload | ~$0.0001 per page | markcrawl-upload |
Setting up API keys
Only needed for extraction and upload. The core crawler requires no keys.
# .env – in your working directory
OPENAI_API_KEY="sk-..." # extraction (--provider openai) + upload
ANTHROPIC_API_KEY="sk-ant-..." # extraction (--provider anthropic)
GEMINI_API_KEY="AI..." # extraction (--provider gemini)
XAI_API_KEY="xai-..." # extraction (--provider grok)
SUPABASE_URL="https://..." # upload
SUPABASE_KEY="eyJ..." # upload (service-role key)
set -a && source .env && set +a   # export the variables so child processes see them
Project structure
.
├── README.md
├── LICENSE
├── PRIVACY.md
├── SECURITY.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── Makefile
├── glama.json
├── pyproject.toml
├── requirements.txt
├── .github/
│   ├── pull_request_template.md
│   └── workflows/
│       ├── ci.yml
│       └── publish.yml
├── docs/
│   ├── ARCHITECTURE.md
│   ├── LLM_PROMPT.md
│   ├── MCP_SUBMISSION.md
│   ├── RAG_RETRIEVAL_RESEARCH.md
│   └── SUPABASE.md
├── tests/
│   ├── __init__.py
│   ├── test_chunker.py
│   ├── test_core.py
│   ├── test_extract.py
│   └── test_upload.py
└── markcrawl/
    ├── __init__.py
    ├── cli.py
    ├── core.py             # orchestrator
    ├── fetch.py            # HTTP/Playwright fetching
    ├── robots.py           # robots.txt parsing
    ├── throttle.py         # adaptive rate limiting
    ├── state.py            # crawl state & resume
    ├── urls.py             # URL normalization & filtering
    ├── extract_content.py  # HTML → Markdown conversion
    ├── dedup.py            # cross-crawl deduplication
    ├── link_scorer.py      # link prioritization
    ├── chunker.py
    ├── exceptions.py
    ├── utils.py
    ├── extract.py          # LLM field extraction
    ├── extract_cli.py
    ├── upload.py
    ├── upload_cli.py
    ├── langchain.py
    └── mcp_server.py
Roadmap
- Canonical URL support
- PDF support
- Authenticated crawling
- Multi-provider embeddings
Shipped features
- pip install markcrawl on PyPI
- 200 automated tests + GitHub Actions CI (Python 3.10–3.13) + ruff linting
- Markdown and plain text output with auto-citation
- Sitemap-first crawling with robots.txt compliance
- Text chunking with configurable overlap + semantic chunking
- Supabase/pgvector upload for RAG
- JavaScript rendering via Playwright
- Concurrent fetching and proxy support
- Resume interrupted crawls + auto-resume
- LLM extraction (OpenAI, Claude, Gemini, Grok) with auto-field discovery
- MCP server, LangChain tools, OpenClaw skill
- Image alt text preservation
- Python API (result.pages)
- Page-type extraction and content-region heuristics
- Multiple extraction backends (default, trafilatura, ensemble, ReaderLM-v2)
- Cross-crawl deduplication (--cross-dedup)
- Link prioritization by predicted content yield (--prioritize-links)
- Smart sampling of templated URL clusters (--smart-sample)
- URL path filtering (--include-path, --exclude-path) and dry-run preview
Contributing
See CONTRIBUTING.md. If you used an LLM to generate code, include the prompt in your PR.
Security
See SECURITY.md.
Privacy
MarkCrawl runs locally. No telemetry, no analytics, no data sent anywhere. See PRIVACY.md.
License
MIT. See LICENSE.