Fast Python web crawler for AI & RAG ingestion: crawl, extract, and embed website content with one tool.
MarkCrawl by iD8
Fast Python Web Crawler for AI & RAG Ingestion
A lightweight Python website crawler that extracts clean Markdown or plain text from websites for AI ingestion, RAG pipelines, search indexing, and documentation archiving.
This project starts with a sitemap when available, respects robots.txt, keeps crawling in-scope, and writes both page files and a pages.jsonl index for downstream processing.
How it works
```mermaid
flowchart LR
    A["Website"] --> B["Crawl"]
    B --> C["pages.jsonl\n+ .md files"]
    C --> D{"What next?"}
    D -->|"--auto-fields\nor --fields"| E["Extract\n(OpenAI / Claude / Gemini)"]
    D -->|"upload_cli"| F["Chunk +\nEmbed"]
    E --> G["extracted.jsonl\nStructured data"]
    F --> H["Supabase\npgvector"]
    H --> I["Vector\nSearch / RAG"]
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#fff9c4
    style E fill:#fce4ec
    style F fill:#fce4ec
    style G fill:#e8f5e9
    style H fill:#e8f5e9
    style I fill:#e8f5e9
```
- Free path: Crawl → `pages.jsonl` → done. No API keys needed.
- Extraction path: Add `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `GEMINI_API_KEY` to pull structured fields from pages using the LLM of your choice.
- RAG path: Add Supabase credentials to chunk, embed, and store for vector search.
Why this exists
A lot of crawlers are either too heavyweight for small ingestion jobs or too focused on broad web scraping. This project is intentionally simple:
- crawl a single site or subdomain set
- extract readable content instead of raw HTML
- produce output that is easy to load into embeddings, vector stores, or search pipelines
- stay understandable and hackable for contributors
Features
- Sitemap-first crawling
- `robots.txt` checks
- Optional subdomain support
- Markdown or plain text output
- Progress logging with a single CLI flag
- Retry and backoff support for transient errors
- Safe filenames with URL hashing
- JSONL index for ingestion workflows
- Basic content cleanup for nav / footer / utility elements
- Built-in text chunking for embeddings
- Supabase / pgvector upload with OpenAI embeddings
- Optional JavaScript rendering via Playwright
- Concurrent page fetching
- Proxy support
- Resume interrupted crawls
- LLM-powered structured extraction
Project structure
```
.
├── README.md
├── LICENSE
├── .gitignore
├── requirements.txt
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── SECURITY.md
├── tests/
│   ├── test_core.py
│   └── test_chunker.py
└── webcrawler/
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── chunker.py
    ├── upload.py
    ├── upload_cli.py
    ├── extract.py
    ├── extract_cli.py
    └── mcp_server.py
```
Installation
Option 1: Run locally
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Option 2: Install as a local package
```bash
pip install -e .
```
Option 3: Install with JavaScript rendering support
```bash
pip install -e ".[js]"
playwright install chromium
```
This adds Playwright for rendering JavaScript-heavy sites with --render-js.
Option 4: Install with structured extraction support
```bash
pip install -e ".[extract]"
```
This adds the openai package needed for LLM-powered field extraction via python -m webcrawler.extract_cli.
What does extraction add? The base crawler gives you the full text content of every page. The extraction step uses an LLM to pull out specific structured fields you define. Here's the difference:
Without LLM extraction, you get raw page content:

```json
{
  "url": "https://competitor.com/pricing",
  "title": "Pricing - Competitor",
  "text": "Pricing Plans\n\nStarter\n$29/month\nUp to 1,000 API calls...\n\nPro\n$99/month\nUp to 50,000 API calls...\n\nEnterprise\nContact us\nUnlimited API calls, SLA, dedicated support...\n\nAll plans include SSL, 99.9% uptime, and REST API access.\n\nQuestions? Contact sales@competitor.com"
}
```
This is useful for search and RAG, but you'd need to manually read through hundreds of pages to compare competitors or find specific details.
With LLM extraction (`--fields pricing_tiers lowest_price enterprise_available api_included contact_email`):

```json
{
  "url": "https://competitor.com/pricing",
  "title": "Pricing - Competitor",
  "pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
  "lowest_price": "$29/month",
  "enterprise_available": "Yes, contact sales for pricing",
  "api_included": "Yes, REST API on all plans",
  "contact_email": "sales@competitor.com"
}
```
Now you can load this into a spreadsheet or database and instantly compare across 10 competitors, with no manual reading required. You define the fields, the LLM finds the answers.
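As a sketch of that spreadsheet step (the field names and row below are taken from the example output above; `jsonl_to_csv` is an illustrative helper, not part of this package), the extracted rows can be flattened into CSV:

```python
import csv
import io
import json

def jsonl_to_csv(jsonl_text: str, fields: list[str]) -> str:
    """Flatten JSONL rows into CSV text, keeping only the given columns."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()

# One row shaped like the extracted.jsonl example above
rows = json.dumps({
    "url": "https://competitor.com/pricing",
    "lowest_price": "$29/month",
    "contact_email": "sales@competitor.com",
})
print(jsonl_to_csv(rows, ["url", "lowest_price", "contact_email"]))
```

Missing fields become empty cells and extra fields are ignored, so rows from different sites line up in the same columns.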
Option 5: Install with Supabase upload support
```bash
pip install -e ".[upload]"
```
This adds the openai and supabase packages needed for the upload command. After installing this way, you can also run website-crawler-upload directly instead of python -m webcrawler.upload_cli.
Option 6: Install with MCP server support
```bash
pip install -e ".[mcp]"
```
This adds the MCP SDK needed to run the webcrawler as an MCP server for AI agents (Claude Desktop, Cursor, Windsurf, etc.).
Option 7: Install everything
```bash
pip install -e ".[all]"
playwright install chromium
```
Cost
The crawler itself is completely free: crawling, text extraction, chunking, resume, JS rendering, and proxy support use no paid APIs.
Only two optional features require an OpenAI API key (and therefore have token costs):
| Feature | When it costs money | Typical cost |
|---|---|---|
| `extract_cli` (structured extraction) | When you use `--fields` or `--auto-fields` to extract structured data via LLM | ~$0.01-0.03 per page (varies by provider and model) |
| `upload_cli` (Supabase upload) | When generating embeddings for vector search | ~$0.0001 per page with `text-embedding-3-small` |
Extraction supports three LLM providers; use whichever you already have an API key for:

| Provider | Flag | API key env var | Default model |
|---|---|---|---|
| OpenAI | `--provider openai` | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Anthropic (Claude) | `--provider anthropic` | `ANTHROPIC_API_KEY` | `claude-sonnet-4-20250514` |
| Google Gemini | `--provider gemini` | `GEMINI_API_KEY` | `gemini-2.0-flash` |
You can use the full crawl pipeline (crawl → chunk → save files) without any API keys or costs.
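A back-of-envelope estimate using the rough per-page figures from the table above (illustrative arithmetic only; real costs vary by provider, model, and page length):

```python
pages = 142  # e.g. the three-site crawl shown later in this README

# Structured extraction: roughly $0.01-$0.03 per page
extract_low = pages * 0.01
extract_high = pages * 0.03

# Embeddings for upload: roughly $0.0001 per page with text-embedding-3-small
embed_cost = pages * 0.0001

print(f"extraction: ${extract_low:.2f}-${extract_high:.2f}")  # extraction: $1.42-$4.26
print(f"embeddings: ${embed_cost:.4f}")                       # embeddings: $0.0142
```

In other words, embedding an entire mid-sized site costs pennies; LLM extraction is where the budget goes.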
Quick start: full pipeline

```bash
# 1. Crawl a site
python -m webcrawler.cli \
  --base https://docs.example.com/ \
  --out ./output \
  --format markdown \
  --show-progress

# 2. Extract structured fields (requires OPENAI_API_KEY env var)
#    Pass multiple pages.jsonl files to analyze across sites
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --auto-fields \
  --context "competitor analysis" \
  --show-progress

# 3. Upload to Supabase (requires SUPABASE_URL, SUPABASE_KEY, OPENAI_API_KEY env vars)
python -m webcrawler.upload_cli \
  --jsonl ./output/pages.jsonl \
  --show-progress
```
Usage
Basic crawl
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --format markdown
```
Show progress output
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --format markdown \
  --show-progress
```
Include subdomains
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --include-subdomains
```
Plain text output
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --format text
```
Crawl a JavaScript-heavy site
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --render-js
```
This launches a headless Chromium browser to fully render each page before extracting content. Use this for React, Angular, Vue, or other SPA-based sites.
Faster crawling with concurrency
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --concurrency 5 \
  --show-progress
```
Fetches up to 5 pages in parallel. The delay is applied between batches rather than between individual requests.
Crawl through a proxy
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --proxy http://user:pass@proxy-host:8080
```
Works with both --render-js (Playwright) and standard requests.
Resume an interrupted crawl
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --resume \
  --show-progress
```
If a crawl is interrupted (Ctrl+C, crash, or --max-pages limit), the crawler saves its state to .crawl_state.json in the output directory. Use --resume to pick up where it left off without re-fetching pages already saved.
CLI arguments
| Argument | Description |
|---|---|
| `--base` | Base site URL to crawl |
| `--out` | Output directory |
| `--use-sitemap` | Use sitemap(s) when available |
| `--delay` | Delay between requests in seconds |
| `--timeout` | Per-request timeout in seconds |
| `--max-pages` | Maximum number of pages to save; 0 means unlimited |
| `--include-subdomains` | Include subdomains under the base domain |
| `--format` | `markdown` or `text` |
| `--show-progress` | Print progress and crawl events |
| `--min-words` | Skip pages with very little content |
| `--user-agent` | Override the default user agent |
| `--render-js` | Render JavaScript with Playwright before extracting (requires `.[js]`) |
| `--concurrency` | Number of pages to fetch in parallel (default: 1) |
| `--proxy` | HTTP/HTTPS proxy URL |
| `--resume` | Resume a previously interrupted crawl from saved state |
Output
For each page, the crawler writes:
- a `.md` or `.txt` file with extracted content
- a `pages.jsonl` index row for downstream ingestion
Example JSONL row:
```json
{
  "url": "https://www.WEBSITE-TO-CRAWL.com/page",
  "title": "Page Title",
  "path": "page__abc123def0.md",
  "text": "Extracted content..."
}
```
Example output tree:
```
output/
├── index__6dcd4ce23d.md
├── about__9c1185a5c5.md
├── docs-getting-started__0cc175b9c0.md
└── pages.jsonl
```
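Since each line of `pages.jsonl` is an independent JSON object, downstream ingestion can stream the index line by line instead of loading it all at once. A minimal sketch (the `iter_pages` helper and the word-count filter are illustrative, not part of this package):

```python
import json

def iter_pages(jsonl_path):
    """Yield one dict per crawled page from a pages.jsonl index."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

def substantial_titles(jsonl_path, min_words=50):
    """Titles of pages with enough text to be worth embedding (arbitrary threshold)."""
    return [
        page["title"]
        for page in iter_pages(jsonl_path)
        if len(page.get("text", "").split()) >= min_words
    ]
```

The same pattern works for feeding pages into an embedding pipeline or a search indexer one at a time.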
Uploading to Supabase for RAG
After crawling, you can chunk the output, generate embeddings, and upload directly to a Supabase table with pgvector for vector search.
1. Set up Supabase
In your Supabase project, go to the SQL Editor and run:
```sql
-- Enable the pgvector extension (one time per project)
create extension if not exists vector;

-- Create the documents table
create table documents (
  id bigserial primary key,
  url text not null,
  title text,
  chunk_text text not null,
  chunk_index integer not null,
  chunk_total integer not null,
  embedding vector(1536) not null,
  metadata jsonb default '{}'::jsonb,
  created_at timestamptz default now()
);

-- Create an index for fast similarity search
create index on documents using hnsw (embedding vector_cosine_ops);
```
Note on dimensions: The default embedding model (`text-embedding-3-small`) produces 1536-dimensional vectors. If you use a different model, update `vector(1536)` to match.
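If you do switch models, a quick sanity check before bulk insert avoids a failed upload partway through (a sketch; `check_dimensions` is an illustrative helper, and `1536` must match the `vector(...)` size declared in your table):

```python
def check_dimensions(embedding, expected=1536):
    """Fail fast if an embedding does not match the table's vector() size."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions but the table expects "
            f"{expected}; update vector({expected}) or change --embedding-model"
        )

check_dimensions([0.0] * 1536)  # matches the default model's output size
```

For example, `text-embedding-3-large` returns 3072-dimensional vectors and would fail this check against a `vector(1536)` table.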
2. Set environment variables
Create a .env file in the project root (it is already in .gitignore so it won't be committed):
```bash
# .env
SUPABASE_URL="https://your-project-id.supabase.co"
SUPABASE_KEY="your-service-role-key"

# LLM API keys: only one is needed for extraction; OpenAI is required for upload embeddings
OPENAI_API_KEY="your-openai-api-key"
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"
```
Then load it before running the upload:
```bash
set -a && source .env && set +a
```

(A plain `source .env` sets the variables in your shell but does not export them to child processes; `set -a` marks them for export so the Python CLIs can actually read them.)
Use your service-role key (not the anon key) since it bypasses Row Level Security for inserts.
Security note: All credentials are read from environment variables only; they are never accepted as command-line arguments, to avoid leaking secrets in shell history or process listings. Never commit your `.env` file to git.
Credential management options
| Approach | Security | Complexity | Best for |
|---|---|---|---|
| `.env` file + `.gitignore` | Basic | Low | Local dev, personal projects |
| OS keychain (macOS Keychain, etc.) | Good | Medium | Single-user local tools |
| Secret manager (AWS SSM, GCP Secret Manager, Vault) | High | Higher | Production, teams, CI/CD |
This project uses the .env approach. If you deploy this as a service or share it with a team, consider upgrading to a secret manager.
3. Run the upload
```bash
python -m webcrawler.upload_cli \
  --jsonl ./output/pages.jsonl \
  --show-progress
```
Upload CLI arguments
| Argument | Description |
|---|---|
| `--jsonl` | Path to `pages.jsonl` from the crawler |
| `--table` | Target table name (default: `documents`) |
| `--max-words` | Max words per chunk (default: 400) |
| `--overlap-words` | Overlap words between chunks (default: 50) |
| `--embedding-model` | OpenAI embedding model (default: `text-embedding-3-small`) |
| `--show-progress` | Print progress during upload |

| Environment variable | Description |
|---|---|
| `SUPABASE_URL` | Supabase project URL (required) |
| `SUPABASE_KEY` | Supabase service-role key (required) |
| `OPENAI_API_KEY` | OpenAI API key for embeddings (required) |
4. Query with vector search
Once uploaded, you can find relevant chunks using cosine similarity:
```sql
-- Replace the array with your query's embedding vector
select
  url,
  title,
  chunk_text,
  1 - (embedding <=> '[0.012, -0.003, ...]') as similarity
from documents
order by embedding <=> '[0.012, -0.003, ...]'
limit 5;
```
In practice, you would generate the query embedding in your application code:
```python
import os

import openai
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
client = openai.OpenAI()  # uses OPENAI_API_KEY env var

# Embed the user's question
query = "How do I set up authentication?"
response = client.embeddings.create(input=[query], model="text-embedding-3-small")
query_embedding = response.data[0].embedding

# Search for the most relevant chunks
result = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "match_count": 5},
).execute()

for row in result.data:
    print(f"{row['similarity']:.3f}  {row['url']}")
    print(f"  {row['chunk_text'][:120]}...\n")
```
To use the match_documents RPC, create this function in Supabase:
```sql
create or replace function match_documents(
  query_embedding vector(1536),
  match_count int default 5
)
returns table (
  id bigint,
  url text,
  title text,
  chunk_text text,
  similarity float
)
language sql stable
as $$
  select
    id,
    url,
    title,
    chunk_text,
    1 - (embedding <=> query_embedding) as similarity
  from documents
  order by embedding <=> query_embedding
  limit match_count;
$$;
```
Supabase recommendations
- HNSW index: The `create index ... using hnsw` statement above creates an approximate nearest-neighbor index. This is much faster than exact search for tables with more than a few thousand rows. (Supabase HNSW docs)
- Service-role key: Use the service-role key for bulk inserts. For user-facing queries, use the anon key with Row Level Security enabled.
- Embedding model: `text-embedding-3-small` is a good balance of cost and quality. For higher accuracy, use `text-embedding-3-large` (3072 dimensions; update the `vector()` size accordingly). (OpenAI embeddings guide)
- Chunk size: The default 400 words with 50-word overlap works well for most documentation. Decrease for short-form content, increase for long technical documents.
- pgvector reference: The `<=>` operator is cosine distance (lower = more similar). See the pgvector documentation for all available distance operators.
These recommendations were verified against official documentation as of April 2026.
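The default chunking (400-word windows with 50-word overlap) is a plain sliding window over words. This sketch shows the idea; it is not the project's `chunker` module, just the same windowing scheme:

```python
def chunk_words(text, max_words=400, overlap_words=50):
    """Split text into overlapping word windows (mirrors the upload defaults)."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A 1,000-word document yields windows starting at words 0, 350, and 700
print(len(chunk_words("word " * 1000)))  # 3
```

Each consecutive pair of chunks shares 50 words at the boundary, so a sentence split across a window edge still appears whole in at least one chunk.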
Structured extraction with LLM
After crawling, you can use an LLM to extract specific fields from each page: useful for competitive research, API documentation analysis, or building structured datasets.
Option A: Let the LLM discover fields automatically
Don't know what fields to look for? Point the tool at your crawled pages and let it figure out what's worth extracting. This works best when you pass multiple crawled sites โ the LLM samples pages from each site and suggests fields that work consistently across all of them.
Recommended workflow: crawl 2-3 sites first, then discover fields across all of them:

```bash
# Step 1: Crawl multiple competitor sites
python -m webcrawler.cli --base https://competitor1.com --out ./comp1 --show-progress
python -m webcrawler.cli --base https://competitor2.com --out ./comp2 --show-progress
python -m webcrawler.cli --base https://competitor3.com --out ./comp3 --show-progress

# Step 2: Auto-discover fields across all 3 sites
python -m webcrawler.extract_cli \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
  --auto-fields \
  --context "competitor pricing and product analysis" \
  --show-progress
```
The tool samples pages from each site, ensuring the suggested fields are useful for cross-site comparison rather than specific to one site. Example output:
```
[info] loaded 142 page(s) from 3 file(s)
[discover] analyzing 3 sample page(s) to suggest fields...
[discover] sampling across 3 site(s) for cross-site field consistency
[discover] context: competitor pricing and product analysis
[discover] suggested fields: company_name, product_name, pricing_tiers, free_trial, key_features, target_market, integrations, support_options, api_available, deployment_model
[extract] 1/142 → https://competitor1.com/
[extract] 2/142 → https://competitor1.com/pricing
...
```
The output extracted.jsonl includes a source_file field so you can tell which site each row came from.
You can also control how many pages to sample:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl \
  --auto-fields \
  --context "API documentation review" \
  --sample-size 6 \
  --show-progress
```
It also works with a single site if you just want to explore one crawl:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --auto-fields \
  --show-progress
```
Option B: Specify fields manually
If you already know what you're looking for:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields company_name pricing features target_audience \
  --show-progress
```
This also accepts multiple JSONL files:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl \
  --fields company_name pricing features \
  --output ./comparison.jsonl \
  --show-progress
```
Example: Extract API details from documentation
```bash
# Using OpenAI (default)
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields api_endpoint http_method parameters response_format authentication \
  --show-progress

# Using Anthropic Claude
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields api_endpoint http_method parameters response_format authentication \
  --provider anthropic \
  --show-progress

# Using Google Gemini
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields api_endpoint http_method parameters response_format authentication \
  --provider gemini \
  --show-progress
```
This produces an extracted.jsonl file with structured data:
```json
{
  "url": "https://docs.example.com/api/users",
  "title": "Users API",
  "api_endpoint": "/api/v1/users",
  "http_method": "GET, POST",
  "parameters": "id, name, email, role",
  "response_format": "JSON",
  "authentication": "Bearer token in Authorization header"
}
```
Extraction CLI arguments
| Argument | Description |
|---|---|
| `--jsonl` | Path(s) to `pages.jsonl` file(s); pass multiple to analyze across sites |
| `--fields` | Field names to extract (space-separated). Mutually exclusive with `--auto-fields`. |
| `--auto-fields` | Automatically discover fields by sampling pages across all input files. Mutually exclusive with `--fields`. |
| `--context` | Describe your goal to improve auto-field discovery (e.g. "competitor analysis") |
| `--sample-size` | Number of pages to sample for `--auto-fields` (default: 3). Samples are spread across all input files. |
| `--provider` | LLM provider: `openai`, `anthropic`, or `gemini` (default: `openai`) |
| `--output` | Output JSONL path (default: `extracted.jsonl` in the first input file's directory) |
| `--model` | LLM model name (defaults to the provider's recommended model) |
| `--show-progress` | Print progress during extraction |

| Environment variable | Required when |
|---|---|
| `OPENAI_API_KEY` | `--provider openai` (default) |
| `ANTHROPIC_API_KEY` | `--provider anthropic` |
| `GEMINI_API_KEY` | `--provider gemini` |
Tips
- Start with `--auto-fields` across 2-3 sites; this gives the LLM enough variety to suggest fields that work for comparison, not just fields unique to one site
- Use `--context` to steer field discovery: "competitor pricing analysis" suggests different fields than "API documentation review"
- Use descriptive field names with `--fields`; the LLM uses them to understand what to look for
- `gpt-4o-mini` is fast and cheap for most extraction tasks; use `gpt-4o` for complex pages
- Each page sends up to 8,000 characters to the LLM to stay within reasonable token limits
Using with AI agents (MCP)
This crawler includes a built-in Model Context Protocol (MCP) server, making it a plug-and-play data source for AI agents in Claude Desktop, Cursor, Windsurf, VS Code, and other MCP-compatible clients.
Install
```bash
pip install -e ".[mcp]"
```
Configure your MCP client
Add this to your MCP client's configuration:
Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):
```json
{
  "mcpServers": {
    "webcrawler": {
      "command": "python",
      "args": ["-m", "webcrawler.mcp_server"],
      "env": {
        "OPENAI_API_KEY": "your-key-here",
        "WEBCRAWLER_OUTPUT_DIR": "./crawl_output"
      }
    }
  }
}
```
Cursor / VS Code (.cursor/mcp.json or equivalent):
```json
{
  "mcpServers": {
    "webcrawler": {
      "command": "python",
      "args": ["-m", "webcrawler.mcp_server"]
    }
  }
}
```
Available MCP tools
Once connected, your AI agent can use these tools:
| Tool | Description |
|---|---|
| `crawl_site` | Crawl a website and save extracted content. Returns a summary. |
| `list_pages` | List all crawled pages with titles and word counts. |
| `read_page` | Read the full content of a specific crawled page by URL. |
| `search_pages` | Search through crawled pages by keyword. |
| `extract_data` | Extract structured fields from pages using an LLM. Auto-discovers fields or uses specified ones. |
Example conversation with an AI agent
You: "Crawl the Stripe API docs and tell me about their authentication methods."

Agent (uses `crawl_site`): Crawled 87 pages from https://docs.stripe.com/

Agent (uses `search_pages` with query "authentication"): Found 5 results...

Agent (uses `read_page`): reads the full auth page

Agent: "Stripe supports three authentication methods: API keys, OAuth 2.0, and..."
Environment variables
| Variable | Description |
|---|---|
| `WEBCRAWLER_OUTPUT_DIR` | Default output directory for crawled data (default: `./crawl_output`) |
| `OPENAI_API_KEY` | Required only if using the `extract_data` tool |
Running the MCP server standalone
```bash
python -m webcrawler.mcp_server
```
Good fit for
- RAG ingestion and agentic AI workflows
- knowledge base extraction
- internal site archiving
- documentation indexing
- competitor or market research on public pages
Not currently designed for
- authenticated crawling
- PDF extraction
- anti-bot evasion
Open-source roadmap
- Package publishing
- Automated tests
- GitHub Actions CI
- Canonical URL support
- Duplicate-content detection
- Optional chunking for embeddings
- Supabase / pgvector upload
- Browser-rendered page mode (Playwright)
- Concurrent fetching
- Proxy support
- Resume interrupted crawls
- LLM-powered structured extraction
- MCP server for AI agents
- PDF support
Legal and ethical use
Use this tool responsibly.
- Respect `robots.txt`, site terms, and rate limits.
- Only crawl content you are authorized to access.
- Do not use this project to evade access controls or scrape private content.
- You are responsible for complying with the target site's policies and applicable laws.
Contributing
Please read CONTRIBUTING.md before opening a pull request.
Security
If you discover a security issue, please follow the instructions in SECURITY.md.
License
This project is licensed under the MIT License. See LICENSE.