Fast Python web crawler for AI & RAG ingestion: crawl, extract, and embed website content with one tool.
MarkCrawl by iD8
Fast Python Web Crawler for AI & RAG Ingestion
A lightweight Python website crawler that extracts clean Markdown or plain text from websites for AI ingestion, RAG pipelines, search indexing, and documentation archiving.
This project starts with a sitemap when available, respects robots.txt, keeps crawling in-scope, and writes both page files and a pages.jsonl index for downstream processing.
How it works
```mermaid
flowchart LR
    A["Website"] --> B["Crawl"]
    B --> C["pages.jsonl\n+ .md files"]
    C --> D{"What next?"}
    D -->|"--auto-fields\nor --fields"| E["Extract\n(OpenAI / Claude / Gemini)"]
    D -->|"upload_cli"| F["Chunk +\nEmbed"]
    E --> G["extracted.jsonl\nStructured data"]
    F --> H["Supabase\npgvector"]
    H --> I["Vector\nSearch / RAG"]
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#fff9c4
    style E fill:#fce4ec
    style F fill:#fce4ec
    style G fill:#e8f5e9
    style H fill:#e8f5e9
    style I fill:#e8f5e9
```
- Free path: Crawl → `pages.jsonl` → done. No API keys needed.
- Extraction path: Add `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `GEMINI_API_KEY` to pull structured fields from pages using the LLM of your choice.
- RAG path: Add Supabase credentials to chunk, embed, and store for vector search.
Why this exists
A lot of crawlers are either too heavyweight for small ingestion jobs or too focused on broad web scraping. This project is intentionally simple:
- crawl a single site or subdomain set
- extract readable content instead of raw HTML
- produce output that is easy to load into embeddings, vector stores, or search pipelines
- stay understandable and hackable for contributors
Features
- Sitemap-first crawling
- `robots.txt` checks
- Optional subdomain support
- Markdown or plain text output
- Progress logging with a single CLI flag
- Retry and backoff support for transient errors
- Safe filenames with URL hashing
- JSONL index for ingestion workflows
- Basic content cleanup for nav / footer / utility elements
- Built-in text chunking for embeddings
- Supabase / pgvector upload with OpenAI embeddings
- Optional JavaScript rendering via Playwright
- Concurrent page fetching
- Proxy support
- Resume interrupted crawls
- LLM-powered structured extraction
Project structure
```
.
├── README.md
├── LICENSE
├── .gitignore
├── requirements.txt
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── SECURITY.md
├── tests/
│   ├── test_core.py
│   └── test_chunker.py
└── webcrawler/
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── chunker.py
    ├── upload.py
    ├── upload_cli.py
    ├── extract.py
    ├── extract_cli.py
    └── mcp_server.py
```
Installation
Option 1: Run locally
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Option 2: Install as a local package
```bash
pip install -e .
```
Option 3: Install with JavaScript rendering support
```bash
pip install -e ".[js]"
playwright install chromium
```
This adds Playwright for rendering JavaScript-heavy sites with --render-js.
Option 4: Install with structured extraction support
```bash
pip install -e ".[extract]"
```
This adds the openai package needed for LLM-powered field extraction via python -m webcrawler.extract_cli.
What does extraction add? The base crawler gives you the full text content of every page. The extraction step uses an LLM to pull out specific structured fields you define. Here's the difference:
Without LLM extraction, you get raw page content:

```json
{
  "url": "https://competitor.com/pricing",
  "title": "Pricing - Competitor",
  "text": "Pricing Plans\n\nStarter\n$29/month\nUp to 1,000 API calls...\n\nPro\n$99/month\nUp to 50,000 API calls...\n\nEnterprise\nContact us\nUnlimited API calls, SLA, dedicated support...\n\nAll plans include SSL, 99.9% uptime, and REST API access.\n\nQuestions? Contact sales@competitor.com"
}
```
This is useful for search and RAG, but you'd need to manually read through hundreds of pages to compare competitors or find specific details.
With LLM extraction (`--fields pricing_tiers lowest_price enterprise_available api_included contact_email`):

```json
{
  "url": "https://competitor.com/pricing",
  "title": "Pricing - Competitor",
  "pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
  "lowest_price": "$29/month",
  "enterprise_available": "Yes, contact sales for pricing",
  "api_included": "Yes, REST API on all plans",
  "contact_email": "sales@competitor.com"
}
```
Now you can load this into a spreadsheet or database and instantly compare across 10 competitors, with no manual reading required. You define the fields, the LLM finds the answers.
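As a sketch of that spreadsheet step (the field names and row below are taken from the example output above; `jsonl_to_csv` is an illustrative helper, not part of this package), the extracted rows can be flattened into CSV:

```python
import csv
import io
import json

def jsonl_to_csv(jsonl_text: str, fields: list[str]) -> str:
    """Flatten JSONL rows into CSV text, keeping only the given columns."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()

# One row shaped like the extracted.jsonl example above
rows = json.dumps({
    "url": "https://competitor.com/pricing",
    "lowest_price": "$29/month",
    "contact_email": "sales@competitor.com",
})
print(jsonl_to_csv(rows, ["url", "lowest_price", "contact_email"]))
```

Missing fields become empty cells and extra fields are ignored, so rows from different sites line up in the same columns.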
Option 5: Install with Supabase upload support
```bash
pip install -e ".[upload]"
```
This adds the openai and supabase packages needed for the upload command. After installing this way, you can also run website-crawler-upload directly instead of python -m webcrawler.upload_cli.
Option 6: Install with MCP server support
```bash
pip install -e ".[mcp]"
```
This adds the MCP SDK needed to run the webcrawler as an MCP server for AI agents (Claude Desktop, Cursor, Windsurf, etc.).
Option 7: Install everything
```bash
pip install -e ".[all]"
playwright install chromium
```
Cost
The crawler itself is completely free: crawling, text extraction, chunking, resume, JS rendering, and proxy support use no paid APIs.
Only two optional features require an OpenAI API key (and therefore have token costs):
| Feature | When it costs money | Typical cost |
|---|---|---|
| `extract_cli` (structured extraction) | When you use `--fields` or `--auto-fields` to extract structured data via LLM | ~$0.01-0.03 per page (varies by provider and model) |
| `upload_cli` (Supabase upload) | When generating embeddings for vector search | ~$0.0001 per page with `text-embedding-3-small` |
Extraction supports three LLM providers; use whichever you already have an API key for:

| Provider | Flag | API key env var | Default model |
|---|---|---|---|
| OpenAI | `--provider openai` | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Anthropic (Claude) | `--provider anthropic` | `ANTHROPIC_API_KEY` | `claude-sonnet-4-20250514` |
| Google Gemini | `--provider gemini` | `GEMINI_API_KEY` | `gemini-2.0-flash` |
You can use the full crawl pipeline (crawl → chunk → save files) without any API keys or costs.
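A back-of-envelope estimate using the rough per-page figures from the table above (illustrative arithmetic only; real costs vary by provider, model, and page length):

```python
pages = 142  # e.g. the three-site crawl shown later in this README

# Structured extraction: roughly $0.01-$0.03 per page
extract_low = pages * 0.01
extract_high = pages * 0.03

# Embeddings for upload: roughly $0.0001 per page with text-embedding-3-small
embed_cost = pages * 0.0001

print(f"extraction: ${extract_low:.2f}-${extract_high:.2f}")  # extraction: $1.42-$4.26
print(f"embeddings: ${embed_cost:.4f}")                       # embeddings: $0.0142
```

In other words, embedding an entire mid-sized site costs pennies; LLM extraction is where the budget goes.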
Quick start: full pipeline

```bash
# 1. Crawl a site
python -m webcrawler.cli \
  --base https://docs.example.com/ \
  --out ./output \
  --format markdown \
  --show-progress

# 2. Extract structured fields (requires OPENAI_API_KEY env var)
#    Pass multiple pages.jsonl files to analyze across sites
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --auto-fields \
  --context "competitor analysis" \
  --show-progress

# 3. Upload to Supabase (requires SUPABASE_URL, SUPABASE_KEY, OPENAI_API_KEY env vars)
python -m webcrawler.upload_cli \
  --jsonl ./output/pages.jsonl \
  --show-progress
```
Usage
Basic crawl
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --format markdown
```
Show progress output
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --format markdown \
  --show-progress
```
Include subdomains
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --include-subdomains
```
Plain text output
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --format text
```
Crawl a JavaScript-heavy site
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --render-js
```
This launches a headless Chromium browser to fully render each page before extracting content. Use this for React, Angular, Vue, or other SPA-based sites.
Faster crawling with concurrency
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --concurrency 5 \
  --show-progress
```
Fetches up to 5 pages in parallel. The delay is applied between batches rather than between individual requests.
Crawl through a proxy
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --proxy http://user:pass@proxy-host:8080
```
Works with both --render-js (Playwright) and standard requests.
Resume an interrupted crawl
```bash
python -m webcrawler.cli \
  --base https://www.WEBSITE-TO-CRAWL.com/ \
  --out ./output \
  --resume \
  --show-progress
```
If a crawl is interrupted (Ctrl+C, crash, or --max-pages limit), the crawler saves its state to .crawl_state.json in the output directory. Use --resume to pick up where it left off without re-fetching pages already saved.
CLI arguments
| Argument | Description |
|---|---|
| `--base` | Base site URL to crawl |
| `--out` | Output directory |
| `--use-sitemap` | Use sitemap(s) when available |
| `--delay` | Delay between requests in seconds |
| `--timeout` | Per-request timeout in seconds |
| `--max-pages` | Maximum number of pages to save; 0 means unlimited |
| `--include-subdomains` | Include subdomains under the base domain |
| `--format` | `markdown` or `text` |
| `--show-progress` | Print progress and crawl events |
| `--min-words` | Skip pages with very little content |
| `--user-agent` | Override the default user agent |
| `--render-js` | Render JavaScript with Playwright before extracting (requires `.[js]`) |
| `--concurrency` | Number of pages to fetch in parallel (default: 1) |
| `--proxy` | HTTP/HTTPS proxy URL |
| `--resume` | Resume a previously interrupted crawl from saved state |
Output
For each page, the crawler writes:
- a `.md` or `.txt` file with extracted content
- a `pages.jsonl` index row for downstream ingestion
Example JSONL row:
```json
{
  "url": "https://www.WEBSITE-TO-CRAWL.com/page",
  "title": "Page Title",
  "path": "page__abc123def0.md",
  "text": "Extracted content..."
}
```
Example output tree:
```
output/
├── index__6dcd4ce23d.md
├── about__9c1185a5c5.md
├── docs-getting-started__0cc175b9c0.md
└── pages.jsonl
```
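Since each line of `pages.jsonl` is an independent JSON object, downstream ingestion can stream the index line by line instead of loading it all at once. A minimal sketch (the `iter_pages` helper and the word-count filter are illustrative, not part of this package):

```python
import json

def iter_pages(jsonl_path):
    """Yield one dict per crawled page from a pages.jsonl index."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

def substantial_titles(jsonl_path, min_words=50):
    """Titles of pages with enough text to be worth embedding (arbitrary threshold)."""
    return [
        page["title"]
        for page in iter_pages(jsonl_path)
        if len(page.get("text", "").split()) >= min_words
    ]
```

The same pattern works for feeding pages into an embedding pipeline or a search indexer one at a time.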
Uploading to Supabase for RAG
After crawling, you can chunk the output, generate embeddings, and upload directly to a Supabase table with pgvector for vector search.
1. Set up Supabase
In your Supabase project, go to the SQL Editor and run:
```sql
-- Enable the pgvector extension (one time per project)
create extension if not exists vector;

-- Create the documents table
create table documents (
  id bigserial primary key,
  url text not null,
  title text,
  chunk_text text not null,
  chunk_index integer not null,
  chunk_total integer not null,
  embedding vector(1536) not null,
  metadata jsonb default '{}'::jsonb,
  created_at timestamptz default now()
);

-- Create an index for fast similarity search
create index on documents using hnsw (embedding vector_cosine_ops);
```
Note on dimensions: The default embedding model (`text-embedding-3-small`) produces 1536-dimensional vectors. If you use a different model, update `vector(1536)` to match.
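If you do switch models, a quick sanity check before bulk insert avoids a failed upload partway through (a sketch; `check_dimensions` is an illustrative helper, and `1536` must match the `vector(...)` size declared in your table):

```python
def check_dimensions(embedding, expected=1536):
    """Fail fast if an embedding does not match the table's vector() size."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions but the table expects "
            f"{expected}; update vector({expected}) or change --embedding-model"
        )

check_dimensions([0.0] * 1536)  # matches the default model's output size
```

For example, `text-embedding-3-large` returns 3072-dimensional vectors and would fail this check against a `vector(1536)` table.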
2. Set environment variables
Create a .env file in the project root (it is already in .gitignore so it won't be committed):
```bash
# .env
SUPABASE_URL="https://your-project-id.supabase.co"
SUPABASE_KEY="your-service-role-key"

# LLM API keys: only one is needed for extraction; OpenAI is required for upload embeddings
OPENAI_API_KEY="your-openai-api-key"
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"
```
Then load it before running the upload:
```bash
set -a && source .env && set +a
```

(A plain `source .env` sets the variables in your shell but does not export them to child processes; `set -a` marks them for export so the Python CLIs can actually read them.)
Use your service-role key (not the anon key) since it bypasses Row Level Security for inserts.
Security note: All credentials are read from environment variables only; they are never accepted as command-line arguments, to avoid leaking secrets in shell history or process listings. Never commit your `.env` file to git.
Credential management options
| Approach | Security | Complexity | Best for |
|---|---|---|---|
| `.env` file + `.gitignore` | Basic | Low | Local dev, personal projects |
| OS keychain (macOS Keychain, etc.) | Good | Medium | Single-user local tools |
| Secret manager (AWS SSM, GCP Secret Manager, Vault) | High | Higher | Production, teams, CI/CD |
This project uses the .env approach. If you deploy this as a service or share it with a team, consider upgrading to a secret manager.
3. Run the upload
```bash
python -m webcrawler.upload_cli \
  --jsonl ./output/pages.jsonl \
  --show-progress
```
Upload CLI arguments
| Argument | Description |
|---|---|
| `--jsonl` | Path to `pages.jsonl` from the crawler |
| `--table` | Target table name (default: `documents`) |
| `--max-words` | Max words per chunk (default: 400) |
| `--overlap-words` | Overlap words between chunks (default: 50) |
| `--embedding-model` | OpenAI embedding model (default: `text-embedding-3-small`) |
| `--show-progress` | Print progress during upload |

| Environment variable | Description |
|---|---|
| `SUPABASE_URL` | Supabase project URL (required) |
| `SUPABASE_KEY` | Supabase service-role key (required) |
| `OPENAI_API_KEY` | OpenAI API key for embeddings (required) |
4. Query with vector search
Once uploaded, you can find relevant chunks using cosine similarity:
```sql
-- Replace the array with your query's embedding vector
select
  url,
  title,
  chunk_text,
  1 - (embedding <=> '[0.012, -0.003, ...]') as similarity
from documents
order by embedding <=> '[0.012, -0.003, ...]'
limit 5;
```
In practice, you would generate the query embedding in your application code:
```python
import os

import openai
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
client = openai.OpenAI()  # uses OPENAI_API_KEY env var

# Embed the user's question
query = "How do I set up authentication?"
response = client.embeddings.create(input=[query], model="text-embedding-3-small")
query_embedding = response.data[0].embedding

# Search for the most relevant chunks
result = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "match_count": 5},
).execute()

for row in result.data:
    print(f"{row['similarity']:.3f}  {row['url']}")
    print(f"  {row['chunk_text'][:120]}...\n")
```
To use the match_documents RPC, create this function in Supabase:
```sql
create or replace function match_documents(
  query_embedding vector(1536),
  match_count int default 5
)
returns table (
  id bigint,
  url text,
  title text,
  chunk_text text,
  similarity float
)
language sql stable
as $$
  select
    id,
    url,
    title,
    chunk_text,
    1 - (embedding <=> query_embedding) as similarity
  from documents
  order by embedding <=> query_embedding
  limit match_count;
$$;
```
Supabase recommendations
- HNSW index: The `create index ... using hnsw` statement above creates an approximate nearest-neighbor index. This is much faster than exact search for tables with more than a few thousand rows. (Supabase HNSW docs)
- Service-role key: Use the service-role key for bulk inserts. For user-facing queries, use the anon key with Row Level Security enabled.
- Embedding model: `text-embedding-3-small` is a good balance of cost and quality. For higher accuracy, use `text-embedding-3-large` (3072 dimensions; update the `vector()` size accordingly). (OpenAI embeddings guide)
- Chunk size: The default 400 words with 50-word overlap works well for most documentation. Decrease for short-form content, increase for long technical documents.
- pgvector reference: The `<=>` operator is cosine distance (lower = more similar). See the pgvector documentation for all available distance operators.
These recommendations were verified against official documentation as of April 2026.
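The default chunking (400-word windows with 50-word overlap) is a plain sliding window over words. This sketch shows the idea; it is not the project's `chunker` module, just the same windowing scheme:

```python
def chunk_words(text, max_words=400, overlap_words=50):
    """Split text into overlapping word windows (mirrors the upload defaults)."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A 1,000-word document yields windows starting at words 0, 350, and 700
print(len(chunk_words("word " * 1000)))  # 3
```

Each consecutive pair of chunks shares 50 words at the boundary, so a sentence split across a window edge still appears whole in at least one chunk.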
Structured extraction with LLM
After crawling, you can use an LLM to extract specific fields from each page: useful for competitive research, API documentation analysis, or building structured datasets.
Option A: Let the LLM discover fields automatically
Don't know what fields to look for? Point the tool at your crawled pages and let it figure out what's worth extracting. This works best when you pass multiple crawled sites โ the LLM samples pages from each site and suggests fields that work consistently across all of them.
Recommended workflow: crawl 2-3 sites first, then discover fields across all of them:

```bash
# Step 1: Crawl multiple competitor sites
python -m webcrawler.cli --base https://competitor1.com --out ./comp1 --show-progress
python -m webcrawler.cli --base https://competitor2.com --out ./comp2 --show-progress
python -m webcrawler.cli --base https://competitor3.com --out ./comp3 --show-progress

# Step 2: Auto-discover fields across all 3 sites
python -m webcrawler.extract_cli \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
  --auto-fields \
  --context "competitor pricing and product analysis" \
  --show-progress
```
The tool samples pages from each site, ensuring the suggested fields are useful for cross-site comparison rather than specific to one site. Example output:
```
[info] loaded 142 page(s) from 3 file(s)
[discover] analyzing 3 sample page(s) to suggest fields...
[discover] sampling across 3 site(s) for cross-site field consistency
[discover] context: competitor pricing and product analysis
[discover] suggested fields: company_name, product_name, pricing_tiers, free_trial, key_features, target_market, integrations, support_options, api_available, deployment_model
[extract] 1/142 → https://competitor1.com/
[extract] 2/142 → https://competitor1.com/pricing
...
```
The output extracted.jsonl includes a source_file field so you can tell which site each row came from.
You can also control how many pages to sample:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl \
  --auto-fields \
  --context "API documentation review" \
  --sample-size 6 \
  --show-progress
```
It also works with a single site if you just want to explore one crawl:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --auto-fields \
  --show-progress
```
Option B: Specify fields manually
If you already know what you're looking for:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields company_name pricing features target_audience \
  --show-progress
```
This also accepts multiple JSONL files:
```bash
python -m webcrawler.extract_cli \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl \
  --fields company_name pricing features \
  --output ./comparison.jsonl \
  --show-progress
```
Example: Extract API details from documentation
```bash
# Using OpenAI (default)
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields api_endpoint http_method parameters response_format authentication \
  --show-progress

# Using Anthropic Claude
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields api_endpoint http_method parameters response_format authentication \
  --provider anthropic \
  --show-progress

# Using Google Gemini
python -m webcrawler.extract_cli \
  --jsonl ./output/pages.jsonl \
  --fields api_endpoint http_method parameters response_format authentication \
  --provider gemini \
  --show-progress
```
This produces an extracted.jsonl file with structured data:
```json
{
  "url": "https://docs.example.com/api/users",
  "title": "Users API",
  "api_endpoint": "/api/v1/users",
  "http_method": "GET, POST",
  "parameters": "id, name, email, role",
  "response_format": "JSON",
  "authentication": "Bearer token in Authorization header"
}
```
Extraction CLI arguments
| Argument | Description |
|---|---|
| `--jsonl` | Path(s) to `pages.jsonl` file(s); pass multiple to analyze across sites |
| `--fields` | Field names to extract (space-separated). Mutually exclusive with `--auto-fields`. |
| `--auto-fields` | Automatically discover fields by sampling pages across all input files. Mutually exclusive with `--fields`. |
| `--context` | Describe your goal to improve auto-field discovery (e.g. "competitor analysis") |
| `--sample-size` | Number of pages to sample for `--auto-fields` (default: 3). Samples are spread across all input files. |
| `--provider` | LLM provider: `openai`, `anthropic`, or `gemini` (default: `openai`) |
| `--output` | Output JSONL path (default: `extracted.jsonl` in the first input file's directory) |
| `--model` | LLM model name (defaults to the provider's recommended model) |
| `--show-progress` | Print progress during extraction |

| Environment variable | Required when |
|---|---|
| `OPENAI_API_KEY` | `--provider openai` (default) |
| `ANTHROPIC_API_KEY` | `--provider anthropic` |
| `GEMINI_API_KEY` | `--provider gemini` |
Tips
- Start with `--auto-fields` across 2-3 sites; this gives the LLM enough variety to suggest fields that work for comparison, not just fields unique to one site
- Use `--context` to steer field discovery: "competitor pricing analysis" suggests different fields than "API documentation review"
- Use descriptive field names with `--fields`; the LLM uses them to understand what to look for
- `gpt-4o-mini` is fast and cheap for most extraction tasks; use `gpt-4o` for complex pages
- Each page sends up to 8,000 characters to the LLM to stay within reasonable token limits
Using with AI agents (MCP)
This crawler includes a built-in Model Context Protocol (MCP) server, making it a plug-and-play data source for AI agents in Claude Desktop, Cursor, Windsurf, VS Code, and other MCP-compatible clients.
Install
```bash
pip install -e ".[mcp]"
```
Configure your MCP client
Add this to your MCP client's configuration:
Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):
```json
{
  "mcpServers": {
    "webcrawler": {
      "command": "python",
      "args": ["-m", "webcrawler.mcp_server"],
      "env": {
        "OPENAI_API_KEY": "your-key-here",
        "WEBCRAWLER_OUTPUT_DIR": "./crawl_output"
      }
    }
  }
}
```
Cursor / VS Code (.cursor/mcp.json or equivalent):
```json
{
  "mcpServers": {
    "webcrawler": {
      "command": "python",
      "args": ["-m", "webcrawler.mcp_server"]
    }
  }
}
```
Available MCP tools
Once connected, your AI agent can use these tools:
| Tool | Description |
|---|---|
| `crawl_site` | Crawl a website and save extracted content. Returns a summary. |
| `list_pages` | List all crawled pages with titles and word counts. |
| `read_page` | Read the full content of a specific crawled page by URL. |
| `search_pages` | Search through crawled pages by keyword. |
| `extract_data` | Extract structured fields from pages using an LLM. Auto-discovers fields or uses specified ones. |
Example conversation with an AI agent
You: "Crawl the Stripe API docs and tell me about their authentication methods."

Agent (uses `crawl_site`): Crawled 87 pages from https://docs.stripe.com/

Agent (uses `search_pages` with query "authentication"): Found 5 results...

Agent (uses `read_page`): reads the full auth page

Agent: "Stripe supports three authentication methods: API keys, OAuth 2.0, and..."
Environment variables
| Variable | Description |
|---|---|
| `WEBCRAWLER_OUTPUT_DIR` | Default output directory for crawled data (default: `./crawl_output`) |
| `OPENAI_API_KEY` | Required only if using the `extract_data` tool |
Running the MCP server standalone
```bash
python -m webcrawler.mcp_server
```
Good fit for
- RAG ingestion and agentic AI workflows
- knowledge base extraction
- internal site archiving
- documentation indexing
- competitor or market research on public pages
Not currently designed for
- authenticated crawling
- PDF extraction
- anti-bot evasion
Open-source roadmap
- Package publishing
- Automated tests
- GitHub Actions CI
- Canonical URL support
- Duplicate-content detection
- Optional chunking for embeddings
- Supabase / pgvector upload
- Browser-rendered page mode (Playwright)
- Concurrent fetching
- Proxy support
- Resume interrupted crawls
- LLM-powered structured extraction
- MCP server for AI agents
- PDF support
Legal and ethical use
Use this tool responsibly.
- Respect `robots.txt`, site terms, and rate limits.
- Only crawl content you are authorized to access.
- Do not use this project to evade access controls or scrape private content.
- You are responsible for complying with the target site's policies and applicable laws.
Contributing
Please read CONTRIBUTING.md before opening a pull request.
Security
If you discover a security issue, please follow the instructions in SECURITY.md.
License
This project is licensed under the MIT License. See LICENSE.