Web scraping and content extraction tool

CrawlVox

Extract clean, structured content from any website.

A command-line web crawler that handles static HTML, JavaScript-heavy SPAs, and PDFs — built for content migration, web archiving, and research workflows.

Python 3.11+  |  License: MIT  |  Code style: black


Crawl websites at scale  |  Extract articles, metadata, images & PDFs  |  Export to JSONL, CSV or Markdown


Personal Use & Legal Notice

CrawlVox is intended for personal and educational use only. Always respect the laws of your jurisdiction, website terms of service, and robots.txt directives before crawling any website. The authors are not responsible for misuse. By using this software, you agree to comply with all applicable local, national, and international laws, including but not limited to data protection regulations (GDPR, CCPA), computer fraud laws, and intellectual property rights. Do not use this tool to scrape websites without authorization.


Why CrawlVox?

Most web scrapers give you raw HTML and leave you to figure out the content. CrawlVox gives you clean, readable text with full metadata — ready to use.

crawlvox crawl https://example.com --max-pages 50 --export-jsonl results.jsonl

That's it. 50 pages crawled, content extracted, metadata captured, exported to a file.

What Makes It Different

  • Dual extraction engine — trafilatura for speed, readability-lxml as fallback. You get content, not boilerplate (see the sketch after this list).
  • Smart JS rendering — Only fires up a browser when static HTML doesn't have enough content. Saves time without missing SPAs.
  • PDF pipeline — Finds PDFs during crawl, extracts text with OCR support. No separate tool needed.
  • Image deduplication — Downloads images with two-tier dedup (URL + content hash). No duplicates across runs.
  • Resumable crawls — Ctrl+C anytime. Resume later with --resume. State is saved automatically.
  • Ethical by default — Respects robots.txt, rate-limits per domain, backs off on 429/503 errors.
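
The extraction fallback in the first bullet can be pictured like this. A minimal Python sketch using trafilatura and readability-lxml directly; the extract_text helper and the "readability" method label are illustrative, not CrawlVox's internals:

import lxml.html
import trafilatura
from readability import Document

def extract_text(html: str) -> tuple[str, str]:
    """Return (text, extraction_method), mirroring the JSONL field."""
    text = trafilatura.extract(html)
    if text:
        return text, "trafilatura"
    # Fallback: readability-lxml returns cleaned HTML for the main article,
    # so strip it down to plain text.
    summary_html = Document(html).summary()
    return lxml.html.fromstring(summary_html).text_content(), "readability"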

Features

Crawling

  • BFS traversal with depth control (1-20 levels)
  • 1-100 concurrent workers
  • Same-domain or cross-domain scope
  • URL allowlist/denylist via regex
  • Infinite trap avoidance (calendar, faceted search)
  • sitemap.xml seeding for efficient discovery

Content Extraction

  • Clean article text via trafilatura + readability
  • Title, description, canonical URL
  • OpenGraph & Twitter Card metadata
  • Language detection
  • Link extraction with anchor text & classification
  • Image extraction with alt text & dimensions

JavaScript & Documents

  • Playwright-based rendering (off / auto / always)
  • Resource blocking for 60-80% speed boost
  • PDF text extraction via Docling
  • OCR support (auto-detect or force)
  • Multi-language OCR (en, fr, de, es, and more)

Storage & Export

  • SQLite with WAL mode for async safety
  • Normalized URL deduplication
  • Export to JSONL, CSV, or Markdown
  • Run history with status tracking
  • Filter exports by run, URL pattern, or status

Reliability

  • Resumable crawls with database-backed state
  • Per-domain rate limiting (configurable)
  • Exponential backoff on 429/503
  • Retry logic with tenacity (3 attempts; see the sketch after this list)
  • Graceful shutdown — completes in-flight pages
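
The backoff-and-retry behaviour above can be approximated with tenacity roughly as follows. This is a sketch under the stated defaults (3 attempts, 429/503 treated as transient); the fetch helper is hypothetical, not CrawlVox's fetcher:

import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def _is_transient(exc: BaseException) -> bool:
    # Only retry rate-limit (429) and temporary-unavailable (503) responses.
    return (
        isinstance(exc, httpx.HTTPStatusError)
        and exc.response.status_code in (429, 503)
    )

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=30),  # exponential backoff, capped
    retry=retry_if_exception(_is_transient),
    reraise=True,
)
def fetch(url: str) -> str:
    resp = httpx.get(url, timeout=30.0)
    resp.raise_for_status()
    return resp.text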

Ethics & Safety

  • robots.txt respected by default
  • Crawl-delay auto-adjustment
  • Rate limiting prevents server overload (see the sketch after this list)
  • Configurable User-Agent header
  • Same-domain scope by default
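
Per-domain rate limiting boils down to spacing requests to the same host. A rough asyncio sketch; the DomainRateLimiter class is illustrative, not CrawlVox's scheduler:

import asyncio
import time
from urllib.parse import urlsplit

class DomainRateLimiter:
    def __init__(self, rate: float = 2.0):  # mirrors --rate-limit
        self.min_interval = 1.0 / rate
        self._last: dict[str, float] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def wait(self, url: str) -> None:
        """Sleep just long enough to keep this host under the rate limit."""
        host = urlsplit(url).netloc
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            now = time.monotonic()
            delay = self._last.get(host, 0.0) + self.min_interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last[host] = time.monotonic()

# usage: limiter = DomainRateLimiter(rate=2.0); await limiter.wait(url) before each request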

Installation

Requirements: Python 3.11+

# Clone the repository
git clone https://github.com/your-username/crawlvox.git
cd crawlvox

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate        # Linux/macOS
source .venv/Scripts/activate    # Windows (Git Bash)
.venv\Scripts\activate           # Windows (CMD)

# Install CrawlVox
pip install -e .

# Install Playwright browser (for JavaScript rendering)
playwright install chromium

Quick Start

# Crawl a website (100 pages by default)
crawlvox crawl https://example.com

# Crawl with limits and auto-export
crawlvox crawl https://example.com --max-pages 50 --max-depth 2 --export-jsonl results.jsonl

# Enable JavaScript rendering for SPAs
crawlvox crawl https://spa-site.com --dynamic always

# Download images too
crawlvox crawl https://example.com --download-images

# Process PDFs found during crawl
crawlvox crawl https://example.com --process-documents

# Resume an interrupted crawl
crawlvox crawl https://example.com --resume

# Check crawl history
crawlvox status

# Export data from database
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl

Commands

crawlvox crawl

Crawl one or more websites and extract content.

crawlvox crawl [OPTIONS] URLS...

Crawling Options

Option Default Description
-w, --workers 10 Concurrent workers (1-100)
-p, --max-pages 100 Maximum pages to crawl
--max-depth 3 Maximum link depth (1-20)
-t, --timeout 30.0 Request timeout in seconds
-r, --rate-limit 2.0 Max requests/second per domain
--no-robots off Disable robots.txt respect
--same-domain / --cross-domain same-domain Domain scope control
-i, --include none Regex allowlist for URLs (repeatable)
-e, --exclude none Regex denylist for URLs (repeatable)
--user-agent CrawlVox/0.1 HTTP User-Agent header
--cookie-file none Path to cookie file (LWP format)

JavaScript Rendering

Option Default Description
--dynamic auto off = static only, auto = fallback on low content, always = render all
--min-content-length 200 Character threshold before triggering JS fallback

# Force JS rendering on all pages
crawlvox crawl https://spa-site.com --dynamic always

# Static-only (fastest)
crawlvox crawl https://static-site.com --dynamic off
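
Conceptually, auto mode works like the sketch below: fetch statically, extract, and only fall back to a browser when the text is shorter than --min-content-length. The fetch_page helper is illustrative and uses httpx, trafilatura, and Playwright directly rather than CrawlVox internals:

import httpx
import trafilatura
from playwright.sync_api import sync_playwright

MIN_CONTENT_LENGTH = 200  # mirrors --min-content-length

def fetch_page(url: str) -> tuple[str, str]:
    """Return (html, fetch_method) for a URL."""
    html = httpx.get(url, follow_redirects=True, timeout=30.0).text
    text = trafilatura.extract(html) or ""
    if len(text) >= MIN_CONTENT_LENGTH:
        return html, "static"
    # Static HTML was too thin -- fall back to a headless browser.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        rendered = page.content()
        browser.close()
    return rendered, "dynamic"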

Image Downloading

Option Default Description
--download-images off Enable binary image downloading
--image-dir images Directory for downloaded images
--image-scope same-domain same-domain or all (includes CDNs)
--max-image-size 10mb Max image file size

# Download images from any source
crawlvox crawl https://example.com --download-images --image-scope all
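
The two-tier image deduplication mentioned earlier (URL first, then content hash) can be sketched roughly as follows; the download_image helper and the on-disk naming scheme are hypothetical:

import hashlib
from pathlib import Path

import httpx

seen_urls: set[str] = set()
seen_hashes: set[str] = set()

def download_image(url: str, image_dir: Path) -> Path | None:
    if url in seen_urls:          # tier 1: URL-level dedup
        return None
    seen_urls.add(url)

    data = httpx.get(url, timeout=30.0).content
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:     # tier 2: content-hash dedup
        return None
    seen_hashes.add(digest)

    image_dir.mkdir(parents=True, exist_ok=True)
    path = image_dir / f"{digest}.bin"  # hypothetical naming scheme
    path.write_bytes(data)
    return path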

PDF/Document Processing

Option Default Description
--process-documents off Enable PDF text extraction via Docling
--ocr-mode auto off, auto, or always
--ocr-language en OCR language code
--max-document-size 50mb Max PDF file size
--max-document-pages 500 Max PDF page count

# Process PDFs with forced OCR
crawlvox crawl https://example.com --process-documents --ocr-mode always

Output & Resume

Option Default Description
--store-html off Store raw HTML in database
--export-jsonl none Auto-export to JSONL after crawl
-q, --quiet off Suppress progress output
-l, --log-level INFO DEBUG, INFO, WARNING, ERROR
--resume off Resume an interrupted crawl
--recrawl off Re-process pages on resume

# Start a large crawl (Ctrl+C to interrupt safely)
crawlvox crawl https://large-site.com --max-pages 1000

# Resume where you left off
crawlvox crawl https://large-site.com --max-pages 1000 --resume

crawlvox export

Export crawl data to a file.

crawlvox export [OPTIONS]

Formats:

# JSONL (one JSON object per line)
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl

# CSV (flat table)
crawlvox export -d crawlvox.db -f csv -o output.csv

# Markdown (one .md file per page)
crawlvox export -d crawlvox.db -f markdown -o ./pages/

Filtering:

# By run ID
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --run-id abc123

# By URL pattern
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --url-pattern "%/blog/%"

# By status
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status ok
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status error

# Include raw HTML
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --include-html

crawlvox status

Show crawl run history and storage statistics.

crawlvox status
crawlvox status -d myproject.db
crawlvox status --limit 20

crawlvox purge

Delete old crawl runs and their associated data.

# Purge a specific run
crawlvox purge --run abc123def456

# Purge runs older than 7 days
crawlvox purge --older-than 7d

# Purge runs older than 2 weeks
crawlvox purge --older-than 2w

Active (running) crawls are never purged.


Output Format

JSONL

Each line is a self-contained JSON object:

{
  "type": "page",
  "url": "https://example.com/about",
  "final_url": "https://example.com/about",
  "title": "About Us",
  "description": "Learn more about our company",
  "text": "Extracted main content text...",
  "fetched_at": "2025-01-15T10:30:00+00:00",
  "status_code": 200,
  "fetch_method": "static",
  "extraction_method": "trafilatura",
  "canonical_url": "https://example.com/about",
  "og_title": "About Us | Example",
  "og_description": "Learn more about our company",
  "og_image": "https://example.com/og-about.jpg",
  "language": "en",
  "content_type": "text/html",
  "error": null,
  "images": [
    {
      "src": "https://example.com/team.jpg",
      "alt": "Our team",
      "local_path": null
    }
  ]
}
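
Because every record is a standalone object, exports can be streamed line by line, for example:

# Read an exported JSONL file: one JSON object per line.
import json

with open("results.jsonl", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        if record.get("type") == "page" and not record.get("error"):
            print(record["url"], "->", len(record.get("text") or ""), "chars")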

CSV

Flat table with columns: url, final_url, title, description, text, status_code, content_type, fetch_method, extraction_method, language, fetched_at, error.

Markdown

One .md file per page with metadata header and extracted text body.


URL Filtering

Control which URLs get crawled with regex patterns:

# Only crawl blog pages
crawlvox crawl https://example.com -i "/blog/"

# Skip admin and login pages
crawlvox crawl https://example.com -e "/admin/" -e "/login/"

# Combine: product pages only, skip archived
crawlvox crawl https://example.com -i "/products/" -e "/archived/"

Deny patterns (-e) always take priority over allow patterns (-i).
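
That precedence rule is easy to express directly. A minimal sketch (url_allowed is a hypothetical helper, not part of the CLI):

import re

def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    if any(re.search(pat, url) for pat in exclude):
        return False                      # deny patterns always win
    if include:
        return any(re.search(pat, url) for pat in include)
    return True                           # no allowlist means everything passes

# Example: product pages only, skip archived
print(url_allowed("https://example.com/products/widget", ["/products/"], ["/archived/"]))       # True
print(url_allowed("https://example.com/products/archived/old", ["/products/"], ["/archived/"]))  # False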


Database

All crawl data is stored in a local SQLite database (default: crawlvox.db) using WAL mode for safe concurrent access.

Tables:

Table Purpose
pages URL, status code, extracted text, metadata, timestamps
links Source page, target URL, anchor text, internal/external classification
images Source page, image URL, alt text, local file path, SHA256 hash
documents PDF/document processing results
runs Crawl run metadata, status, configuration
run_pages Maps pages to the runs that crawled them

Query directly with any SQLite client:

sqlite3 crawlvox.db "SELECT url, title FROM pages LIMIT 10"
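
The same query works from Python's standard library. A small read-only sketch against the default crawlvox.db; only the url and title columns used in the example above are assumed:

import sqlite3

# Open the crawl database read-only and list a few crawled pages.
conn = sqlite3.connect("file:crawlvox.db?mode=ro", uri=True)
conn.row_factory = sqlite3.Row

for row in conn.execute("SELECT url, title FROM pages LIMIT 10"):
    print(row["url"], "-", row["title"])

conn.close()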

Architecture

URL Input
    |
    v
[Scope Checker] --> robots.txt / sitemap.xml
    |
    v
[Worker Pool] ---> [Fetcher] ---> httpx (static)
    |                  |
    |                  +---------> [Playwright] (dynamic fallback)
    |
    v
[Content Extractor] --> trafilatura (primary)
    |                       |
    |                       +--> readability-lxml (fallback)
    |
    +---> [Metadata Extractor] --> OG, Twitter, canonical, lang
    +---> [Link Extractor] ------> internal/external classification
    +---> [Image Extractor] -----> optional binary download + dedup
    +---> [Document Processor] --> PDF text extraction + OCR
    |
    v
[SQLite Storage] --> WAL mode, normalized URLs, run tracking
    |
    v
[Export] --> JSONL / CSV / Markdown
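
For orientation, here is a toy version of the worker-pool stage of that pipeline, using asyncio, httpx, and lxml directly. It illustrates the pattern only; it is not CrawlVox's scheduler and it omits robots.txt handling, rate limiting, depth tracking, extraction, and storage:

import asyncio
from urllib.parse import urljoin, urlsplit

import httpx
from lxml import html as lxml_html

async def crawl(start_url: str, max_pages: int = 100, workers: int = 10) -> set[str]:
    frontier: asyncio.Queue[str] = asyncio.Queue()
    frontier.put_nowait(start_url)
    visited: set[str] = set()
    domain = urlsplit(start_url).netloc

    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:

        async def worker() -> None:
            while len(visited) < max_pages:
                try:
                    url = await asyncio.wait_for(frontier.get(), timeout=5.0)
                except asyncio.TimeoutError:
                    return  # frontier drained, stop this worker
                if url in visited:
                    continue
                visited.add(url)
                try:
                    resp = await client.get(url)
                except httpx.HTTPError:
                    continue
                if "text/html" not in resp.headers.get("content-type", "") or not resp.text:
                    continue
                for href in lxml_html.fromstring(resp.text).xpath("//a/@href"):
                    link = urljoin(url, href).split("#")[0]
                    if urlsplit(link).netloc == domain and link not in visited:
                        frontier.put_nowait(link)  # same-domain scope

        await asyncio.gather(*(worker() for _ in range(workers)))

    return visited

# e.g. print(asyncio.run(crawl("https://example.com", max_pages=20)))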

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with verbose output
pytest -v

# Run a specific test file
pytest tests/test_fetcher.py

# Type checking
mypy src/crawlvox/

Responsible Use

CrawlVox is designed with ethical crawling in mind:

  • robots.txt is respected by default — disable only when you have explicit permission
  • Rate limiting prevents server overload — default 2 req/sec per domain
  • Same-domain scope prevents unintended cross-site crawling
  • Crawl-delay directives from robots.txt are automatically honored

Please use this tool responsibly:

  1. Only crawl websites you have permission to access
  2. Respect robots.txt directives and website terms of service
  3. Use appropriate rate limits to avoid overloading servers
  4. Comply with all applicable laws in your jurisdiction, including GDPR, CCPA, CFAA, and local equivalents
  5. Do not use extracted content in ways that violate copyright or intellectual property rights
  6. When in doubt, ask the website owner for permission

This tool is provided for personal, educational, and legitimate research purposes. The authors assume no liability for misuse.


License

MIT


Built with Python, asyncio, and respect for the web.
