Web scraping and content extraction tool

CrawlVox

Extract clean, structured content from any website.

A command-line web crawler that handles static HTML, JavaScript-heavy SPAs, and PDFs — built for content migration, web archiving, and research workflows.

Python 3.11+  |  License: MIT  |  Code style: black


Crawl websites at scale  |  Extract articles, metadata, images & PDFs  |  Export to JSONL, CSV or Markdown


Personal Use & Legal Notice

CrawlVox is intended for personal and educational use only. Always respect the laws of your jurisdiction, website terms of service, and robots.txt directives before crawling any website. The authors are not responsible for misuse. By using this software, you agree to comply with all applicable local, national, and international laws, including but not limited to data protection regulations (GDPR, CCPA), computer fraud laws, and intellectual property rights. Do not use this tool to scrape websites without authorization.


Why CrawlVox?

Most web scrapers give you raw HTML and leave you to figure out the content. CrawlVox gives you clean, readable text with full metadata — ready to use.

crawlvox crawl https://example.com --max-pages 50 --export-jsonl results.jsonl

That's it. 50 pages crawled, content extracted, metadata captured, exported to a file.

What Makes It Different

  • Dual extraction engine — trafilatura for speed, readability-lxml as fallback. You get content, not boilerplate (see the sketch after this list).
  • Smart JS rendering — Only fires up a browser when static HTML doesn't have enough content. Saves time without missing SPAs.
  • PDF pipeline — Finds PDFs during crawl, extracts text with OCR support. No separate tool needed.
  • Image deduplication — Downloads images with two-tier dedup (URL + content hash). No duplicates across runs.
  • Resumable crawls — Ctrl+C anytime. Resume later with --resume. State is saved automatically.
  • Ethical by default — Respects robots.txt, rate-limits per domain, backs off on 429/503 errors.
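
The extraction fallback in the first bullet can be pictured like this. A minimal Python sketch using trafilatura and readability-lxml directly; the extract_text helper and the "readability" method label are illustrative, not CrawlVox's internals:

import lxml.html
import trafilatura
from readability import Document

def extract_text(html: str) -> tuple[str, str]:
    """Return (text, extraction_method), mirroring the JSONL field."""
    text = trafilatura.extract(html)
    if text:
        return text, "trafilatura"
    # Fallback: readability-lxml returns cleaned HTML for the main article,
    # so strip it down to plain text.
    summary_html = Document(html).summary()
    return lxml.html.fromstring(summary_html).text_content(), "readability"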

Features

Crawling

  • BFS traversal with depth control (1-20 levels)
  • 1-100 concurrent workers
  • Same-domain or cross-domain scope
  • URL allowlist/denylist via regex
  • Infinite trap avoidance (calendar, faceted search)
  • sitemap.xml seeding for efficient discovery

Content Extraction

  • Clean article text via trafilatura + readability
  • Title, description, canonical URL
  • OpenGraph & Twitter Card metadata
  • Language detection
  • Link extraction with anchor text & classification
  • Image extraction with alt text & dimensions

JavaScript & Documents

  • Playwright-based rendering (off / auto / always)
  • Resource blocking for 60-80% speed boost
  • PDF text extraction via Docling
  • OCR support (auto-detect or force)
  • Multi-language OCR (en, fr, de, es, and more)

Storage & Export

  • SQLite with WAL mode for async safety
  • Normalized URL deduplication
  • Export to JSONL, CSV, or Markdown
  • Run history with status tracking
  • Filter exports by run, URL pattern, or status

Reliability

  • Resumable crawls with database-backed state
  • Per-domain rate limiting (configurable)
  • Exponential backoff on 429/503
  • Retry logic with tenacity (3 attempts; see the sketch after this list)
  • Graceful shutdown — completes in-flight pages
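
The backoff-and-retry behaviour above can be approximated with tenacity roughly as follows. This is a sketch under the stated defaults (3 attempts, 429/503 treated as transient); the fetch helper is hypothetical, not CrawlVox's fetcher:

import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def _is_transient(exc: BaseException) -> bool:
    # Only retry rate-limit (429) and temporary-unavailable (503) responses.
    return (
        isinstance(exc, httpx.HTTPStatusError)
        and exc.response.status_code in (429, 503)
    )

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=30),  # exponential backoff, capped
    retry=retry_if_exception(_is_transient),
    reraise=True,
)
def fetch(url: str) -> str:
    resp = httpx.get(url, timeout=30.0)
    resp.raise_for_status()
    return resp.text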

Ethics & Safety

  • robots.txt respected by default
  • Crawl-delay auto-adjustment
  • Rate limiting prevents server overload (see the sketch after this list)
  • Configurable User-Agent header
  • Same-domain scope by default
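
Per-domain rate limiting boils down to spacing requests to the same host. A rough asyncio sketch; the DomainRateLimiter class is illustrative, not CrawlVox's scheduler:

import asyncio
import time
from urllib.parse import urlsplit

class DomainRateLimiter:
    def __init__(self, rate: float = 2.0):  # mirrors --rate-limit
        self.min_interval = 1.0 / rate
        self._last: dict[str, float] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def wait(self, url: str) -> None:
        """Sleep just long enough to keep this host under the rate limit."""
        host = urlsplit(url).netloc
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            now = time.monotonic()
            delay = self._last.get(host, 0.0) + self.min_interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last[host] = time.monotonic()

# usage: limiter = DomainRateLimiter(rate=2.0); await limiter.wait(url) before each request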

Installation

Requirements: Python 3.11+

# Clone the repository
git clone https://github.com/your-username/crawlvox.git
cd crawlvox

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate        # Linux/macOS
source .venv/Scripts/activate    # Windows (Git Bash)
.venv\Scripts\activate           # Windows (CMD)

# Install CrawlVox
pip install -e .

# Install Playwright browser (for JavaScript rendering)
playwright install chromium

Quick Start

# Crawl a website (100 pages by default)
crawlvox crawl https://example.com

# Crawl with limits and auto-export
crawlvox crawl https://example.com --max-pages 50 --max-depth 2 --export-jsonl results.jsonl

# Enable JavaScript rendering for SPAs
crawlvox crawl https://spa-site.com --dynamic always

# Download images too
crawlvox crawl https://example.com --download-images

# Process PDFs found during crawl
crawlvox crawl https://example.com --process-documents

# Resume an interrupted crawl
crawlvox crawl https://example.com --resume

# Check crawl history
crawlvox status

# Export data from database
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl

Commands

crawlvox crawl

Crawl one or more websites and extract content.

crawlvox crawl [OPTIONS] URLS...

Crawling Options

Option Default Description
-w, --workers 10 Concurrent workers (1-100)
-p, --max-pages 100 Maximum pages to crawl
--max-depth 3 Maximum link depth (1-20)
-t, --timeout 30.0 Request timeout in seconds
-r, --rate-limit 2.0 Max requests/second per domain
--no-robots off Disable robots.txt respect
--same-domain / --cross-domain same-domain Domain scope control
-i, --include none Regex allowlist for URLs (repeatable)
-e, --exclude none Regex denylist for URLs (repeatable)
--user-agent CrawlVox/0.1 HTTP User-Agent header
--cookie-file none Path to cookie file (LWP format)

JavaScript Rendering

Option Default Description
--dynamic auto off = static only, auto = fallback on low content, always = render all
--min-content-length 200 Character threshold before triggering JS fallback

# Force JS rendering on all pages
crawlvox crawl https://spa-site.com --dynamic always

# Static-only (fastest)
crawlvox crawl https://static-site.com --dynamic off
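
Conceptually, auto mode works like the sketch below: fetch statically, extract, and only fall back to a browser when the text is shorter than --min-content-length. The fetch_page helper is illustrative and uses httpx, trafilatura, and Playwright directly rather than CrawlVox internals:

import httpx
import trafilatura
from playwright.sync_api import sync_playwright

MIN_CONTENT_LENGTH = 200  # mirrors --min-content-length

def fetch_page(url: str) -> tuple[str, str]:
    """Return (html, fetch_method) for a URL."""
    html = httpx.get(url, follow_redirects=True, timeout=30.0).text
    text = trafilatura.extract(html) or ""
    if len(text) >= MIN_CONTENT_LENGTH:
        return html, "static"
    # Static HTML was too thin -- fall back to a headless browser.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        rendered = page.content()
        browser.close()
    return rendered, "dynamic"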

Image Downloading

Option Default Description
--download-images off Enable binary image downloading
--image-dir images Directory for downloaded images
--image-scope same-domain same-domain or all (includes CDNs)
--max-image-size 10mb Max image file size

# Download images from any source
crawlvox crawl https://example.com --download-images --image-scope all
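
The two-tier image deduplication mentioned earlier (URL first, then content hash) can be sketched roughly as follows; the download_image helper and the on-disk naming scheme are hypothetical:

import hashlib
from pathlib import Path

import httpx

seen_urls: set[str] = set()
seen_hashes: set[str] = set()

def download_image(url: str, image_dir: Path) -> Path | None:
    if url in seen_urls:          # tier 1: URL-level dedup
        return None
    seen_urls.add(url)

    data = httpx.get(url, timeout=30.0).content
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:     # tier 2: content-hash dedup
        return None
    seen_hashes.add(digest)

    image_dir.mkdir(parents=True, exist_ok=True)
    path = image_dir / f"{digest}.bin"  # hypothetical naming scheme
    path.write_bytes(data)
    return path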

PDF/Document Processing

Option Default Description
--process-documents off Enable PDF text extraction via Docling
--ocr-mode auto off, auto, or always
--ocr-language en OCR language code
--max-document-size 50mb Max PDF file size
--max-document-pages 500 Max PDF page count

# Process PDFs with forced OCR
crawlvox crawl https://example.com --process-documents --ocr-mode always

Output & Resume

Option Default Description
--store-html off Store raw HTML in database
--export-jsonl none Auto-export to JSONL after crawl
-q, --quiet off Suppress progress output
-l, --log-level INFO DEBUG, INFO, WARNING, ERROR
--resume off Resume an interrupted crawl
--recrawl off Re-process pages on resume

# Start a large crawl (Ctrl+C to interrupt safely)
crawlvox crawl https://large-site.com --max-pages 1000

# Resume where you left off
crawlvox crawl https://large-site.com --max-pages 1000 --resume

crawlvox export

Export crawl data to a file.

crawlvox export [OPTIONS]

Formats:

# JSONL (one JSON object per line)
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl

# CSV (flat table)
crawlvox export -d crawlvox.db -f csv -o output.csv

# Markdown (one .md file per page)
crawlvox export -d crawlvox.db -f markdown -o ./pages/

Filtering:

# By run ID
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --run-id abc123

# By URL pattern
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --url-pattern "%/blog/%"

# By status
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status ok
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status error

# Include raw HTML
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --include-html

crawlvox status

Show crawl run history and storage statistics.

crawlvox status
crawlvox status -d myproject.db
crawlvox status --limit 20

crawlvox purge

Delete old crawl runs and their associated data.

# Purge a specific run
crawlvox purge --run abc123def456

# Purge runs older than 7 days
crawlvox purge --older-than 7d

# Purge runs older than 2 weeks
crawlvox purge --older-than 2w

Active (running) crawls are never purged.


Output Format

JSONL

Each line is a self-contained JSON object:

{
  "type": "page",
  "url": "https://example.com/about",
  "final_url": "https://example.com/about",
  "title": "About Us",
  "description": "Learn more about our company",
  "text": "Extracted main content text...",
  "fetched_at": "2025-01-15T10:30:00+00:00",
  "status_code": 200,
  "fetch_method": "static",
  "extraction_method": "trafilatura",
  "canonical_url": "https://example.com/about",
  "og_title": "About Us | Example",
  "og_description": "Learn more about our company",
  "og_image": "https://example.com/og-about.jpg",
  "language": "en",
  "content_type": "text/html",
  "error": null,
  "images": [
    {
      "src": "https://example.com/team.jpg",
      "alt": "Our team",
      "local_path": null
    }
  ]
}
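
Because every record is a standalone object, exports can be streamed line by line, for example:

# Read an exported JSONL file: one JSON object per line.
import json

with open("results.jsonl", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        if record.get("type") == "page" and not record.get("error"):
            print(record["url"], "->", len(record.get("text") or ""), "chars")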

CSV

Flat table with columns: url, final_url, title, description, text, status_code, content_type, fetch_method, extraction_method, language, fetched_at, error.

Markdown

One .md file per page with metadata header and extracted text body.


URL Filtering

Control which URLs get crawled with regex patterns:

# Only crawl blog pages
crawlvox crawl https://example.com -i "/blog/"

# Skip admin and login pages
crawlvox crawl https://example.com -e "/admin/" -e "/login/"

# Combine: product pages only, skip archived
crawlvox crawl https://example.com -i "/products/" -e "/archived/"

Deny patterns (-e) always take priority over allow patterns (-i).
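
That precedence rule is easy to express directly. A minimal sketch (url_allowed is a hypothetical helper, not part of the CLI):

import re

def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    if any(re.search(pat, url) for pat in exclude):
        return False                      # deny patterns always win
    if include:
        return any(re.search(pat, url) for pat in include)
    return True                           # no allowlist means everything passes

# Example: product pages only, skip archived
print(url_allowed("https://example.com/products/widget", ["/products/"], ["/archived/"]))       # True
print(url_allowed("https://example.com/products/archived/old", ["/products/"], ["/archived/"]))  # False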


Database

All crawl data is stored in a local SQLite database (default: crawlvox.db) using WAL mode for safe concurrent access.

Tables:

Table Purpose
pages URL, status code, extracted text, metadata, timestamps
links Source page, target URL, anchor text, internal/external classification
images Source page, image URL, alt text, local file path, SHA256 hash
documents PDF/document processing results
runs Crawl run metadata, status, configuration
run_pages Maps pages to the runs that crawled them

Query directly with any SQLite client:

sqlite3 crawlvox.db "SELECT url, title FROM pages LIMIT 10"
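
The same query works from Python's standard library. A small read-only sketch against the default crawlvox.db; only the url and title columns used in the example above are assumed:

import sqlite3

# Open the crawl database read-only and list a few crawled pages.
conn = sqlite3.connect("file:crawlvox.db?mode=ro", uri=True)
conn.row_factory = sqlite3.Row

for row in conn.execute("SELECT url, title FROM pages LIMIT 10"):
    print(row["url"], "-", row["title"])

conn.close()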

Architecture

URL Input
    |
    v
[Scope Checker] --> robots.txt / sitemap.xml
    |
    v
[Worker Pool] ---> [Fetcher] ---> httpx (static)
    |                  |
    |                  +---------> [Playwright] (dynamic fallback)
    |
    v
[Content Extractor] --> trafilatura (primary)
    |                       |
    |                       +--> readability-lxml (fallback)
    |
    +---> [Metadata Extractor] --> OG, Twitter, canonical, lang
    +---> [Link Extractor] ------> internal/external classification
    +---> [Image Extractor] -----> optional binary download + dedup
    +---> [Document Processor] --> PDF text extraction + OCR
    |
    v
[SQLite Storage] --> WAL mode, normalized URLs, run tracking
    |
    v
[Export] --> JSONL / CSV / Markdown
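
For orientation, here is a toy version of the worker-pool stage of that pipeline, using asyncio, httpx, and lxml directly. It illustrates the pattern only; it is not CrawlVox's scheduler and it omits robots.txt handling, rate limiting, depth tracking, extraction, and storage:

import asyncio
from urllib.parse import urljoin, urlsplit

import httpx
from lxml import html as lxml_html

async def crawl(start_url: str, max_pages: int = 100, workers: int = 10) -> set[str]:
    frontier: asyncio.Queue[str] = asyncio.Queue()
    frontier.put_nowait(start_url)
    visited: set[str] = set()
    domain = urlsplit(start_url).netloc

    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:

        async def worker() -> None:
            while len(visited) < max_pages:
                try:
                    url = await asyncio.wait_for(frontier.get(), timeout=5.0)
                except asyncio.TimeoutError:
                    return  # frontier drained, stop this worker
                if url in visited:
                    continue
                visited.add(url)
                try:
                    resp = await client.get(url)
                except httpx.HTTPError:
                    continue
                if "text/html" not in resp.headers.get("content-type", "") or not resp.text:
                    continue
                for href in lxml_html.fromstring(resp.text).xpath("//a/@href"):
                    link = urljoin(url, href).split("#")[0]
                    if urlsplit(link).netloc == domain and link not in visited:
                        frontier.put_nowait(link)  # same-domain scope

        await asyncio.gather(*(worker() for _ in range(workers)))

    return visited

# e.g. print(asyncio.run(crawl("https://example.com", max_pages=20)))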

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with verbose output
pytest -v

# Run a specific test file
pytest tests/test_fetcher.py

# Type checking
mypy src/crawlvox/

Responsible Use

CrawlVox is designed with ethical crawling in mind:

  • robots.txt is respected by default — disable only when you have explicit permission
  • Rate limiting prevents server overload — default 2 req/sec per domain
  • Same-domain scope prevents unintended cross-site crawling
  • Crawl-delay directives from robots.txt are automatically honored

Please use this tool responsibly:

  1. Only crawl websites you have permission to access
  2. Respect robots.txt directives and website terms of service
  3. Use appropriate rate limits to avoid overloading servers
  4. Comply with all applicable laws in your jurisdiction, including GDPR, CCPA, CFAA, and local equivalents
  5. Do not use extracted content in ways that violate copyright or intellectual property rights
  6. When in doubt, ask the website owner for permission

This tool is provided for personal, educational, and legitimate research purposes. The authors assume no liability for misuse.


License

MIT


Built with Python, asyncio, and respect for the web.
