CrawlVox
Extract clean, structured content from any website.
A command-line web crawler that handles static HTML, JavaScript-heavy SPAs, and PDFs — built for content migration, web archiving, and research workflows.
Crawl websites at scale | Extract articles, metadata, images & PDFs | Export to JSONL, CSV or Markdown
Personal Use & Legal Notice
CrawlVox is intended for personal and educational use only. Always respect the laws of your jurisdiction, website terms of service, and robots.txt directives before crawling any website. The authors are not responsible for misuse. By using this software, you agree to comply with all applicable local, national, and international laws, including but not limited to data protection regulations (GDPR, CCPA), computer fraud laws, and intellectual property rights. Do not use this tool to scrape websites without authorization.
Why CrawlVox?
Most web scrapers give you raw HTML and leave you to figure out the content. CrawlVox gives you clean, readable text with full metadata — ready to use.
crawlvox crawl https://example.com --max-pages 50 --export-jsonl results.jsonl
That's it. 50 pages crawled, content extracted, metadata captured, exported to a file.
What Makes It Different
- Dual extraction engine — trafilatura for speed, readability-lxml as fallback. You get content, not boilerplate.
- Smart JS rendering — Only fires up a browser when static HTML doesn't have enough content. Saves time without missing SPAs.
- PDF pipeline — Finds PDFs during crawl, extracts text with OCR support. No separate tool needed.
- Image deduplication — Downloads images with two-tier dedup (URL + content hash). No duplicates across runs.
- Resumable crawls — Ctrl+C anytime. Resume later with --resume; state is saved automatically.
- Ethical by default — Respects robots.txt, rate-limits per domain, backs off on 429/503 errors.
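The two-tier image deduplication described above can be sketched as follows. This is a minimal illustration of the idea (check the URL first, then hash the bytes), not CrawlVox's actual implementation; the class and method names are assumptions:

```python
import hashlib

class ImageDeduper:
    """Two-tier dedup: skip by URL first (cheap), then by content hash."""

    def __init__(self) -> None:
        self.seen_urls: set[str] = set()
        self.seen_hashes: set[str] = set()

    def should_download(self, url: str) -> bool:
        # Tier 1: never re-fetch a URL that was already processed.
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

    def is_duplicate_content(self, data: bytes) -> bool:
        # Tier 2: identical bytes served under a different URL
        # are still duplicates.
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.seen_hashes:
            return True
        self.seen_hashes.add(digest)
        return False

dedup = ImageDeduper()
print(dedup.should_download("https://example.com/a.jpg"))   # True
print(dedup.should_download("https://example.com/a.jpg"))   # False
print(dedup.is_duplicate_content(b"\x89PNG-bytes"))         # False
print(dedup.is_duplicate_content(b"\x89PNG-bytes"))         # True
```

The README's `images` table stores a SHA256 hash per image, which is what makes the content-level tier persist across runs.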
Features
Crawling | Content Extraction | JavaScript & Documents | Storage & Export | Reliability | Ethics & Safety
Installation
Requirements: Python 3.11+
# Clone the repository
git clone https://github.com/your-username/crawlvox.git
cd crawlvox
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
source .venv/Scripts/activate # Windows (Git Bash)
.venv\Scripts\activate # Windows (CMD)
# Install CrawlVox
pip install -e .
# Install Playwright browser (for JavaScript rendering)
playwright install chromium
Quick Start
# Crawl a website (100 pages by default)
crawlvox crawl https://example.com
# Crawl with limits and auto-export
crawlvox crawl https://example.com --max-pages 50 --max-depth 2 --export-jsonl results.jsonl
# Enable JavaScript rendering for SPAs
crawlvox crawl https://spa-site.com --dynamic always
# Download images too
crawlvox crawl https://example.com --download-images
# Process PDFs found during crawl
crawlvox crawl https://example.com --process-documents
# Resume an interrupted crawl
crawlvox crawl https://example.com --resume
# Check crawl history
crawlvox status
# Export data from database
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl
Commands
crawlvox crawl
Crawl one or more websites and extract content.
crawlvox crawl [OPTIONS] URLS...
Crawling Options
| Option | Default | Description |
|---|---|---|
| -w, --workers | 10 | Concurrent workers (1-100) |
| -p, --max-pages | 100 | Maximum pages to crawl |
| --max-depth | 3 | Maximum link depth (1-20) |
| -t, --timeout | 30.0 | Request timeout in seconds |
| -r, --rate-limit | 2.0 | Max requests/second per domain |
| --no-robots | off | Disable robots.txt respect |
| --same-domain / --cross-domain | same-domain | Domain scope control |
| -i, --include | none | Regex allowlist for URLs (repeatable) |
| -e, --exclude | none | Regex denylist for URLs (repeatable) |
| --user-agent | CrawlVox/0.1 | HTTP User-Agent header |
| --cookie-file | none | Path to cookie file (LWP format) |
JavaScript Rendering
| Option | Default | Description |
|---|---|---|
| --dynamic | auto | off = static only, auto = fallback on low content, always = render all |
| --min-content-length | 200 | Character threshold before triggering JS fallback |
# Force JS rendering on all pages
crawlvox crawl https://spa-site.com --dynamic always
# Static-only (fastest)
crawlvox crawl https://static-site.com --dynamic off
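The `auto` mode's decision boils down to a length check against `--min-content-length`. A rough sketch of that heuristic (the real implementation may weigh more signals; the function name is an assumption):

```python
def needs_js_rendering(extracted_text: str, min_content_length: int = 200) -> bool:
    """Mimic --dynamic auto: fall back to a browser only when static
    extraction yields less text than the --min-content-length threshold."""
    return len(extracted_text.strip()) < min_content_length

print(needs_js_rendering("Just a loading spinner."))  # True -> render with browser
print(needs_js_rendering("A" * 500))                  # False -> static HTML sufficed
```

This is why raising `--min-content-length` makes the crawler more eager to render, and lowering it makes static fetching stick more often.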
Image Downloading
| Option | Default | Description |
|---|---|---|
| --download-images | off | Enable binary image downloading |
| --image-dir | images | Directory for downloaded images |
| --image-scope | same-domain | same-domain or all (includes CDNs) |
| --max-image-size | 10mb | Max image file size |
# Download images from any source
crawlvox crawl https://example.com --download-images --image-scope all
PDF/Document Processing
| Option | Default | Description |
|---|---|---|
| --process-documents | off | Enable PDF text extraction via Docling |
| --ocr-mode | auto | off, auto, or always |
| --ocr-language | en | OCR language code |
| --max-document-size | 50mb | Max PDF file size |
| --max-document-pages | 500 | Max PDF page count |
# Process PDFs with forced OCR
crawlvox crawl https://example.com --process-documents --ocr-mode always
Output & Resume
| Option | Default | Description |
|---|---|---|
| --store-html | off | Store raw HTML in database |
| --export-jsonl | none | Auto-export to JSONL after crawl |
| -q, --quiet | off | Suppress progress output |
| -l, --log-level | INFO | DEBUG, INFO, WARNING, ERROR |
| --resume | off | Resume an interrupted crawl |
| --recrawl | off | Re-process pages on resume |
# Start a large crawl (Ctrl+C to interrupt safely)
crawlvox crawl https://large-site.com --max-pages 1000
# Resume where you left off
crawlvox crawl https://large-site.com --max-pages 1000 --resume
crawlvox export
Export crawl data to a file.
crawlvox export [OPTIONS]
Formats:
# JSONL (one JSON object per line)
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl
# CSV (flat table)
crawlvox export -d crawlvox.db -f csv -o output.csv
# Markdown (one .md file per page)
crawlvox export -d crawlvox.db -f markdown -o ./pages/
Filtering:
# By run ID
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --run-id abc123
# By URL pattern
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --url-pattern "%/blog/%"
# By status
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status ok
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status error
# Include raw HTML
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --include-html
crawlvox status
Show crawl run history and storage statistics.
crawlvox status
crawlvox status -d myproject.db
crawlvox status --limit 20
crawlvox purge
Delete old crawl runs and their associated data.
# Purge a specific run
crawlvox purge --run abc123def456
# Purge runs older than 7 days
crawlvox purge --older-than 7d
# Purge runs older than 2 weeks
crawlvox purge --older-than 2w
Active (running) crawls are never purged.
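The `--older-than` values shown above follow a `<number><unit>` shape. A hypothetical parser for just the two units the examples use (`d` for days, `w` for weeks) — the CLI's actual parser and its full set of supported units are not documented here:

```python
import re
from datetime import timedelta

def parse_age(spec: str) -> timedelta:
    """Parse age specs like '7d' or '2w' into a timedelta.
    Hypothetical helper; supports only the units shown in the examples."""
    match = re.fullmatch(r"(\d+)([dw])", spec)
    if not match:
        raise ValueError(f"unrecognized age spec: {spec!r}")
    value, unit = int(match.group(1)), match.group(2)
    return timedelta(days=value * (7 if unit == "w" else 1))

print(parse_age("7d"))  # timedelta of 7 days
print(parse_age("2w"))  # timedelta of 14 days
```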
Output Format
JSONL
Each line is a self-contained JSON object:
{
  "type": "page",
  "url": "https://example.com/about",
  "final_url": "https://example.com/about",
  "title": "About Us",
  "description": "Learn more about our company",
  "text": "Extracted main content text...",
  "fetched_at": "2025-01-15T10:30:00+00:00",
  "status_code": 200,
  "fetch_method": "static",
  "extraction_method": "trafilatura",
  "canonical_url": "https://example.com/about",
  "og_title": "About Us | Example",
  "og_description": "Learn more about our company",
  "og_image": "https://example.com/og-about.jpg",
  "language": "en",
  "content_type": "text/html",
  "error": null,
  "images": [
    {
      "src": "https://example.com/team.jpg",
      "alt": "Our team",
      "local_path": null
    }
  ]
}
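Because each line is a self-contained object, an export can be streamed record by record without loading the whole file. A small consumer sketch (the `results.jsonl` filename and the `iter_pages` helper are illustrative):

```python
import json
from typing import Iterator

def iter_pages(path: str) -> Iterator[dict]:
    """Stream page records from a CrawlVox JSONL export, one line at a time."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("type") == "page":
                yield record

# e.g. titles of successfully fetched pages:
# titles = [p["title"] for p in iter_pages("results.jsonl") if p["status_code"] == 200]
```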
CSV
Flat table with columns: url, final_url, title, description, text, status_code, content_type, fetch_method, extraction_method, language, fetched_at, error.
Markdown
One .md file per page with metadata header and extracted text body.
URL Filtering
Control which URLs get crawled with regex patterns:
# Only crawl blog pages
crawlvox crawl https://example.com -i "/blog/"
# Skip admin and login pages
crawlvox crawl https://example.com -e "/admin/" -e "/login/"
# Combine: product pages only, skip archived
crawlvox crawl https://example.com -i "/products/" -e "/archived/"
Deny patterns (-e) always take priority over allow patterns (-i).
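That precedence rule can be stated in a few lines. A sketch of the logic (not CrawlVox's internals; the function name and empty-allowlist behavior are assumptions): a URL is excluded if any deny pattern matches, and otherwise must match at least one allow pattern when any are given.

```python
import re

def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Deny patterns always win; an empty include list allows everything."""
    if any(re.search(p, url) for p in exclude):
        return False
    if include and not any(re.search(p, url) for p in include):
        return False
    return True

print(url_allowed("https://example.com/products/x",
                  ["/products/"], ["/archived/"]))           # True
print(url_allowed("https://example.com/products/archived/x",
                  ["/products/"], ["/archived/"]))           # False
```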
Database
All crawl data is stored in a local SQLite database (default: crawlvox.db) using WAL mode for safe concurrent access.
Tables:
| Table | Purpose |
|---|---|
| pages | URL, status code, extracted text, metadata, timestamps |
| links | Source page, target URL, anchor text, internal/external classification |
| images | Source page, image URL, alt text, local file path, SHA256 hash |
| documents | PDF/document processing results |
| runs | Crawl run metadata, status, configuration |
| run_pages | Maps pages to the runs that crawled them |
Query directly with any SQLite client:
sqlite3 crawlvox.db "SELECT url, title FROM pages LIMIT 10"
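The same query works from Python's stdlib `sqlite3` driver, which is handy for post-processing. The `pages` table and its `url`/`title` columns come from the schema above; the helper name is illustrative:

```python
import sqlite3

def sample_pages(db_path: str, limit: int = 10) -> list[tuple[str, str]]:
    """Fetch (url, title) pairs from the pages table of a CrawlVox database."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(
            "SELECT url, title FROM pages LIMIT ?", (limit,)
        ).fetchall()
        return [(row["url"], row["title"]) for row in rows]
    finally:
        conn.close()

# for url, title in sample_pages("crawlvox.db"):
#     print(url, "-", title)
```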
Architecture
URL Input
|
v
[Scope Checker] --> robots.txt / sitemap.xml
|
v
[Worker Pool] ---> [Fetcher] ---> httpx (static)
| |
| +---------> [Playwright] (dynamic fallback)
|
v
[Content Extractor] --> trafilatura (primary)
| |
| +--> readability-lxml (fallback)
|
+---> [Metadata Extractor] --> OG, Twitter, canonical, lang
+---> [Link Extractor] ------> internal/external classification
+---> [Image Extractor] -----> optional binary download + dedup
+---> [Document Processor] --> PDF text extraction + OCR
|
v
[SQLite Storage] --> WAL mode, normalized URLs, run tracking
|
v
[Export] --> JSONL / CSV / Markdown
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with verbose output
pytest -v
# Run a specific test file
pytest tests/test_fetcher.py
# Type checking
mypy src/crawlvox/
Responsible Use
CrawlVox is designed with ethical crawling in mind:
- robots.txt is respected by default — disable only when you have explicit permission
- Rate limiting prevents server overload — default 2 req/sec per domain
- Same-domain scope prevents unintended cross-site crawling
- Crawl-delay headers are automatically honored
Please use this tool responsibly:
- Only crawl websites you have permission to access
- Respect robots.txt directives and website terms of service
- Use appropriate rate limits to avoid overloading servers
- Comply with all applicable laws in your jurisdiction, including GDPR, CCPA, CFAA, and local equivalents
- Do not use extracted content in ways that violate copyright or intellectual property rights
- When in doubt, ask the website owner for permission
This tool is provided for personal, educational, and legitimate research purposes. The authors assume no liability for misuse.
License
Built with Python, asyncio, and respect for the web.