# mlcrawler

A thin, configurable Python 3.11+ crawler/scraper with both CLI and library API support, featuring sitemap and seed-URL modes.
## Features
- Library API: Use as a Python library with callback-based processing
- Command Line: Full-featured CLI for standalone crawling
- Sitemap mode: Automatically discover and parse XML sitemaps
- Seed mode: Crawl from starting URLs with link following
- Content extraction: Extract main articles using trafilatura or full page content
- Markdown output: Convert HTML to clean Markdown files
- Metadata tracking: Store fetch metadata in JSON sidecar files
- Configurable: TOML configuration with multi-file merging
- Polite crawling: Respects rate limits and concurrency controls (robots.txt obedience is planned for a later milestone)
- Extensible: Modular architecture with event hooks
## Installation

### As a Library

```bash
# Install from PyPI (once published)
pip install mlcrawler

# Or with uv
uv add mlcrawler

# Or install locally in development mode
uv pip install -e .
```
### For CLI Usage

This project uses uv for dependency management:

```bash
# Clone the repository
git clone <repository-url>
cd mlcrawler

# Install dependencies
uv sync
```
## Quick Start

### Library API

```python
import asyncio

from mlcrawler import Crawler, Page


async def process_page(page: Page):
    """Called for each crawled page."""
    print(f"{page.title}: {page.url}")
    print(f"Content: {len(page.markdown)} chars")


async def main():
    crawler = Crawler(
        max_depth=2,
        max_pages=50,
        follow_links=True,
    )
    await crawler.crawl(
        "https://example.com",
        callback=process_page,
    )


asyncio.run(main())
```
Full Library API Documentation →
### Command Line Interface

```bash
# Crawl using a sitemap
uv run mlcrawler --sitemap https://example.com/sitemap.xml

# Crawl from seed URLs with link following
uv run mlcrawler --url https://example.com --follow --max-depth 2

# Use configuration file
uv run mlcrawler --config mlcrawler.toml
```
## Configuration

mlcrawler uses Dynaconf for unified configuration management with support for:

- Configuration files (TOML, JSON, YAML)
- Environment variables (prefix: `MLCRAWLER_`)
- CLI overrides

Configuration precedence (highest to lowest):

1. CLI arguments (`--url`, `--sitemap`, `--output`, etc.)
2. Environment variables (`MLCRAWLER_MODE`, `MLCRAWLER_USER_AGENT`, etc.)
3. Configuration files (later files override earlier ones)
4. Built-in defaults
### Using Multiple Configuration Sources

```bash
# Use config files with CLI overrides
uv run mlcrawler crawl --config defaults.toml --config site-specific.toml --output ./my-output

# Environment variables override config files
export MLCRAWLER_USER_AGENT="MyBot/1.0"
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000
uv run mlcrawler crawl --config mysite.toml

# CLI args override everything
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml
```
### Example Configuration

```toml
# Crawl mode: 'sitemap' or 'seed'
mode = "sitemap"

# User agent string
user_agent = "mlcrawler/0.1 (+https://yoursite.com/contact)"

# Seed URLs for sitemap discovery
seeds = ["https://example.com"]

[limits]
max_pages = 100  # 0 = unlimited

[concurrency]
global = 8
per_host = 4

[rate_limit]
per_host_delay_ms = 500

[output]
dir = "output"
metadata_backend = "json"

[sitemap]
# url = "https://example.com/sitemap.xml"  # Optional
use_lastmod = true

[extract]
main_article = false  # Use trafilatura for main content extraction
```
See `examples/defaults.toml` and `examples/site.example.toml` for complete configuration examples.
### Environment Variables

All configuration options can be set via environment variables using the `MLCRAWLER_` prefix:

```bash
# Basic settings
export MLCRAWLER_MODE=sitemap
export MLCRAWLER_USER_AGENT="MyBot/1.0 (+https://mysite.com)"
export MLCRAWLER_SAME_DOMAIN_ONLY=true

# Nested settings use double underscores
export MLCRAWLER_OUTPUT__DIR=./my-output
export MLCRAWLER_OUTPUT__METADATA_BACKEND=json
export MLCRAWLER_CONCURRENCY__GLOBAL=4
export MLCRAWLER_CONCURRENCY__PER_HOST=2
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000

# Arrays can be comma-separated
export MLCRAWLER_SEEDS="https://example.com,https://example.org"
export MLCRAWLER_FILTER__EXTRA_REMOVE="nav,.ads,.sidebar"

# Then run without config files
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml
```
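The double-underscore convention maps flat environment variables onto nested configuration keys. A rough stdlib sketch of that mapping (a hypothetical helper for illustration, not part of mlcrawler's API; Dynaconf handles this internally):

```python
def env_to_config(environ: dict[str, str], prefix: str = "MLCRAWLER_") -> dict:
    """Turn MLCRAWLER_SECTION__KEY=value entries into nested dicts."""
    config: dict = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        # MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS -> ["rate_limit", "per_host_delay_ms"]
        parts = name[len(prefix):].lower().split("__")
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # Comma-separated values become lists, as for MLCRAWLER_SEEDS;
        # other values stay strings here (real loaders also coerce types)
        node[parts[-1]] = raw.split(",") if "," in raw else raw
    return config
```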
## Usage

### Command Line Interface

```bash
# Show help
uv run mlcrawler --help

# Crawl with sitemap URL
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml

# Crawl with seed URLs (seed mode)
uv run mlcrawler crawl --url https://example.com --url https://example.com/page2

# Crawl with configuration file
uv run mlcrawler crawl --config mysite.toml

# Crawl with CLI overrides
uv run mlcrawler crawl --config defaults.toml --sitemap https://example.com/sitemap.xml --output ./my-output

# Verbose logging
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml --verbose

# Multiple configuration files (later ones override earlier)
uv run mlcrawler crawl --config defaults.toml --config site.toml
```
### Configuration Options

- `--config` (`-c`): Configuration file(s) to load (TOML). Can be specified multiple times.
- `--url` (`-u`): Seed URL(s) for crawling (implies seed mode). Can be specified multiple times.
- `--sitemap` (`-s`): Sitemap XML URL (implies sitemap mode).
- `--output` (`-o`): Override the output directory.
- `--verbose` (`-v`): Enable verbose logging.

Note: You cannot specify both `--url` and `--sitemap` in the same command.
## Output Structure

mlcrawler creates the following output structure:

```text
output/
└── example.com/
    ├── index.md           # Converted content
    ├── index.meta.json    # Metadata
    ├── about.md
    ├── about.meta.json
    └── blog/
        ├── post-1.md
        └── post-1.meta.json
```
## Metadata Format

Each `.meta.json` file contains:

```json
{
  "url": "https://example.com/page",
  "title": "Page Title",
  "fetched_at": "2025-01-15T10:30:00",
  "status": 200,
  "content_hash": "sha256-hash",
  "extraction_mode": "article",
  "trafilatura_metadata": {
    "author": "Author Name",
    "date": "2025-01-01"
  }
}
```
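The sidecars make it easy to post-process a crawl without re-parsing HTML. A small illustrative consumer (it only assumes the sidecar naming shown above; `index_crawl` is a hypothetical helper, not part of mlcrawler):

```python
import json
from pathlib import Path


def index_crawl(output_dir: str) -> dict[str, dict]:
    """Build a url -> metadata index from the .meta.json sidecar files."""
    index: dict[str, dict] = {}
    for meta_file in Path(output_dir).rglob("*.meta.json"):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        # Each sidecar sits next to its Markdown file, e.g. about.meta.json / about.md
        markdown = meta_file.with_name(meta_file.name.replace(".meta.json", ".md"))
        meta["markdown_path"] = str(markdown)
        index[meta["url"]] = meta
    return index
```

This could feed, for example, a change-detection step that compares `content_hash` values across crawls.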
## Milestone 1 Features

This is the Milestone 1 implementation, with the following features:

- ✅ Sitemap discovery and parsing (index and urlset)
- ✅ HTTP fetching with basic politeness (rate limiting, concurrency)
- ✅ HTML content extraction (full page + trafilatura main article)
- ✅ Markdown conversion using markdownify
- ✅ File output with metadata sidecars
- ✅ TOML configuration with deep merging
- ✅ CLI interface with Typer

### Coming in Future Milestones

- M2: Seed mode with dynamic link discovery
- M3: Disk caching with ETag/Last-Modified support
- M4: robots.txt obedience and crawl state persistence
## Development

This project follows these principles:

- Always use `uv` (no `pip` or `python -m`)
- Keep modules small, composable, and focused
- Tests assert observable behavior, not internals
- Apache-2.0 license
### Running Tests

```bash
# Install dev dependencies
uv add --dev pytest

# Run tests (when implemented)
uv run pytest
```
### Code Quality

```bash
# Install dev tools
uv add --dev ruff mypy

# Run linting
uv run ruff check .

# Run type checking
uv run mypy src/
```
## License

Apache-2.0
For more examples and advanced usage, see the `examples/` directory.
## File details: mlcrawler-0.0.1.tar.gz

- Size: 108.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4d1d2d77b6867c60f20d4101b7f764edff1491a7c726c81c3a2e1a86a58b07f2` |
| MD5 | `e2a141474779e8d11daf02e8135a1f61` |
| BLAKE2b-256 | `e3d405cc5a400249f17368c4abdff26d43bdd5355c0931bea3f61bcb28c5336a` |
## File details: mlcrawler-0.0.1-py3-none-any.whl

- Size: 32.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `97f55ef3d773834af7386d4c46a88cb2216c17d7fc8710fcc31eee14e7abc732` |
| MD5 | `0861133f9b0d7d3fb60d8e0ac8664e2e` |
| BLAKE2b-256 | `ab5ffad194e10801b5419dd8ff877be755dda53449014ef872763661b0fe7991` |