mlcrawler

A thin, configurable Python 3.11+ crawler/scraper with both CLI and library API support, featuring sitemap and seed-url modes.

Features

  • Library API: Use as a Python library with callback-based processing
  • Command Line: Full-featured CLI for standalone crawling
  • Sitemap mode: Automatically discover and parse XML sitemaps
  • Seed mode: Crawl from starting URLs with link following
  • Content extraction: Extract main articles using trafilatura or full page content
  • Markdown output: Convert HTML to clean Markdown files
  • Metadata tracking: Store fetch metadata in JSON sidecar files
  • Configurable: TOML configuration with multi-file merging
  • Polite crawling: Respects rate limits, robots.txt, and concurrency controls
  • Extensible: Modular architecture with event hooks

Installation

As a Library

# Install from PyPI
pip install mlcrawler

# Or with uv
uv add mlcrawler

# Or install locally in development mode
uv pip install -e .

For CLI Usage

This project uses uv for dependency management:

# Clone the repository
git clone <repository-url>
cd mlcrawler

# Install dependencies
uv sync

Quick Start

Library API

import asyncio
from mlcrawler import Crawler, Page

async def process_page(page: Page):
    """Called for each crawled page."""
    print(f"{page.title}: {page.url}")
    print(f"Content: {len(page.markdown)} chars")

async def main():
    crawler = Crawler(
        max_depth=2,
        max_pages=50,
        follow_links=True,
    )

    await crawler.crawl(
        "https://example.com",
        callback=process_page
    )

asyncio.run(main())

Full Library API Documentation →
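Because the callback runs as each page completes, results can be persisted incrementally. Below is a minimal sketch that writes every page to disk, using only the Crawler arguments and Page attributes shown above (page.url, page.title, page.markdown); the file layout is illustrative, not mlcrawler's own output scheme:

import asyncio
from pathlib import Path
from urllib.parse import urlparse

from mlcrawler import Crawler, Page

OUT = Path("my-pages")

async def save_page(page: Page):
    """Persist each crawled page as a Markdown file named after its URL path."""
    parsed = urlparse(page.url)
    name = parsed.path.strip("/").replace("/", "-") or "index"  # "/" -> index
    target = OUT / parsed.netloc / f"{name}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(f"# {page.title}\n\n{page.markdown}", encoding="utf-8")

async def main():
    crawler = Crawler(max_depth=1, max_pages=10, follow_links=True)
    await crawler.crawl("https://example.com", callback=save_page)

asyncio.run(main())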

Command Line Interface

# Crawl using a sitemap
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml

# Crawl from seed URLs with link following
uv run mlcrawler crawl --url https://example.com --follow --max-depth 2

# Use a configuration file
uv run mlcrawler crawl --config mlcrawler.toml

Configuration

mlcrawler uses Dynaconf for unified configuration management with support for:

  • Configuration files (TOML, JSON, YAML)
  • Environment variables (prefix: MLCRAWLER_)
  • CLI overrides

Configuration Precedence (highest to lowest):

  1. CLI arguments (--url, --sitemap, --output, etc.)
  2. Environment variables (MLCRAWLER_MODE, MLCRAWLER_USER_AGENT, etc.)
  3. Configuration files (later files override earlier ones)
  4. Built-in defaults

Using Multiple Configuration Sources

# Use config files with CLI overrides
uv run mlcrawler crawl --config defaults.toml --config site-specific.toml --output ./my-output

# Environment variables override config files
export MLCRAWLER_USER_AGENT="MyBot/1.0"
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000
uv run mlcrawler crawl --config mysite.toml

# CLI args override everything
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml
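Deep merging means a later file overrides only the keys it actually sets; everything else survives from earlier files. An illustrative pair of files (contents invented for the example, using keys documented below):

# defaults.toml
mode = "sitemap"

[limits]
max_pages = 100

[rate_limit]
per_host_delay_ms = 500

# site-specific.toml (loaded second)
[rate_limit]
per_host_delay_ms = 2000

# Effective configuration after merging:
#   mode = "sitemap", limits.max_pages = 100, rate_limit.per_host_delay_ms = 2000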

Example Configuration

# Crawl mode: 'sitemap' or 'seed'
mode = "sitemap"

# User agent string
user_agent = "mlcrawler/0.1 (+https://yoursite.com/contact)"

# Seed URLs for sitemap discovery
seeds = ["https://example.com"]

[limits]
max_pages = 100  # 0 = unlimited

[concurrency]
global = 8
per_host = 4

[rate_limit]
per_host_delay_ms = 500

[output]
dir = "output"
metadata_backend = "json"

[sitemap]
# url = "https://example.com/sitemap.xml"  # Optional
use_lastmod = true

[extract]
main_article = false  # Use trafilatura for main content extraction

See examples/defaults.toml and examples/site.example.toml for complete configuration examples.
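To preview what main_article = true does to a given page, you can run trafilatura directly, independently of mlcrawler's internal wiring. A minimal sketch using trafilatura's public API:

import trafilatura

# Fetch a page and extract only the main article text,
# dropping navigation, sidebars, and other boilerplate.
downloaded = trafilatura.fetch_url("https://example.com/blog/post-1")
article = trafilatura.extract(downloaded, include_comments=False)
print(article)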

Environment Variables

All configuration options can be set via environment variables using the MLCRAWLER_ prefix:

# Basic settings
export MLCRAWLER_MODE=sitemap
export MLCRAWLER_USER_AGENT="MyBot/1.0 (+https://mysite.com)"
export MLCRAWLER_SAME_DOMAIN_ONLY=true

# Nested settings use double underscores
export MLCRAWLER_OUTPUT__DIR=./my-output
export MLCRAWLER_OUTPUT__METADATA_BACKEND=json
export MLCRAWLER_CONCURRENCY__GLOBAL=4
export MLCRAWLER_CONCURRENCY__PER_HOST=2
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000

# Arrays can be comma-separated
export MLCRAWLER_SEEDS="https://example.com,https://example.org"
export MLCRAWLER_FILTER__EXTRA_REMOVE="nav,.ads,.sidebar"

# Then run without config files
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml
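The double-underscore convention maps directly onto TOML tables: MLCRAWLER_<SECTION>__<KEY> corresponds to key inside [section]. For example, these two settings are equivalent:

# Environment variable
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000

# Equivalent TOML
[rate_limit]
per_host_delay_ms = 1000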

Usage

Command Line Interface

# Show help
uv run mlcrawler --help

# Crawl with sitemap URL
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml

# Crawl with seed URLs (seed mode)
uv run mlcrawler crawl --url https://example.com --url https://example.com/page2

# Crawl with configuration file
uv run mlcrawler crawl --config mysite.toml

# Crawl with CLI overrides
uv run mlcrawler crawl --config defaults.toml --sitemap https://example.com/sitemap.xml --output ./my-output

# Verbose logging
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml --verbose

# Multiple configuration files (later ones override earlier)
uv run mlcrawler crawl --config defaults.toml --config site.toml

Configuration Options

  • --config (-c): Configuration file(s) to load (TOML). Can be specified multiple times.
  • --url (-u): Seed URL(s) for crawling (implies seed mode). Can be specified multiple times.
  • --sitemap (-s): Sitemap XML URL (implies sitemap mode).
  • --output (-o): Override the output directory.
  • --verbose (-v): Enable verbose logging.

Note: You cannot specify both --url and --sitemap in the same command.

Output Structure

mlcrawler creates the following output structure:

output/
└── example.com/
    ├── index.md              # Converted content
    ├── index.meta.json       # Metadata
    ├── about.md
    ├── about.meta.json
    └── blog/
        ├── post-1.md
        └── post-1.meta.json
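The tree implies a straightforward URL-to-path mapping: the host becomes the top-level directory and the URL path becomes nested .md files. A sketch of that mapping (illustrative only, not mlcrawler's actual implementation):

from pathlib import Path
from urllib.parse import urlparse

def output_path(url: str, root: Path = Path("output")) -> Path:
    """Map a URL to the .md file it would occupy in the tree above."""
    parsed = urlparse(url)
    path = parsed.path.strip("/") or "index"  # "/" -> index.md
    return root / parsed.netloc / f"{path}.md"

print(output_path("https://example.com/"))             # output/example.com/index.md
print(output_path("https://example.com/blog/post-1"))  # output/example.com/blog/post-1.md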

Metadata Format

Each .meta.json file contains:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "fetched_at": "2025-01-15T10:30:00",
  "status": 200,
  "content_hash": "sha256-hash",
  "extraction_mode": "article",
  "trafilatura_metadata": {
    "author": "Author Name",
    "date": "2025-01-01"
  }
}
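Since every .md file has a sidecar, downstream tools can audit a crawl without refetching anything. A sketch that walks the output tree and verifies each file against its recorded hash, assuming content_hash is the SHA-256 hex digest of the Markdown body (an assumption worth checking against your own sidecars):

import hashlib
import json
from pathlib import Path

for meta_path in Path("output").rglob("*.meta.json"):
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    md_path = meta_path.with_name(meta_path.name.replace(".meta.json", ".md"))
    # Assumption: content_hash is the SHA-256 hex digest of the Markdown file
    digest = hashlib.sha256(md_path.read_bytes()).hexdigest()
    ok = digest == meta["content_hash"]
    print(f"{meta['url']} [{meta['status']}] hash_ok={ok}")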

Milestone 1 Features

This release implements Milestone 1, with the following features:

  • ✅ Sitemap discovery and parsing (index and urlset)
  • ✅ HTTP fetching with basic politeness (rate limiting, concurrency)
  • ✅ HTML content extraction (full page + trafilatura main article)
  • ✅ Markdown conversion using markdownify
  • ✅ File output with metadata sidecars
  • ✅ TOML configuration with deep merging
  • ✅ CLI interface with Typer

Coming in Future Milestones

  • M2: Seed mode with dynamic link discovery
  • M3: Disk caching with ETag/Last-Modified support
  • M4: robots.txt obedience and crawl state persistence

Development

This project follows these principles:

  • Always use uv (no pip or python -m)
  • Keep modules small, composable, and focused
  • Tests assert observable behavior, not internals
  • Apache-2.0 license

Running Tests

# Install dev dependencies
uv add --dev pytest

# Run tests (when implemented)
uv run pytest

Code Quality

# Install dev tools
uv add --dev ruff mypy

# Run linting
uv run ruff check .

# Run type checking
uv run mypy src/

License

Apache-2.0


For more examples and advanced usage, see the examples/ directory.
