mlcrawler

A thin, configurable Python 3.11+ crawler/scraper with both CLI and library API support, featuring sitemap and seed-url modes.

Features

  • Library API: Use as a Python library with callback-based processing
  • Command Line: Full-featured CLI for standalone crawling
  • Sitemap mode: Automatically discover and parse XML sitemaps
  • Seed mode: Crawl from starting URLs with link following
  • Content extraction: Extract main articles using trafilatura or full page content
  • Markdown output: Convert HTML to clean Markdown files
  • Metadata tracking: Store fetch metadata in JSON sidecar files
  • Configurable: TOML configuration with multi-file merging
  • Polite crawling: Respects rate limits, robots.txt, and concurrency controls
  • Extensible: Modular architecture with event hooks

Installation

As a Library

# Install from PyPI
pip install mlcrawler

# Or with uv
uv add mlcrawler

# Or install locally in development mode
uv pip install -e .

For CLI Usage

This project uses uv for dependency management:

# Clone the repository
git clone <repository-url>
cd mlcrawler

# Install dependencies
uv sync

Quick Start

Library API

import asyncio
from mlcrawler import Crawler, Page

async def process_page(page: Page):
    """Called for each crawled page."""
    print(f"{page.title}: {page.url}")
    print(f"Content: {len(page.markdown)} chars")

async def main():
    crawler = Crawler(
        max_depth=2,
        max_pages=50,
        follow_links=True,
    )

    await crawler.crawl(
        "https://example.com",
        callback=process_page
    )

asyncio.run(main())

Full Library API Documentation →
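Because the callback runs as each page completes, results can be persisted incrementally. Below is a minimal sketch that writes every page to disk, using only the Crawler arguments and Page attributes shown above (page.url, page.title, page.markdown); the file layout is illustrative, not mlcrawler's own output scheme:

import asyncio
from pathlib import Path
from urllib.parse import urlparse

from mlcrawler import Crawler, Page

OUT = Path("my-pages")

async def save_page(page: Page):
    """Persist each crawled page as a Markdown file named after its URL path."""
    parsed = urlparse(page.url)
    name = parsed.path.strip("/").replace("/", "-") or "index"  # "/" -> index
    target = OUT / parsed.netloc / f"{name}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(f"# {page.title}\n\n{page.markdown}", encoding="utf-8")

async def main():
    crawler = Crawler(max_depth=1, max_pages=10, follow_links=True)
    await crawler.crawl("https://example.com", callback=save_page)

asyncio.run(main())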

Command Line Interface

# Crawl using a sitemap
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml

# Crawl from seed URLs with link following
uv run mlcrawler crawl --url https://example.com --follow --max-depth 2

# Use a configuration file
uv run mlcrawler crawl --config mlcrawler.toml

Configuration

mlcrawler uses Dynaconf for unified configuration management with support for:

  • Configuration files (TOML, JSON, YAML)
  • Environment variables (prefix: MLCRAWLER_)
  • CLI overrides

Configuration Precedence (highest to lowest):

  1. CLI arguments (--url, --sitemap, --output, etc.)
  2. Environment variables (MLCRAWLER_MODE, MLCRAWLER_USER_AGENT, etc.)
  3. Configuration files (later files override earlier ones)
  4. Built-in defaults

Using Multiple Configuration Sources

# Use config files with CLI overrides
uv run mlcrawler crawl --config defaults.toml --config site-specific.toml --output ./my-output

# Environment variables override config files
export MLCRAWLER_USER_AGENT="MyBot/1.0"
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000
uv run mlcrawler crawl --config mysite.toml

# CLI args override everything
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml
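Deep merging means a later file overrides only the keys it actually sets; everything else survives from earlier files. An illustrative pair of files (contents invented for the example, using keys documented below):

# defaults.toml
mode = "sitemap"

[limits]
max_pages = 100

[rate_limit]
per_host_delay_ms = 500

# site-specific.toml (loaded second)
[rate_limit]
per_host_delay_ms = 2000

# Effective configuration after merging:
#   mode = "sitemap", limits.max_pages = 100, rate_limit.per_host_delay_ms = 2000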

Example Configuration

# Crawl mode: 'sitemap' or 'seed'
mode = "sitemap"

# User agent string
user_agent = "mlcrawler/0.1 (+https://yoursite.com/contact)"

# Seed URLs for sitemap discovery
seeds = ["https://example.com"]

[limits]
max_pages = 100  # 0 = unlimited

[concurrency]
global = 8
per_host = 4

[rate_limit]
per_host_delay_ms = 500

[output]
dir = "output"
metadata_backend = "json"

[sitemap]
# url = "https://example.com/sitemap.xml"  # Optional
use_lastmod = true

[extract]
main_article = false  # Use trafilatura for main content extraction

See examples/defaults.toml and examples/site.example.toml for complete configuration examples.
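To preview what main_article = true does to a given page, you can run trafilatura directly, independently of mlcrawler's internal wiring. A minimal sketch using trafilatura's public API:

import trafilatura

# Fetch a page and extract only the main article text,
# dropping navigation, sidebars, and other boilerplate.
downloaded = trafilatura.fetch_url("https://example.com/blog/post-1")
article = trafilatura.extract(downloaded, include_comments=False)
print(article)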

Environment Variables

All configuration options can be set via environment variables using the MLCRAWLER_ prefix:

# Basic settings
export MLCRAWLER_MODE=sitemap
export MLCRAWLER_USER_AGENT="MyBot/1.0 (+https://mysite.com)"
export MLCRAWLER_SAME_DOMAIN_ONLY=true

# Nested settings use double underscores
export MLCRAWLER_OUTPUT__DIR=./my-output
export MLCRAWLER_OUTPUT__METADATA_BACKEND=json
export MLCRAWLER_CONCURRENCY__GLOBAL=4
export MLCRAWLER_CONCURRENCY__PER_HOST=2
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000

# Arrays can be comma-separated
export MLCRAWLER_SEEDS="https://example.com,https://example.org"
export MLCRAWLER_FILTER__EXTRA_REMOVE="nav,.ads,.sidebar"

# Then run without config files
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml
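The double-underscore convention maps directly onto TOML tables: MLCRAWLER_<SECTION>__<KEY> corresponds to key inside [section]. For example, these two settings are equivalent:

# Environment variable
export MLCRAWLER_RATE_LIMIT__PER_HOST_DELAY_MS=1000

# Equivalent TOML
[rate_limit]
per_host_delay_ms = 1000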

Usage

Command Line Interface

# Show help
uv run mlcrawler --help

# Crawl with sitemap URL
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml

# Crawl with seed URLs (seed mode)
uv run mlcrawler crawl --url https://example.com --url https://example.com/page2

# Crawl with configuration file
uv run mlcrawler crawl --config mysite.toml

# Crawl with CLI overrides
uv run mlcrawler crawl --config defaults.toml --sitemap https://example.com/sitemap.xml --output ./my-output

# Verbose logging
uv run mlcrawler crawl --sitemap https://example.com/sitemap.xml --verbose

# Multiple configuration files (later ones override earlier)
uv run mlcrawler crawl --config defaults.toml --config site.toml

Configuration Options

  • --config (-c): Configuration file(s) to load (TOML). Can be specified multiple times.
  • --url (-u): Seed URL(s) for crawling (implies seed mode). Can be specified multiple times.
  • --sitemap (-s): Sitemap XML URL (implies sitemap mode).
  • --output (-o): Override the output directory.
  • --verbose (-v): Enable verbose logging.

Note: You cannot specify both --url and --sitemap in the same command.

Output Structure

mlcrawler creates the following output structure:

output/
└── example.com/
    ├── index.md              # Converted content
    ├── index.meta.json       # Metadata
    ├── about.md
    ├── about.meta.json
    └── blog/
        ├── post-1.md
        └── post-1.meta.json
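The tree implies a straightforward URL-to-path mapping: the host becomes the top-level directory and the URL path becomes nested .md files. A sketch of that mapping (illustrative only, not mlcrawler's actual implementation):

from pathlib import Path
from urllib.parse import urlparse

def output_path(url: str, root: Path = Path("output")) -> Path:
    """Map a URL to the .md file it would occupy in the tree above."""
    parsed = urlparse(url)
    path = parsed.path.strip("/") or "index"  # "/" -> index.md
    return root / parsed.netloc / f"{path}.md"

print(output_path("https://example.com/"))             # output/example.com/index.md
print(output_path("https://example.com/blog/post-1"))  # output/example.com/blog/post-1.md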

Metadata Format

Each .meta.json file contains:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "fetched_at": "2025-01-15T10:30:00",
  "status": 200,
  "content_hash": "sha256-hash",
  "extraction_mode": "article",
  "trafilatura_metadata": {
    "author": "Author Name",
    "date": "2025-01-01"
  }
}
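Since every .md file has a sidecar, downstream tools can audit a crawl without refetching anything. A sketch that walks the output tree and verifies each file against its recorded hash, assuming content_hash is the SHA-256 hex digest of the Markdown body (an assumption worth checking against your own sidecars):

import hashlib
import json
from pathlib import Path

for meta_path in Path("output").rglob("*.meta.json"):
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    md_path = meta_path.with_name(meta_path.name.replace(".meta.json", ".md"))
    # Assumption: content_hash is the SHA-256 hex digest of the Markdown file
    digest = hashlib.sha256(md_path.read_bytes()).hexdigest()
    ok = digest == meta["content_hash"]
    print(f"{meta['url']} [{meta['status']}] hash_ok={ok}")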

Milestone 1 Features

This release implements Milestone 1, with the following features:

  • ✅ Sitemap discovery and parsing (index and urlset)
  • ✅ HTTP fetching with basic politeness (rate limiting, concurrency)
  • ✅ HTML content extraction (full page + trafilatura main article)
  • ✅ Markdown conversion using markdownify
  • ✅ File output with metadata sidecars
  • ✅ TOML configuration with deep merging
  • ✅ CLI interface with Typer

Coming in Future Milestones

  • M2: Seed mode with dynamic link discovery
  • M3: Disk caching with ETag/Last-Modified support
  • M4: robots.txt obedience and crawl state persistence

Development

This project follows these principles:

  • Always use uv (no pip or python -m)
  • Keep modules small, composable, and focused
  • Tests assert observable behavior, not internals
  • Apache-2.0 license

Running Tests

# Install dev dependencies
uv add --dev pytest

# Run tests (when implemented)
uv run pytest

Code Quality

# Install dev tools
uv add --dev ruff mypy

# Run linting
uv run ruff check .

# Run type checking
uv run mypy src/

License

Apache-2.0


For more examples and advanced usage, see the examples/ directory.
