dl-md

A command-line tool for downloading and converting website content from sitemaps to markdown format with organized directory structure.

Overview

dl-md extracts URLs from XML sitemaps and downloads each page as markdown, automatically organizing the content into a directory structure that mirrors the website's URL hierarchy.

Features

  • Sitemap Processing: Extracts URLs from XML sitemaps using trafilatura
  • Automatic Directory Structure: Creates directories based on URL paths
  • Markdown Conversion: Downloads and converts web pages to clean markdown format
  • Progress Reporting: Shows real-time progress as URLs are processed
  • Dry Run Mode: Preview what would be downloaded without actually fetching content
  • Verbose Output: Detailed logging for troubleshooting
  • Testing: Test suite with 85% code coverage

Installation

Using Poetry (Recommended)

git clone https://github.com/donbowman/dl-md
cd dl-md
poetry install

Using pip

pip install dl-md

Usage

Basic Usage

dl <sitemap-url> [<sitemap-url> ...]

Example

Download every page listed in the 'anyx-guide' and 'ufaq' post-type sitemaps of the Agilicus website:

poetry run dl https://www.agilicus.com/anyx-guide-sitemap.xml https://www.agilicus.com/ufaq-sitemap.xml

or, if installed:

dl https://www.agilicus.com/anyx-guide-sitemap.xml https://www.agilicus.com/ufaq-sitemap.xml

This command will:

  1. Fetch both sitemap files from www.agilicus.com
  2. Extract all URLs from both sitemaps
  3. Create a directory structure like:
    agilicus.com/
    ├── anyx-guide/
    │   ├── getting-started.md
    │   ├── installation.md
    │   └── configuration.md
    └── ufaq/
        ├── troubleshooting.md
        ├── common-issues.md
        └── support.md
    
  4. Download each URL and convert it to clean markdown format

Command Options

  • -v, --verbose: Enable detailed output showing progress and debugging information
  • -o, --output-dir TEXT: Specify output directory (default: current directory)
  • --dry-run: Show what would be downloaded without actually fetching content
  • --help: Show help message and exit
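
To make the option list concrete, here is a minimal sketch of a parser that accepts the same flags. The tool itself is built on click; argparse is swapped in here only to keep the sketch dependency-free, and `build_parser` is a hypothetical name, not part of dl-md.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (argparse stand-in for click)."""
    parser = argparse.ArgumentParser(
        prog="dl",
        description="Download pages listed in sitemaps and save them as markdown.",
    )
    # One or more sitemap URLs, matching: dl <sitemap-url> [<sitemap-url> ...]
    parser.add_argument("sitemap_urls", nargs="+", metavar="sitemap-url")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable detailed output")
    parser.add_argument("-o", "--output-dir", default=".",
                        help="Output directory (default: current directory)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Show what would be downloaded without fetching")
    # argparse adds --help automatically.
    return parser
```

Parsing `["--dry-run", "-o", "./downloads", "https://example.com/sitemap.xml"]` with this parser yields a namespace with `dry_run` set, `output_dir` equal to `./downloads`, and a one-element `sitemap_urls` list.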

Examples

Verbose output with custom directory:

dl --verbose --output-dir ./downloads https://example.com/sitemap.xml

Dry run to preview structure:

dl --dry-run https://example.com/sitemap.xml

Multiple sitemaps:

dl https://site1.com/sitemap.xml https://site2.com/sitemap.xml

Directory Structure

The tool creates directories based on URL structure:

URL                                   Directory            Filename
https://www.example.com/blog/post1    example.com/blog/    post1.md
https://example.com/docs/guide        example.com/docs/    guide.md
https://example.com/                  example.com/         index.md
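
The mapping in the table above can be sketched with the standard library alone. This is an illustration of the rule the rows imply, not the code in cli.py; `url_to_path` is a hypothetical name. It assumes that a leading "www." is dropped and that the site root maps to index.md, as the rows suggest.

```python
from urllib.parse import urlparse


def url_to_path(url: str) -> tuple[str, str]:
    """Return (directory, filename) for a page URL, following the table's rows."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Assumption: a leading "www." is stripped, as in the first table row.
    if host.startswith("www."):
        host = host[4:]
    # Keep only non-empty path segments: "/blog/post1" -> ["blog", "post1"]
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        # The site root has no final segment, so it becomes index.md.
        return f"{host}/", "index.md"
    *parents, leaf = segments
    directory = "/".join([host, *parents]) + "/"
    return directory, f"{leaf}.md"
```

Query strings, ports, and file extensions already present in the URL are ignored in this sketch; how the tool treats them is not stated here.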

How It Works

  1. Sitemap Parsing: Uses trafilatura's sitemap_search() to extract URLs from XML sitemaps
  2. URL Processing: Parses each URL to determine directory structure and filename
  3. Content Fetching: Downloads each page using trafilatura's fetch_url()
  4. Markdown Conversion: Converts HTML content to clean markdown using trafilatura's extract()
  5. File Organization: Saves markdown files in organized directory structure
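
The five steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual cli.py: `save_pages`, `run_with_trafilatura`, and the `target_for` callback are hypothetical names, and passing `output_format="markdown"` to `extract()` is an assumption about how the conversion is requested (it needs a recent trafilatura release).

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def save_pages(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    convert: Callable[[str], Optional[str]],
    target_for: Callable[[str], Path],
) -> list[Path]:
    """Fetch each URL, convert it to markdown, and write it to its target path."""
    written = []
    for url in urls:
        html = fetch(url)                            # step 3: content fetching
        markdown = convert(html) if html else None   # step 4: markdown conversion
        if not markdown:
            continue  # nothing usable for this URL; move on
        path = target_for(url)                       # step 2: URL -> file path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(markdown, encoding="utf-8")  # step 5: file organization
        written.append(path)
    return written


def run_with_trafilatura(sitemap_url: str, target_for: Callable[[str], Path]) -> list[Path]:
    """Wire the loop above to the trafilatura functions named in the list."""
    # Imported lazily so the rest of the sketch loads without trafilatura installed.
    from trafilatura import extract, fetch_url
    from trafilatura.sitemaps import sitemap_search

    urls = sitemap_search(sitemap_url)               # step 1: sitemap parsing
    return save_pages(
        urls,
        fetch=fetch_url,
        # Assumption: markdown output is requested via output_format.
        convert=lambda html: extract(html, output_format="markdown"),
        target_for=target_for,
    )
```

Passing the fetch and convert steps in as callables keeps the loop testable without network access; the trafilatura wiring lives in one small function.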

Development

Running Tests

poetry run pytest

Running Tests with Coverage

poetry run pytest --cov=dl_md --cov-report=term-missing

Project Structure

dl-md/
├── dl_md/
│   ├── __init__.py
│   └── cli.py          # Main CLI implementation
├── tests/
│   ├── __init__.py
│   └── test_cli.py     # Comprehensive test suite
├── pyproject.toml      # Project configuration
├── poetry.lock         # Dependency lock file
└── README.md           # This file

Dependencies

  • click: Command-line interface framework
  • trafilatura: Web scraping and content extraction
  • requests: HTTP library for web requests
  • pytest: Testing framework (development)
  • pytest-cov: Coverage reporting (development)

Error Handling

The tool gracefully handles various error conditions:

  • Network errors: Continues processing other URLs if one fails
  • Invalid sitemaps: Reports errors and continues with other sitemaps
  • Content extraction failures: Logs failures and continues processing
  • File system errors: Reports permission or disk space issues
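
The "continue on failure" behaviour described above can be sketched as a loop that records each error instead of stopping. This is a generic illustration, not the tool's own code; `process_all` and `handle` are hypothetical names.

```python
from typing import Callable, Iterable


def process_all(
    urls: Iterable[str], handle: Callable[[str], None]
) -> tuple[list[str], list[tuple[str, str]]]:
    """Run handle(url) for every URL, recording failures instead of aborting."""
    succeeded = []
    failed = []
    for url in urls:
        try:
            handle(url)
        except Exception as exc:
            # Broad on purpose: one bad URL must not end the whole run.
            failed.append((url, f"{type(exc).__name__}: {exc}"))
            continue
        succeeded.append(url)
    return succeeded, failed
```

The returned failure list can then be reported at the end of the run, so a single network or file system error is visible without hiding the pages that did succeed.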

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite: poetry run pytest
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues and questions:

  • Run with the -v flag and check the verbose output for debugging information
  • Review the test suite for usage examples
  • Open an issue on the project repository

