dl-md

A command-line tool that downloads the pages listed in a website's sitemaps and converts them to markdown, saved in an organized directory structure.

Overview

dl-md extracts URLs from XML sitemaps and downloads each page as markdown, automatically organizing the content into a directory structure that mirrors the website's URL hierarchy.
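To illustrate what "extracting URLs from an XML sitemap" involves, here is a minimal standard-library sketch. It is not how dl-md does it (the tool relies on trafilatura for this step); the function name `extract_urls` and the sample sitemap are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/guide</loc></url>
  <url><loc>https://example.com/blog/post1</loc></url>
</urlset>"""


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element found under a <url> entry."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(extract_urls(SAMPLE))
```

This sketch only handles a flat `urlset`; sitemap index files that point to other sitemaps would need an extra step.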

Features

  • Sitemap Processing: Extracts URLs from XML sitemaps using trafilatura
  • Automatic Directory Structure: Creates directories based on URL paths
  • Markdown Conversion: Downloads and converts web pages to clean markdown format
  • Progress Reporting: Shows real-time progress as URLs are processed
  • Dry Run Mode: Preview what would be downloaded without actually fetching content
  • Verbose Output: Detailed logging for troubleshooting
  • Comprehensive Testing: Test suite covering 85% of the code

Installation

Using Poetry (Recommended)

git clone https://github.com/donbowman/dl-md
cd dl-md
poetry install

Using pip

pip install dl-md

Usage

Basic Usage

dl <sitemap-url> [<sitemap-url> ...]

Example

Download all pages of the 'anyx-guide' and 'ufaq' post types from the Agilicus website:

dl https://www.agilicus.com/anyx-guide-sitemap.xml https://www.agilicus.com/ufaq-sitemap.xml

This command will:

  1. Fetch both sitemap files from www.agilicus.com
  2. Extract all URLs from both sitemaps
  3. Create a directory structure like:
    agilicus.com/
    ├── anyx-guide/
    │   ├── getting-started.md
    │   ├── installation.md
    │   └── configuration.md
    └── ufaq/
        ├── troubleshooting.md
        ├── common-issues.md
        └── support.md
    
  4. Download each URL and convert it to clean markdown format

Command Options

  • -v, --verbose: Enable detailed output showing progress and debugging information
  • -o, --output-dir TEXT: Specify output directory (default: current directory)
  • --dry-run: Show what would be downloaded without actually fetching content
  • --help: Show help message and exit
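The options above could be declared roughly as follows. dl-md itself is built on click (see Dependencies); this sketch uses the standard library's argparse as a stand-in so that it runs without extra packages, and the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the same flags as listed above (argparse stand-in for click)."""
    parser = argparse.ArgumentParser(
        prog="dl", description="Download sitemap pages as markdown."
    )
    # One or more sitemap URLs are required.
    parser.add_argument("sitemap_urls", nargs="+", metavar="sitemap-url")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable detailed output")
    parser.add_argument("-o", "--output-dir", default=".",
                        help="Output directory (default: current directory)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Show what would be downloaded without fetching")
    return parser


args = build_parser().parse_args(
    ["--dry-run", "-o", "./downloads", "https://example.com/sitemap.xml"]
)
print(args.dry_run, args.output_dir, args.sitemap_urls)
```

argparse adds `--help` automatically, which matches the last option in the list.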

Examples

Verbose output with custom directory:

dl --verbose --output-dir ./downloads https://example.com/sitemap.xml

Dry run to preview structure:

dl --dry-run https://example.com/sitemap.xml

Multiple sitemaps:

dl https://site1.com/sitemap.xml https://site2.com/sitemap.xml

Directory Structure

The tool maps each URL to a directory (host plus path) and a filename (last path segment, or index.md for the site root):

URL                                   Directory            Filename
https://www.example.com/blog/post1    example.com/blog/    post1.md
https://example.com/docs/guide        example.com/docs/    guide.md
https://example.com/                  example.com/         index.md
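The mapping in the table can be sketched with the standard library as below. This is an illustration that reproduces the three rows shown, not the tool's actual code; the function name `url_to_path` is hypothetical, and the handling of a leading `www.` and of the site root are assumptions read off the table.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_path(url: str) -> PurePosixPath:
    """Map a URL to a relative markdown path, following the table above."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Assumption: a leading "www." is dropped, as in the first table row.
    if host.startswith("www."):
        host = host[len("www."):]
    # Split the URL path into its non-empty segments.
    segments = [s for s in parsed.path.split("/") if s]
    # Assumption: the site root (no path segments) is saved as index.md.
    name = segments.pop() if segments else "index"
    return PurePosixPath(host, *segments, name + ".md")


for example in (
    "https://www.example.com/blog/post1",
    "https://example.com/docs/guide",
    "https://example.com/",
):
    print(example, "->", url_to_path(example))
```

How the real tool treats URLs with a trailing slash, query strings, or file extensions is not stated here, so the sketch makes no attempt to cover them.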

How It Works

  1. Sitemap Parsing: Uses trafilatura's sitemap_search() to extract URLs from XML sitemaps
  2. URL Processing: Parses each URL to determine directory structure and filename
  3. Content Fetching: Downloads each page using trafilatura's fetch_url()
  4. Markdown Conversion: Converts HTML content to clean markdown using trafilatura's extract()
  5. File Organization: Saves markdown files in organized directory structure
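Steps 2 to 5 can be sketched as a single loop. To keep the example runnable without network access or third-party packages, the fetch, convert, and path-mapping steps are passed in as callables; the function name `save_pages` and its parameters are hypothetical and do not reflect dl-md's internals.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def save_pages(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    convert: Callable[[str], Optional[str]],
    path_for: Callable[[str], str],
    out_dir: Path,
    dry_run: bool = False,
) -> List[Path]:
    """Fetch each URL, convert it to markdown, and write it under out_dir."""
    written: List[Path] = []
    for url in urls:
        target = out_dir / path_for(url)          # step 2: URL -> file path
        if dry_run:
            print(f"would write {target}")        # preview only, no fetching
            continue
        html = fetch(url)                         # step 3: download the page
        markdown = convert(html) if html else None  # step 4: HTML -> markdown
        if not markdown:
            continue                              # nothing usable, skip this URL
        target.parent.mkdir(parents=True, exist_ok=True)  # step 5: save file
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With trafilatura installed, `fetch` could be `trafilatura.fetch_url` and `convert` a small wrapper such as `lambda html: trafilatura.extract(html, output_format="markdown")` (the markdown output format is available in recent trafilatura versions). Whether dl-md passes exactly these arguments is an assumption.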

Development

Running Tests

poetry run pytest

Running Tests with Coverage

poetry run pytest --cov=dl_md --cov-report=term-missing

Project Structure

dl-md/
├── dl_md/
│   ├── __init__.py
│   └── cli.py          # Main CLI implementation
├── tests/
│   ├── __init__.py
│   └── test_cli.py     # Comprehensive test suite
├── pyproject.toml      # Project configuration
├── poetry.lock         # Dependency lock file
└── README.md           # This file

Dependencies

  • click: Command-line interface framework
  • trafilatura: Web scraping and content extraction
  • requests: HTTP library for web requests
  • pytest: Testing framework (development)
  • pytest-cov: Coverage reporting (development)

Error Handling

The tool handles the following error conditions:

  • Network errors: Continues processing other URLs if one fails
  • Invalid sitemaps: Reports errors and continues with other sitemaps
  • Content extraction failures: Logs failures and continues processing
  • File system errors: Reports permission or disk space issues
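The "continue on failure" behaviour described above is commonly implemented by catching exceptions per item and collecting them for a final report. The following is a generic sketch of that pattern, not dl-md's code; `process_all` and the `flaky` handler are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple


def process_all(
    urls: Iterable[str], handle: Callable[[str], None]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run handle() on every URL; record failures instead of stopping."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for url in urls:
        try:
            handle(url)
        except Exception as exc:  # broad on purpose: one bad URL must not end the run
            failed.append((url, f"{type(exc).__name__}: {exc}"))
            continue
        succeeded.append(url)
    return succeeded, failed


def flaky(url: str) -> None:
    """Stand-in handler that fails for one URL to show the behaviour."""
    if "bad" in url:
        raise ConnectionError("timed out")


ok, errors = process_all(
    ["https://example.com/a", "https://example.com/bad", "https://example.com/b"],
    flaky,
)
print(ok)
print(errors)
```

Collecting failures in a list makes it easy to print a summary at the end or to surface them only in verbose mode.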

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite: poetry run pytest
  6. Submit a pull request

License

This project is licensed under the MIT License; see the LICENSE file for details.

Support

For issues and questions:

  • Run the command with the -v flag and check the verbose output for debugging information
  • Review the test suite for usage examples
  • Open an issue on the project repository
