dl-md
A command-line tool for downloading and converting website content from sitemaps to markdown format with organized directory structure.
Overview
dl-md extracts URLs from XML sitemaps and downloads each page as markdown, automatically organizing the content into a directory structure that mirrors the website's URL hierarchy.
Features
- Sitemap Processing: Extracts URLs from XML sitemaps using trafilatura
- Automatic Directory Structure: Creates directories based on URL paths
- Markdown Conversion: Downloads and converts web pages to clean markdown format
- Progress Reporting: Shows real-time progress as URLs are processed
- Dry Run Mode: Preview what would be downloaded without actually fetching content
- Verbose Output: Detailed logging for troubleshooting
- Comprehensive Testing: Full test suite with 85% code coverage
Installation
Using Poetry (Recommended)
git clone https://github.com/donbowman/dl-md
cd dl-md
poetry install
Using pip
pip install dl-md
Usage
Basic Usage
dl <sitemap-url> [<sitemap-url> ...]
Example
Download all 'anyx-guide' and 'ufaq' post types from the Agilicus website:
dl https://www.agilicus.com/anyx-guide-sitemap.xml https://www.agilicus.com/ufaq-sitemap.xml
This command will:
- Fetch both sitemap files from www.agilicus.com
- Extract all URLs from both sitemaps
- Create a directory structure like:
agilicus.com/
├── anyx-guide/
│   ├── getting-started.md
│   ├── installation.md
│   └── configuration.md
└── ufaq/
    ├── troubleshooting.md
    ├── common-issues.md
    └── support.md
- Download each URL and convert it to clean markdown format
Command Options
- -v, --verbose: Enable detailed output showing progress and debugging information
- -o, --output-dir TEXT: Specify output directory (default: current directory)
- --dry-run: Show what would be downloaded without actually fetching content
- --help: Show help message and exit
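For readers who want to see how these options fit together, here is a minimal sketch of an equivalent parser. The project lists click as its CLI framework; this sketch swaps in the standard library's argparse so that it runs without extra dependencies. The function name `build_parser` and the help strings are assumptions for illustration, not the tool's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="dl",
        description="Download sitemap URLs as markdown.",
    )
    # One or more sitemap URLs, as in: dl <sitemap-url> [<sitemap-url> ...]
    parser.add_argument("sitemap_urls", nargs="+", metavar="SITEMAP_URL")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable detailed output")
    # Default "." stands in for "current directory".
    parser.add_argument("-o", "--output-dir", default=".",
                        help="output directory")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be downloaded without fetching")
    # argparse adds --help automatically.
    return parser
```

Calling `build_parser().parse_args(["--dry-run", "https://example.com/sitemap.xml"])` returns a namespace with `dry_run` set and the URL in `sitemap_urls`.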
Examples
Verbose output with custom directory:
dl --verbose --output-dir ./downloads https://example.com/sitemap.xml
Dry run to preview structure:
dl --dry-run https://example.com/sitemap.xml
Multiple sitemaps:
dl https://site1.com/sitemap.xml https://site2.com/sitemap.xml
Directory Structure
The tool creates directories based on URL structure:
| URL | Directory | Filename |
|---|---|---|
| https://www.example.com/blog/post1 | example.com/blog/ | post1.md |
| https://example.com/docs/guide | example.com/docs/ | guide.md |
| https://example.com/ | example.com/ | index.md |
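The mapping in the table can be sketched in a few lines of Python. This is an approximation built only from the three rows above, not the tool's actual implementation: the helper name `url_to_path` is hypothetical, and the rules for stripping a leading `www.` and for using `index.md` at the site root are inferred from the examples.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_path(url: str) -> PurePosixPath:
    """Map a page URL to a relative markdown path (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Assumption: a leading "www." is dropped, as in the table's first row.
    if host.startswith("www."):
        host = host[4:]
    # Split the path into non-empty segments; "/" yields an empty list.
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        # The site root is saved as index.md.
        return PurePosixPath(host) / "index.md"
    # The last segment becomes the filename, the rest become directories.
    *dirs, leaf = segments
    return PurePosixPath(host, *dirs) / f"{leaf}.md"
```

How the real tool treats query strings, file extensions, or trailing slashes is not stated in this README; the sketch ignores the query string and treats a trailing slash like no slash.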
How It Works
- Sitemap Parsing: Uses trafilatura's sitemap_search() to extract URLs from XML sitemaps
- URL Processing: Parses each URL to determine directory structure and filename
- Content Fetching: Downloads each page using trafilatura's fetch_url()
- Markdown Conversion: Converts HTML content to clean markdown using trafilatura's extract()
- File Organization: Saves markdown files in organized directory structure
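The fetch-and-convert steps above might look roughly like the following. This is a sketch under assumptions, not the project's code: the function names `pages_to_markdown` and `trafilatura_pipeline` are hypothetical, the fetch and extract steps are passed in as callables so the loop can be exercised without network access, and `output_format="markdown"` assumes a trafilatura release recent enough to support it.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def pages_to_markdown(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (url, markdown) for each page that fetches and extracts cleanly."""
    for url in urls:
        html = fetch(url)  # expected to return None when the download fails
        if html is None:
            continue
        markdown = extract(html)  # may return None or "" if nothing is extracted
        if markdown:
            yield url, markdown


def trafilatura_pipeline(sitemap_url: str) -> Iterator[Tuple[str, str]]:
    """Wire the loop to trafilatura (imported lazily; needs network access)."""
    import trafilatura
    from trafilatura.sitemaps import sitemap_search

    urls = sitemap_search(sitemap_url)
    return pages_to_markdown(
        urls,
        fetch=trafilatura.fetch_url,
        # Assumption: markdown output is requested through output_format.
        extract=lambda html: trafilatura.extract(html, output_format="markdown"),
    )
```

Keeping the loop separate from the trafilatura calls is a design choice of this sketch; it lets a test substitute simple stand-ins for the network-bound functions.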
Development
Running Tests
poetry run pytest
Running Tests with Coverage
poetry run pytest --cov=dl_md --cov-report=term-missing
Project Structure
dl-md/
├── dl_md/
│ ├── __init__.py
│ └── cli.py # Main CLI implementation
├── tests/
│ ├── __init__.py
│ └── test_cli.py # Comprehensive test suite
├── pyproject.toml # Project configuration
├── poetry.lock # Dependency lock file
└── README.md # This file
Dependencies
- click: Command-line interface framework
- trafilatura: Web scraping and content extraction
- requests: HTTP library for web requests
- pytest: Testing framework (development)
- pytest-cov: Coverage reporting (development)
Error Handling
The tool gracefully handles various error conditions:
- Network errors: Continues processing other URLs if one fails
- Invalid sitemaps: Reports errors and continues with other sitemaps
- Content extraction failures: Logs failures and continues processing
- File system errors: Reports permission or disk space issues
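One common way to get this continue-on-failure behavior is to wrap each URL in its own try/except and collect the failures for a final report. The sketch below illustrates that pattern only; `process_all` and its return shape are hypothetical and may differ from what the tool does internally.

```python
from typing import Callable, Iterable, List, Tuple


def process_all(
    urls: Iterable[str],
    handle: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run handle(url) for each URL, recording failures instead of stopping."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for url in urls:
        try:
            handle(url)
        except Exception as exc:  # broad on purpose: one bad URL must not abort the run
            failed.append((url, f"{type(exc).__name__}: {exc}"))
        else:
            succeeded.append(url)
    return succeeded, failed
```

The caller can then print the `failed` list at the end of the run, so network, extraction, and file system errors are reported without interrupting the remaining downloads.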
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite: poetry run pytest
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For issues and questions:
- Check the verbose output with the -v flag for debugging information
- Review the test suite for usage examples
- Open an issue on the project repository