Sitemap Toolkit
A toolkit for parsing sitemaps and converting web content to Markdown. It covers two tasks: extracting URLs and their metadata from a sitemap, and converting web pages to Markdown in bulk.
🚀 Development Setup
- Clone the repository
- Install development dependencies:
pip install -e ".[dev]"
- Run tests:
pytest
🧪 Examples
The examples directory contains sample code demonstrating how to use the package:
Local Example
from sitemap_markitdown.sitemap_parser import SitemapParser
# Parse a local sitemap file
with open('sitemap.xml', 'r') as f:
    content = f.read()
parser = SitemapParser(content)
urls = parser.parse()
Remote Example
import requests
from sitemap_markitdown.sitemap_parser import SitemapParser
# Parse a remote sitemap
response = requests.get('https://example.com/sitemap.xml')
parser = SitemapParser(response.text)
urls = parser.parse()
Check out the complete examples in the examples directory:
- sitemap.xml: Sample sitemap file
- parse_sitemap.py: Script demonstrating both local and remote sitemap parsing
🎓 Tutorials
Quick Start
- Install the package:
pip install sitemap-markitdown
- Parse a sitemap:
sitemap-markitdown parse https://example.com/sitemap.xml --format json
Basic Sitemap Parsing Tutorial
- Parse a sitemap and save to JSON:
sitemap-markitdown parse https://example.com/sitemap.xml -o output.json
- Parse a local sitemap file to CSV:
sitemap-markitdown parse ./local-sitemap.xml --format csv -o urls.csv
Converting Web Pages to Markdown
- Create a CSV file with URLs (must have a 'loc' column)
- Run the conversion:
sitemap-markitdown process-csv --input-csv urls.csv --output-folder markdown_files
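If you are building the input CSV by hand rather than from `parse` output, a minimal sketch with Python's standard `csv` module could look like the following. The helper name `write_url_csv` and the sample URLs are illustrative assumptions, not part of the package; the only requirement taken from this guide is the 'loc' column.

```python
import csv
import io
from typing import Iterable, TextIO


def write_url_csv(urls: Iterable[str], out: TextIO) -> int:
    """Write URLs to `out` as a CSV with a single 'loc' column; return the row count."""
    writer = csv.DictWriter(out, fieldnames=["loc"], lineterminator="\n")
    writer.writeheader()
    count = 0
    for url in urls:
        writer.writerow({"loc": url})
        count += 1
    return count


# Demonstration with an in-memory buffer; the URLs are placeholders.
buffer = io.StringIO()
rows_written = write_url_csv(
    ["https://example.com/", "https://example.com/about"], buffer
)
print(buffer.getvalue())
```

To produce a real file, pass a handle opened with `open("urls.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.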
📚 How-to Guides
How to Parse a Sitemap from URL
To parse a sitemap and extract all URLs with their metadata:
sitemap-markitdown parse https://example.com/sitemap.xml --format json
Options:
- --format: Choose between 'json' or 'csv' output (default: json)
- --output: Specify output file path
- --llm-model: Optionally specify an LLM model for enhanced processing
How to Process Multiple URLs to Markdown
To convert a list of URLs from a CSV file to Markdown format:
sitemap-markitdown process-csv \
--input-csv urls.csv \
--output-folder markdown_output \
--output-csv processing_report.csv
The CSV file should contain a 'loc' column with the URLs to process.
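Before starting a long run, it can help to confirm that the input really has a 'loc' column. The check below is a hedged sketch using only the standard library; `read_loc_column` is a hypothetical helper, and how the tool itself reacts to a missing column is not documented here.

```python
import csv
import io
from typing import List, TextIO


def read_loc_column(source: TextIO) -> List[str]:
    """Return the non-empty 'loc' values, or raise ValueError if the column is missing."""
    reader = csv.DictReader(source)
    if reader.fieldnames is None or "loc" not in reader.fieldnames:
        raise ValueError("input CSV has no 'loc' column")
    urls = []
    for row in reader:
        value = (row["loc"] or "").strip()
        if value:  # skip rows where the URL cell is empty
            urls.append(value)
    return urls


# Sample data made up for illustration; extra columns are simply ignored.
sample = "loc,lastmod\nhttps://example.com/,2024-01-23\nhttps://example.com/about,\n"
urls = read_loc_column(io.StringIO(sample))
print(urls)
```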
How to Customize Output Formats
- For JSON output with pretty printing:
sitemap-markitdown parse sitemap.xml --format json
- For CSV output with all metadata:
sitemap-markitdown parse sitemap.xml --format csv
📖 Reference
CLI Commands
parse
Parse sitemap from file or URL.
Arguments:
source: URL or file path to sitemap
Options:
- --output, -o: Output file path
- --format, -f: Output format (json/csv)
- --llm-model: LLM model for processing
process-csv
Convert URLs from CSV to Markdown.
Options:
- --input-csv: Path to input CSV file (required)
- --output-folder: Folder for Markdown files (default: "outputs")
- --output-csv: Path for processing report CSV
Output Formats
JSON Format
[
{
"loc": "https://example.com/page",
"lastmod": "2024-01-23",
"changefreq": "daily",
"priority": "0.8"
}
]
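Because the JSON output is a plain list of objects, it can be consumed with the standard `json` module. The sketch below assumes the shape shown above; the second record and the helper `high_priority_urls` are made up for illustration. Note that `priority` is a string in this format, so it is converted before comparing.

```python
import json

# First record matches the documented format; the second is invented sample data.
sample_output = """
[
  {"loc": "https://example.com/page", "lastmod": "2024-01-23",
   "changefreq": "daily", "priority": "0.8"},
  {"loc": "https://example.com/old", "lastmod": "2023-06-01",
   "changefreq": "yearly", "priority": "0.3"}
]
"""


def high_priority_urls(entries, threshold=0.5):
    """Return 'loc' values whose priority, parsed as a float, is at least `threshold`."""
    selected = []
    for entry in entries:
        # A missing priority falls back to 0.5, the default in the sitemap protocol.
        priority = float(entry.get("priority") or 0.5)
        if priority >= threshold:
            selected.append(entry["loc"])
    return selected


entries = json.loads(sample_output)
selected_urls = high_priority_urls(entries)
print(selected_urls)  # ['https://example.com/page']
```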
CSV Format
Contains columns:
- loc
- lastmod
- changefreq
- priority
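The two formats carry the same four fields, so records in the JSON shape can be turned into the CSV layout with the standard `csv` module. This is a sketch under that assumption, not the tool's own writer; `records_to_csv` is a hypothetical helper, and leaving missing fields blank is a choice made here.

```python
import csv
import io

COLUMNS = ["loc", "lastmod", "changefreq", "priority"]


def records_to_csv(records):
    """Serialize dict records to CSV text using the four documented columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # unknown keys are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = records_to_csv(
    [
        {"loc": "https://example.com/page", "lastmod": "2024-01-23",
         "changefreq": "daily", "priority": "0.8"},
        {"loc": "https://example.com/other"},  # optional fields omitted
    ]
)
print(csv_text)
```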
Dependencies
Core dependencies:
- click: CLI interface
- lxml: XML processing
- markitdown: Web to Markdown conversion
- tqdm: Progress bars
Optional dependencies:
- openai: Enhanced processing capabilities
Development dependencies:
- pytest: Testing framework
- pytest-cov: Code coverage reporting
🤔 Explanation
Project Architecture
The toolkit is built with modularity in mind:
- SitemapParser: Core component for XML parsing
  - Handles both local and remote sitemaps
  - Extracts metadata using XPath
  - Validates against sitemap schema
- CLI Interface: Built with Click
  - Provides intuitive command structure
  - Handles errors gracefully
  - Supports multiple output formats
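To illustrate the XPath-based metadata extraction mentioned for SitemapParser, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. It is not the package's implementation (the package lists lxml as its XML dependency); `extract_entries` and the sample document are assumptions for illustration. The key detail is that sitemap elements live in the `http://www.sitemaps.org/schemas/sitemap/0.9` namespace, so queries must be namespace-aware.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
FIELDS = ("loc", "lastmod", "changefreq", "priority")

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page</loc>
    <lastmod>2024-01-23</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/other</loc>
  </url>
</urlset>
"""


def extract_entries(xml_text):
    """Return one dict per <url> element, containing only the fields that are present."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for field in FIELDS:
            value = url.findtext(f"sm:{field}", default=None, namespaces=NS)
            if value is not None:
                entry[field] = value.strip()
        entries.append(entry)
    return entries


parsed_entries = extract_entries(SAMPLE)
print(parsed_entries)
```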
Why Use Sitemap Parsing?
Sitemaps are essential for:
- SEO optimization
- Content discovery
- Site structure analysis
- Bulk content processing
This toolkit simplifies these tasks by providing:
- Automated parsing
- Flexible output formats
- Bulk processing capabilities
- Progress tracking
Design Decisions
- Command Structure
  - Separate commands for parsing and processing
  - Consistent option naming
  - Progress indicators for long operations
- Output Formats
  - JSON for programmatic use
  - CSV for spreadsheet compatibility
  - Markdown for content preservation
- Error Handling
  - Graceful failure modes
  - Detailed error reporting
  - Progress preservation in long operations
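The error-handling goals above can be sketched as a loop that records one report row per URL instead of stopping at the first failure. Everything in this block is hypothetical: `process_urls`, `fake_convert`, and the `status` and `error` column names are illustrative and do not describe the toolkit's actual report layout.

```python
from typing import Callable, Dict, Iterable, List


def process_urls(
    urls: Iterable[str],
    convert: Callable[[str], str],
    already_done: frozenset = frozenset(),
) -> List[Dict[str, str]]:
    """Run convert(url) for each URL and collect a status row per URL."""
    report = []
    for url in urls:
        if url in already_done:
            # Preserve earlier progress: do not redo finished work.
            report.append({"loc": url, "status": "skipped", "error": ""})
            continue
        try:
            convert(url)
        except Exception as exc:
            # Graceful failure: record the error and move on to the next URL.
            report.append({"loc": url, "status": "failed", "error": str(exc)})
        else:
            report.append({"loc": url, "status": "ok", "error": ""})
    return report


def fake_convert(url: str) -> str:
    """Stand-in converter that fails for URLs containing 'broken'."""
    if "broken" in url:
        raise RuntimeError("conversion failed")
    return "# converted"


report = process_urls(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_convert,
    already_done=frozenset({"https://example.com/a"}),
)
for row in report:
    print(row)
```

Catching the broad `Exception` is deliberate in this sketch, since a single bad page should not abort a bulk run; a real implementation might narrow it to network and parsing errors.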