Sitemap Toolkit

A toolkit for parsing sitemaps and converting web content to Markdown. It covers both sitemap parsing and bulk processing of web pages.

🚀 Development Setup

  1. Clone the repository
  2. Install development dependencies:
pip install -e ".[dev]"
  3. Run tests:
pytest

🧪 Examples

The examples directory contains sample code demonstrating how to use the package:

Local Example

from sitemap_markitdown.sitemap_parser import SitemapParser

# Parse a local sitemap file
with open('sitemap.xml', 'r') as f:
    content = f.read()
    
parser = SitemapParser(content)
urls = parser.parse()

Remote Example

import requests
from sitemap_markitdown.sitemap_parser import SitemapParser

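# Parse a remote sitemap
response = requests.get('https://example.com/sitemap.xml', timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
parser = SitemapParser(response.text)
urls = parser.parse()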

Check out the complete examples in the examples directory:

  • sitemap.xml: Sample sitemap file
  • parse_sitemap.py: Script demonstrating both local and remote sitemap parsing

🎓 Tutorials

Quick Start

  1. Install the package:
pip install sitemap-markitdown
  2. Parse a sitemap:
sitemap-markitdown parse https://example.com/sitemap.xml --format json

Basic Sitemap Parsing Tutorial

  1. Parse a sitemap and save to JSON:
sitemap-markitdown parse https://example.com/sitemap.xml -o output.json
  2. Parse a local sitemap file to CSV:
sitemap-markitdown parse ./local-sitemap.xml --format csv -o urls.csv

Converting Web Pages to Markdown

  1. Create a CSV file with URLs (must have a 'loc' column)
  2. Run the conversion:
sitemap-markitdown process-csv --input-csv urls.csv --output-folder markdown_files

📚 How-to Guides

How to Parse a Sitemap from URL

To parse a sitemap and extract all URLs with their metadata:

sitemap-markitdown parse https://example.com/sitemap.xml --format json

Options:

  • --format: Choose between 'json' or 'csv' output (default: json)
  • --output: Specify output file path
  • --llm-model: Optionally specify an LLM model for enhanced processing

How to Process Multiple URLs to Markdown

To convert a list of URLs from a CSV file to Markdown format:

sitemap-markitdown process-csv \
    --input-csv urls.csv \
    --output-folder markdown_output \
    --output-csv processing_report.csv

The CSV file should contain a 'loc' column with the URLs to process.
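
Checking the input file before a long run can save time. The following is a minimal sketch, using only the standard library, of how the 'loc' column requirement could be verified. The helper name `read_loc_column` and the sample data are assumptions made for illustration; they are not part of the package.

```python
import csv
import io


def read_loc_column(csv_text: str) -> list[str]:
    """Return the non-empty 'loc' values from CSV text; raise if the column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "loc" not in reader.fieldnames:
        raise ValueError("input CSV must contain a 'loc' column")
    # Skip rows where the 'loc' cell is missing or blank.
    return [row["loc"].strip() for row in reader if row["loc"] and row["loc"].strip()]


# Illustrative sample data (hypothetical URLs).
sample = "loc,lastmod\nhttps://example.com/a,2024-01-23\nhttps://example.com/b,\n"
urls = read_loc_column(sample)
print(urls)
```

To check a real file, read it with `open(path, newline="", encoding="utf-8")` and pass its contents to the helper.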

How to Customize Output Formats

  1. For JSON output with pretty printing:
sitemap-markitdown parse sitemap.xml --format json
  2. For CSV output with all metadata:
sitemap-markitdown parse sitemap.xml --format csv

📖 Reference

CLI Commands

parse

Parse a sitemap from a file or URL.

Arguments:

  • source: URL or file path to sitemap

Options:

  • --output, -o: Output file path
  • --format, -f: Output format (json/csv)
  • --llm-model: LLM model for processing

process-csv

Convert URLs from CSV to Markdown.

Options:

  • --input-csv: Path to input CSV file (required)
  • --output-folder: Folder for Markdown files (default: "outputs")
  • --output-csv: Path for processing report CSV

Output Formats

JSON Format

[
  {
    "loc": "https://example.com/page",
    "lastmod": "2024-01-23",
    "changefreq": "daily",
    "priority": "0.8"
  }
]
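
Output in this shape can be consumed with the standard `json` module. The sketch below assumes the structure shown above; the second record is made-up sample data added so the filter has something to exclude. Note that `priority` is a string in the sample, so it is converted before comparison.

```python
import json

# First record mirrors the sample above; the second is hypothetical.
records_json = """[
  {"loc": "https://example.com/page", "lastmod": "2024-01-23",
   "changefreq": "daily", "priority": "0.8"},
  {"loc": "https://example.com/old", "lastmod": "2023-06-01",
   "changefreq": "yearly", "priority": "0.3"}
]"""

records = json.loads(records_json)

# Keep only URLs whose priority is at least 0.5; a missing priority counts as 0.
high_priority = [
    record["loc"]
    for record in records
    if float(record.get("priority") or 0) >= 0.5
]
print(high_priority)
```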

CSV Format

Contains columns:

  • loc
  • lastmod
  • changefreq
  • priority
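
As a rough illustration of how records map onto these four columns, here is a sketch using the standard `csv` module. It is not the package's own writer; `records_to_csv` is a hypothetical helper, and filling missing fields with empty strings is an assumption.

```python
import csv
import io

FIELDS = ["loc", "lastmod", "changefreq", "priority"]


def records_to_csv(records):
    """Serialize sitemap records to CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing or empty fields become empty cells (an assumption of this sketch).
        writer.writerow({field: record.get(field) or "" for field in FIELDS})
    return buffer.getvalue()


csv_text = records_to_csv([
    {"loc": "https://example.com/page", "lastmod": "2024-01-23",
     "changefreq": "daily", "priority": "0.8"},
    {"loc": "https://example.com/other"},
])
print(csv_text)
```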

Dependencies

Core dependencies:

  • click: CLI interface
  • lxml: XML processing
  • markitdown: Web to Markdown conversion
  • tqdm: Progress bars

Optional dependencies:

  • openai: Enhanced processing capabilities

Development dependencies:

  • pytest: Testing framework
  • pytest-cov: Code coverage reporting

🤔 Explanation

Project Architecture

The toolkit is built with modularity in mind:

  1. SitemapParser: Core component for XML parsing

    • Handles both local and remote sitemaps
    • Extracts metadata using XPath
    • Validates against sitemap schema
  2. CLI Interface: Built with Click

    • Provides intuitive command structure
    • Handles errors gracefully
    • Supports multiple output formats
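
To make the metadata extraction step concrete, here is a minimal sketch of reading `<url>` entries from a sitemap. It deliberately uses the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) rather than lxml, and it is not the package's `SitemapParser` implementation. The function name `extract_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def extract_urls(xml_text):
    """Return one dict per <url> element, holding whichever fields are present."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for field in FIELDS:
            value = url.findtext(f"sm:{field}", default=None, namespaces=NS)
            if value is not None:
                entry[field] = value.strip()
        entries.append(entry)
    return entries


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page</loc>
    <lastmod>2024-01-23</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

entries = extract_urls(sample)
print(entries)
```

The sketch does not validate against the sitemap schema and does not follow sitemap index files; it only shows the namespace-aware lookup.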

Why Use Sitemap Parsing?

Sitemaps are essential for:

  • Search engine optimization (SEO)
  • Content discovery
  • Site structure analysis
  • Bulk content processing

This toolkit simplifies these tasks by providing:

  • Automated parsing
  • Flexible output formats
  • Bulk processing capabilities
  • Progress tracking

Design Decisions

  1. Command Structure

    • Separate commands for parsing and processing
    • Consistent option naming
    • Progress indicators for long operations
  2. Output Formats

    • JSON for programmatic use
    • CSV for spreadsheet compatibility
    • Markdown for content preservation
  3. Error Handling

    • Graceful failure modes
    • Detailed error reporting
    • Progress preservation in long operations
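
One way to picture the error-handling goals above is a batch loop that records a result for every URL and keeps going when one fails. This is a sketch under stated assumptions, not the package's code: `process_urls`, `fake_convert`, and the report keys (`loc`, `status`, `error`) are hypothetical names chosen for illustration.

```python
def process_urls(urls, convert):
    """Apply convert to each URL and record one report row per URL."""
    report = []
    for url in urls:
        try:
            convert(url)
        except Exception as exc:
            # One bad URL should not stop the batch; record the failure and continue.
            report.append({"loc": url, "status": "error", "error": str(exc)})
        else:
            report.append({"loc": url, "status": "ok", "error": ""})
    return report


def fake_convert(url):
    # Hypothetical stand-in for a real page-to-Markdown conversion.
    if "broken" in url:
        raise ValueError("could not fetch page")
    return f"# {url}"


report = process_urls(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/c"],
    fake_convert,
)
for row in report:
    print(row)
```

Because every URL gets a row, a later run could skip the ones already marked "ok", which is one possible reading of progress preservation.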
