Sitemap Toolkit

A toolkit for parsing sitemaps and converting web content to Markdown. It covers two tasks: extracting URLs and their metadata from a sitemap, and converting the listed pages to Markdown files in bulk.

🚀 Development Setup

Clone the repository
Install development dependencies:

pip install -e ".[dev]"

Run tests:

pytest

🧪 Examples

The examples directory contains sample code demonstrating how to use the package:

Local Example

from sitemap_markitdown.sitemap_parser import SitemapParser

# Parse a local sitemap file
with open('sitemap.xml', 'r') as f:
    content = f.read()
    
parser = SitemapParser(content)
urls = parser.parse()

Remote Example

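import requests
from sitemap_markitdown.sitemap_parser import SitemapParser

# Fetch a remote sitemap; fail early on HTTP errors instead of parsing an error page
response = requests.get('https://example.com/sitemap.xml', timeout=30)
response.raise_for_status()

# Parse the downloaded XML
parser = SitemapParser(response.text)
urls = parser.parse()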

Check out the complete examples in the examples directory:

sitemap.xml: Sample sitemap file
parse_sitemap.py: Script demonstrating both local and remote sitemap parsing

🎓 Tutorials

Quick Start

Install the package:

pip install sitemap-markitdown

Parse a sitemap:

sitemap-markitdown parse https://example.com/sitemap.xml --format json

Basic Sitemap Parsing Tutorial

Parse a sitemap and save to JSON:

sitemap-markitdown parse https://example.com/sitemap.xml -o output.json

Parse a local sitemap file to CSV:

sitemap-markitdown parse ./local-sitemap.xml --format csv -o urls.csv

Converting Web Pages to Markdown

Create a CSV file with URLs (must have a 'loc' column)
Run the conversion:

sitemap-markitdown process-csv --input-csv urls.csv --output-folder markdown_files

📚 How-to Guides

How to Parse a Sitemap from URL

To parse a sitemap and extract all URLs with their metadata:

sitemap-markitdown parse https://example.com/sitemap.xml --format json

Options:

--format: Output format, either 'json' or 'csv' (default: json)
--output: Specify output file path
--llm-model: Optionally specify an LLM model for enhanced processing

How to Process Multiple URLs to Markdown

To convert a list of URLs from a CSV file to Markdown format:

sitemap-markitdown process-csv \
    --input-csv urls.csv \
    --output-folder markdown_output \
    --output-csv processing_report.csv

The CSV file should contain a 'loc' column with the URLs to process.
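If the URLs come from somewhere other than the `parse` command, a suitable input file can be written with the standard library. The sketch below is only an illustration: the helper name `write_url_csv` is hypothetical and not part of the package, and the only requirement taken from this README is the 'loc' column.

```python
import csv


def write_url_csv(urls, path):
    """Write URLs to a CSV file with a single 'loc' column (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["loc"])
        writer.writeheader()
        for url in urls:
            writer.writerow({"loc": url})
```

Calling `write_url_csv(["https://example.com/", "https://example.com/about"], "urls.csv")` should produce a file that can be passed to `--input-csv`.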

How to Customize Output Formats

For JSON output with pretty printing:

sitemap-markitdown parse sitemap.xml --format json

For CSV output with all metadata:

sitemap-markitdown parse sitemap.xml --format csv

📖 Reference

CLI Commands

`parse`

Parse sitemap from file or URL.

Arguments:

source: URL or file path to sitemap

Options:

--output, -o: Output file path
--format, -f: Output format (json/csv)
--llm-model: LLM model for processing

`process-csv`

Convert URLs from CSV to Markdown.

Options:

--input-csv: Path to input CSV file (required)
--output-folder: Folder for Markdown files (default: "outputs")
--output-csv: Path for processing report CSV
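To call `process-csv` from a Python script, one option is to build the argument list and hand it to `subprocess`. This is a sketch, not part of the package: the function names are hypothetical, only the flags documented above are used, and it assumes the CLI exits with a non-zero status on failure.

```python
import subprocess


def build_process_csv_command(input_csv, output_folder="outputs", output_csv=None):
    """Build the argument list for `sitemap-markitdown process-csv` (hypothetical helper)."""
    cmd = [
        "sitemap-markitdown", "process-csv",
        "--input-csv", str(input_csv),
        "--output-folder", str(output_folder),
    ]
    # The report CSV is optional, so only add the flag when a path is given
    if output_csv is not None:
        cmd += ["--output-csv", str(output_csv)]
    return cmd


def run_process_csv(input_csv, output_folder="outputs", output_csv=None):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    cmd = build_process_csv_command(input_csv, output_folder, output_csv)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.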

Output Formats

JSON Format

[
  {
    "loc": "https://example.com/page",
    "lastmod": "2024-01-23",
    "changefreq": "daily",
    "priority": "0.8"
  }
]
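The JSON output can be post-processed with the standard library. The sketch below filters entries by `lastmod`; the function name `modified_since` is hypothetical, and it assumes `lastmod` starts with a `YYYY-MM-DD` date, as in the sample above. Entries without a usable date are skipped.

```python
import json
from datetime import date


def modified_since(entries, cutoff):
    """Return entries whose lastmod date is on or after cutoff (hypothetical helper)."""
    result = []
    for entry in entries:
        lastmod = entry.get("lastmod")
        if not lastmod:
            continue
        try:
            # Only the date part is compared; any time component is ignored
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            continue
        if modified >= cutoff:
            result.append(entry)
    return result


sample = json.loads("""
[
  {"loc": "https://example.com/page", "lastmod": "2024-01-23",
   "changefreq": "daily", "priority": "0.8"}
]
""")
recent = modified_since(sample, date(2024, 1, 1))
print([entry["loc"] for entry in recent])
```

Note that `priority` is a string in the sample, so convert it with `float()` before comparing values numerically.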

CSV Format

Contains columns:

loc
lastmod
changefreq
priority
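Reading the CSV output back is a matter of a few lines with `csv.DictReader`. This sketch assumes the file starts with a header row naming the columns listed above; `read_locs` is a hypothetical helper, not a function of the package.

```python
import csv
import io


def read_locs(csv_file):
    """Return the non-empty 'loc' values from an open CSV file (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file
    if reader.fieldnames is None or "loc" not in reader.fieldnames:
        raise ValueError("CSV must contain a 'loc' column")
    return [row["loc"] for row in reader if row["loc"]]


sample = io.StringIO(
    "loc,lastmod,changefreq,priority\n"
    "https://example.com/page,2024-01-23,daily,0.8\n"
)
print(read_locs(sample))
```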

Dependencies

Core dependencies:

click: CLI interface
lxml: XML processing
markitdown: Web to Markdown conversion
tqdm: Progress bars

Optional dependencies:

openai: Enhanced processing capabilities

Development dependencies:

pytest: Testing framework
pytest-cov: Code coverage reporting

🤔 Explanation

Project Architecture

The toolkit is built with modularity in mind:

SitemapParser: Core component for XML parsing
- Handles both local and remote sitemaps
- Extracts metadata using XPath
- Validates against sitemap schema
CLI Interface: Built with Click
- Provides intuitive command structure
- Handles errors gracefully
- Supports multiple output formats
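To illustrate what XPath-based metadata extraction looks like, here is a minimal sketch using the standard library's `xml.etree.ElementTree` instead of lxml. It is not the package's implementation: `extract_entries` is a hypothetical function, and the only facts relied on are the standard sitemap namespace and the four fields shown in the output formats.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def extract_entries(xml_text):
    """Return one dict per <url> element, keeping only the fields that are present."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for field in FIELDS:
            value = url.findtext(f"sm:{field}", namespaces=NS)
            if value is not None:
                entry[field] = value.strip()
        entries.append(entry)
    return entries


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page</loc>
    <lastmod>2024-01-23</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""
print(extract_entries(sample))
```

The namespace mapping matters: without it, `findall("url")` matches nothing, because every element in a sitemap lives in the sitemap namespace.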

Why Use Sitemap Parsing?

Sitemaps are essential for:

SEO optimization
Content discovery
Site structure analysis
Bulk content processing

This toolkit simplifies these tasks by providing:

Automated parsing
Flexible output formats
Bulk processing capabilities
Progress tracking

Design Decisions

Command Structure
- Separate commands for parsing and processing
- Consistent option naming
- Progress indicators for long operations
Output Formats
- JSON for programmatic use
- CSV for spreadsheet compatibility
- Markdown for content preservation
Error Handling
- Graceful failure modes
- Detailed error reporting
- Progress preservation in long operations
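The error-handling goals above can be sketched as a loop that records one report row per URL and keeps going when a single page fails. This is an illustration under stated assumptions, not the package's code: `process_urls`, the `convert` callable, and the report columns `status` and `detail` are all hypothetical.

```python
import csv


def process_urls(urls, convert, report_file):
    """Convert each URL, writing one report row per URL as soon as it is known."""
    writer = csv.DictWriter(report_file, fieldnames=["loc", "status", "detail"])
    writer.writeheader()
    for url in urls:
        try:
            markdown = convert(url)  # hypothetical callable returning Markdown text
            row = {"loc": url, "status": "ok", "detail": f"{len(markdown)} characters"}
        except Exception as exc:
            # One failing page should not abort the whole run
            row = {"loc": url, "status": "error", "detail": f"{type(exc).__name__}: {exc}"}
        writer.writerow(row)
        # Flush after every row so an interrupted run still leaves a usable report
        report_file.flush()
```

Writing and flushing the report row by row is one way to preserve progress: if a long run is interrupted, the rows already written show which URLs were handled.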

Version 0.1.0, released Jan 23, 2025
