A tool to crawl website sitemaps and create a CSV report of URLs and their metadata
🗺️ Sitemap Harvester
🚀 A blazingly fast Python tool to harvest URLs and metadata from website sitemaps like a digital archaeologist!
🚀 Quick Start
Installation
pip install sitemap-harvester
Basic Usage
# Harvest a website's sitemap
sitemap-harvester --url https://example.com
# Custom output file and timeout
sitemap-harvester --url https://example.com --output my_data.csv --timeout 15
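Once a report exists, it can be inspected with Python's standard `csv` module. The sketch below is illustrative only: the helper name `summarize_report` and the sample columns `url` and `title` are assumptions, since the exact column names written by the tool are not listed here.

```python
import csv
import io


def summarize_report(csv_file):
    """Return (column names, row count) for a CSV report opened in text mode."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)  # consuming the reader also populates reader.fieldnames
    return list(reader.fieldnames or []), len(rows)


# Hypothetical sample standing in for my_data.csv; real column names may differ.
sample = io.StringIO(
    "url,title\n"
    "https://example.com/,Home\n"
    "https://example.com/about,About\n"
)
columns, count = summarize_report(sample)
print(columns, count)
```

For a real report, pass a file opened with `open("my_data.csv", newline="", encoding="utf-8")` instead of the in-memory sample.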
🎯 What Gets Extracted?
- 📝 Page Title - The main title of each page
- 📄 Meta Description - SEO descriptions
- 🏷️ Keywords - Meta keywords (if present)
- 👤 Author - Page author information
- 🔗 Canonical URL - Canonical link references
- 🖼️ Open Graph Data - Social media metadata
- 🌐 Custom Meta Tags - Any additional meta information
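To make the list above concrete, here is a minimal sketch of how such fields can be pulled out of a page using only the standard library. It is not the tool's actual implementation: the class `MetaExtractor`, the function `extract_metadata`, and the output key names are all hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the title, meta tags, and canonical link from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard tags use name=..., Open Graph tags use property=...
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.meta["canonical"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.meta["title"] = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Parse an HTML string and return a dict of the metadata found."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical sample page used only for illustration.
sample_html = (
    "<html><head>"
    "<title>Example Page</title>"
    '<meta name="description" content="A sample page.">'
    '<meta property="og:title" content="Example OG Title">'
    '<link rel="canonical" href="https://example.com/page">'
    "</head><body></body></html>"
)
metadata = extract_metadata(sample_html)
print(metadata)
```

Keying meta tags by their `name` or `property` attribute means custom tags are captured without listing them in advance.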
💡 Pro Tips
- Use --timeout for slower websites or large sitemaps
- The tool automatically deduplicates URLs for you
- Check the console output for real-time progress updates
- Large sitemaps? Grab a coffee ☕ and let it work its magic!
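The deduplication mentioned in the tips can be pictured with a short sketch. This is an assumption about the general approach, not the tool's own code: the function `unique_urls` is hypothetical, and it handles only a single `<urlset>` document using the standard sitemap namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def unique_urls(sitemap_xml):
    """Return the <loc> URLs from a sitemap, first occurrence only, in order."""
    root = ET.fromstring(sitemap_xml)
    seen = set()
    urls = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical sitemap with one repeated entry.
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "<url><loc>https://example.com/</loc></url>"
    "</urlset>"
)
deduped = unique_urls(sample_sitemap)
print(deduped)
```

A set tracks what has been seen while a list preserves the original sitemap order.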
🤝 Contributing
Found a bug? Have a feature request? Contributions are welcome! Feel free to open an issue or submit a pull request.
📜 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Happy harvesting! 🌾