Ultimate Sitemap Parser
Project description
Website sitemap parser for Python 3.5+.
Features
- Supports all sitemap formats
- Field-tested with ~1 million URLs as part of the Media Cloud project
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in robots.txt
- Uses fast and memory-efficient Expat XML parsing
- Doesn’t consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested
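To illustrate why Expat-style parsing keeps memory use low, here is a minimal sketch using only the standard library's xml.parsers.expat module. It is not the library's own implementation; the function name extract_sitemap_urls, the chunk_size parameter, and the sample document are hypothetical and exist only for this example. The parser is fed small chunks and only the text of the current <loc> element is buffered, so the whole document never has to be turned into an in-memory tree.

```python
import xml.parsers.expat


def extract_sitemap_urls(xml_bytes, chunk_size=4096):
    """Collect <loc> values from a sitemap by feeding Expat small chunks.

    Hypothetical helper for illustration; not part of ultimate_sitemap_parser.
    """
    urls = []
    state = {"in_loc": False, "parts": []}

    def start_element(name, attrs):
        # With namespace_separator=" ", names look like "<namespace-uri> loc".
        if name.split(" ")[-1] == "loc":
            state["in_loc"] = True
            state["parts"] = []

    def char_data(data):
        # Expat may deliver the text of one element in several pieces.
        if state["in_loc"]:
            state["parts"].append(data)

    def end_element(name):
        if name.split(" ")[-1] == "loc":
            urls.append("".join(state["parts"]).strip())
            state["in_loc"] = False

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element

    # Feed the document piece by piece, then signal the end of input.
    for offset in range(0, len(xml_bytes), chunk_size):
        parser.Parse(xml_bytes[offset:offset + chunk_size], False)
    parser.Parse(b"", True)
    return urls


# Made-up sample sitemap used only to exercise the sketch.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE, chunk_size=16):
        print(url)
```

In a real crawler the chunks would come from a network response or a decompressed file rather than from a bytes object already held in memory.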
Installation
pip install ultimate_sitemap_parser
Usage
from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
print(tree)
sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on the website; see the reference of AbstractSitemap subclasses.
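The returned hierarchy can be inspected with a short recursive walk. The sketch below is only an illustration and rests on two assumptions that should be checked against the AbstractSitemap reference: that every sitemap object exposes a url attribute, and that index sitemaps expose their children through a sub_sitemaps attribute. The helper name describe_tree is hypothetical.

```python
def describe_tree(sitemap, depth=0):
    """Return indented lines describing a sitemap and its sub-sitemaps.

    Hypothetical helper; assumes `url` on every sitemap object and
    `sub_sitemaps` on index sitemaps (verify against the reference).
    """
    indent = "  " * depth
    lines = ["{}{} ({})".format(indent, type(sitemap).__name__, sitemap.url)]
    # Objects without `sub_sitemaps` are treated as leaves.
    for child in getattr(sitemap, "sub_sitemaps", []):
        lines.extend(describe_tree(child, depth + 1))
    return lines
```

With a tree obtained as above, print("\n".join(describe_tree(tree))) would list one line per sitemap, indented by nesting level.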
If you’d like to just list all the pages found in all of the sitemaps within the website, consider using the all_pages() method:
# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)
The all_pages() method returns an iterator yielding SitemapPage objects; see the reference of SitemapPage.
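Because all_pages() yields pages lazily, they can be filtered on the fly without building a full list first. The following sketch assumes that each page object exposes url and last_modified attributes, with last_modified possibly being None; confirm both against the SitemapPage reference. The helper name urls_modified_since is hypothetical, and the cutoff must be comparable with the pages' timestamps (both timezone-aware or both naive).

```python
def urls_modified_since(pages, cutoff):
    """Yield URLs of pages last modified at or after `cutoff`.

    Hypothetical helper; assumes each page has `url` and `last_modified`
    attributes, where `last_modified` may be None.
    """
    for page in pages:
        last_modified = getattr(page, "last_modified", None)
        # Pages without a modification date are skipped.
        if last_modified is not None and last_modified >= cutoff:
            yield page.url
```

It could be called as urls_modified_since(tree.all_pages(), cutoff), where cutoff is a datetime of your choosing.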
Download files
Hashes for ultimate_sitemap_parser-0.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 9825fefcdf515e2748addc7ec5dcdb6430dfdd4ef5de4a54e39de1e7613d0ece
MD5 | 362e6e5d4b993d6e89eb4a259ccd029e
BLAKE2-256 | 214404eada3b1b1f825eb18b93e385ff652778c96902788b87a9b1e0a141ccff
Hashes for ultimate_sitemap_parser-0.5-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 806e723eeb0293c38e111822d651e987b1494ae9c08be82e73172ade667418a6
MD5 | 5479eb21fc1626a54642dc06ae9613de
BLAKE2-256 | ee58a6394d980bda84c44b442a3bab5ceb49626d01d4b17fbc7fe6d41b90c496