Ultimate Sitemap Parser
Website sitemap parser for Python 3.5+.
Features
- Supports all sitemap formats
- Field-tested with ~1 million URLs as part of the Media Cloud project
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in robots.txt
- Uses fast, memory-efficient Expat XML parsing
- Doesn’t consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested
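To illustrate the Expat-based streaming approach mentioned in the feature list, here is a minimal sketch that uses Python's standard xml.parsers.expat module to pull <loc> values out of an XML sitemap. It is not the library's internal implementation; the function name extract_locs and the sample document are assumptions made for this example.

```python
import xml.parsers.expat


def extract_locs(sitemap_xml: bytes) -> list:
    """Return the text of every <loc> element in an XML sitemap."""
    urls = []
    buffer = []  # character data collected for the current <loc>
    inside_loc = False

    def start_element(name, attrs):
        nonlocal inside_loc
        if name == "loc":
            inside_loc = True
            buffer.clear()

    def end_element(name):
        nonlocal inside_loc
        if name == "loc":
            inside_loc = False
            urls.append("".join(buffer).strip())

    def char_data(data):
        # Expat may deliver one text node in several chunks, so accumulate.
        if inside_loc:
            buffer.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(sitemap_xml, True)  # True marks this as the final chunk
    return urls


# Hypothetical sample sitemap, used only for this sketch.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_locs(SAMPLE))
```

Because the handlers see one event at a time, only the current <loc> text is held in the buffer while parsing, rather than a full document tree.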
Installation
pip install ultimate_sitemap_parser
Usage
    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
    print(tree)
sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represents the sitemap hierarchy found on the website; see the reference documentation for the AbstractSitemap subclasses.
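As a rough sketch of what walking such a tree could look like, the helper below counts a node and all of its descendants. It assumes that index-type sitemap objects expose their children through a sub_sitemaps attribute and treats any node without that attribute as a leaf. The count_sitemaps name and the stand-in objects are hypothetical, so the example runs without network access or the library installed.

```python
from types import SimpleNamespace


def count_sitemaps(node):
    """Count a sitemap node plus all of its descendants.

    Assumes index-type nodes expose their children as `sub_sitemaps`;
    nodes without that attribute are treated as leaves.
    """
    children = getattr(node, "sub_sitemaps", [])
    return 1 + sum(count_sitemaps(child) for child in children)


# Hypothetical stand-in tree: one index sitemap with two leaf sitemaps.
fake_tree = SimpleNamespace(
    url="https://example.com/sitemap_index.xml",
    sub_sitemaps=[
        SimpleNamespace(url="https://example.com/sitemap_1.xml"),
        SimpleNamespace(url="https://example.com/sitemap_2.xml"),
    ],
)

total = count_sitemaps(fake_tree)
print(total)
```

With a real tree, the same function would be called on the object returned by sitemap_tree_for_homepage(), provided the attribute name assumed here matches the installed version.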
To list all the pages found in all of the website's sitemaps, use the all_pages() method:
    # all_pages() returns an Iterator
    for page in tree.all_pages():
        print(page)
The all_pages() method returns an iterator that yields SitemapPage objects; see the reference documentation for SitemapPage.
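Since all_pages() yields pages lazily, the results can be filtered without collecting everything first. The sketch below keeps only the URLs under a given path prefix. It assumes each page object has a url attribute; the urls_under_path helper and the FakePage stand-in are hypothetical and exist only so the example runs on its own.

```python
from collections import namedtuple
from urllib.parse import urlparse


def urls_under_path(pages, path_prefix):
    """Yield page URLs whose path starts with path_prefix.

    `pages` is any iterable of objects with a `url` attribute
    (assumed here to match what tree.all_pages() yields).
    """
    for page in pages:
        if urlparse(page.url).path.startswith(path_prefix):
            yield page.url


# Stand-in for SitemapPage so the sketch runs without network access.
FakePage = namedtuple("FakePage", ["url"])
sample_pages = [
    FakePage("https://example.com/news/a.html"),
    FakePage("https://example.com/about"),
    FakePage("https://example.com/news/b.html"),
]

news_urls = list(urls_under_path(sample_pages, "/news/"))
print(news_urls)
```

In real use, sample_pages would be replaced with tree.all_pages(), and the generator would be consumed one item at a time.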