Parse robots.txt files and find indexed URLs
robotsparser
A Python library that parses robots.txt files
Functionalities
- Automatically discover all sitemap files
- Unzip gzipped files
- Fetch all URLs from sitemaps
Install
pip install robotsparser
Usage
from robotsparser.parser import Robotparser
robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
# Crawl the sitemaps and collect indexed URLs (the sitemap_crawl_limit argument is optional)
rb.read(sitemap_crawl_limit=5)
# Inspect the results
rb.get_urls() # Returns a list of all URLs
rb.get_sitemaps() # Returns all sitemap locations
rb.get_sitemap_entries() # Returns all sitemap indexes that contain URLs
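As a follow-up, here is a minimal sketch of post-processing the crawl results. It assumes only what the calls above show (get_urls() returning a list of URL strings, get_sitemaps() returning sitemap locations); the output filename is chosen purely for illustration.
# Deduplicate the crawled URLs and write them to a plain-text file
urls = sorted(set(rb.get_urls()))
print(f"Collected {len(urls)} unique URLs from {len(rb.get_sitemaps())} sitemaps")
with open("example_com_urls.txt", "w") as f:  # illustrative filename
    f.write("\n".join(urls))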