# robotsparser

Python library that parses robots.txt files and finds indexed URLs.
## Functionalities

- Automatically discover all sitemap files
- Decompress gzipped sitemap files
- Fetch all URLs from sitemaps
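Conceptually, this is what the library automates: fetch robots.txt, read its `Sitemap:` lines, then walk each sitemap, gunzipping `.gz` files before extracting `<loc>` entries. A minimal standard-library sketch of those steps (the function names are hypothetical, and this is not robotsparser's actual implementation):

```python
import gzip
import re
import urllib.request

def discover_sitemaps(robots_url):
    """Return the sitemap URLs declared in a robots.txt file."""
    with urllib.request.urlopen(robots_url) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    # robots.txt declares sitemaps as lines of the form "Sitemap: <url>"
    return [line.split(":", 1)[1].strip()
            for line in body.splitlines()
            if line.lower().startswith("sitemap:")]

def fetch_sitemap_urls(sitemap_url):
    """Fetch one sitemap, gunzip it if needed, and extract <loc> URLs."""
    with urllib.request.urlopen(sitemap_url) as resp:
        data = resp.read()
    if sitemap_url.endswith(".gz"):
        data = gzip.decompress(data)  # sitemaps are often served gzipped
    return re.findall(r"<loc>(.*?)</loc>", data.decode("utf-8", errors="replace"))
```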
## Install

```
pip install robotsparser
```
## Usage

```python
from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)

# Crawl the sitemaps and collect indexed URLs.
# The sitemap_crawl_limit argument is optional.
rb.read(sitemap_crawl_limit=5)

# Inspect the results
rb.get_urls()             # Returns a list of all URLs
rb.get_sitemaps()         # Returns all sitemap locations
rb.get_sitemap_entries()  # Returns all sitemap indexes that contain URLs
```
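The getters above return ordinary Python lists, so post-processing is plain Python. For example, filtering the crawled URLs and saving them to a file might look like this (the `/blog/` filter and output filename are illustrative, not part of the library's API):

```python
# Continuing from the snippet above: keep only blog URLs and save them.
urls = rb.get_urls()
blog_urls = [u for u in urls if "/blog/" in u]

with open("blog_urls.txt", "w") as f:
    f.write("\n".join(blog_urls))

print(f"{len(blog_urls)} of {len(urls)} URLs matched")
```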