Parse robots.txt files and find indexed URLs
robotsparser
A Python library that parses robots.txt files.
Functionalities
- Automatically discover all sitemap files
- Unzip gzipped files
- Fetch all URLs from sitemaps
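For context, the first two steps can be sketched with nothing but the Python standard library. This is an illustration of what discovery and decompression involve, not the library's internals; the example.com URL and the gzip magic-byte check are assumptions for the demo:

import gzip
import urllib.robotparser
from urllib.request import urlopen

# Read robots.txt and list its Sitemap: entries (site_maps() needs Python 3.8+)
rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()

for sitemap_url in rp.site_maps() or []:
    raw = urlopen(sitemap_url).read()
    # Sitemaps are often served gzipped; fall back to plain XML otherwise
    xml = gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw
    print(sitemap_url, len(xml), "bytes of XML")

robotsparser automates this pattern behind a single interface, including following sitemap index files (see get_sitemap_indexes() below).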
Install
pip install robotsparser
Usage
from robotsparser.parser import Robotparser
robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
# Initiate the crawl of sitemaps and indexed URLs; the sitemap_url_crawl_limit argument is optional
rb.read(fetch_sitemap_urls=True, sitemap_url_crawl_limit=5)
# Show information
rb.get_sitemap_indexes() # returns sitemap indexes
rb.get_sitemaps() # returns sitemaps
rb.get_urls() # returns a list of all URLs
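Since get_urls() returns a plain list (per the comment above), the results can be post-processed directly. A small follow-up sketch; the urls.txt path is an arbitrary choice:

# Persist the crawled URLs, one per line
with open("urls.txt", "w") as f:
    f.write("\n".join(rb.get_urls()))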
Download files
Source Distribution

robotsparser-0.0.6.tar.gz (5.2 kB)

Built Distribution

robotsparser-0.0.6-py3-none-any.whl
Hashes for robotsparser-0.0.6-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 834c7c7b91a0728ff74f5547da85f0abfecf4cd52acb2952957b6247925e9dd3
MD5 | 0886fed52b459ad335f906ac5a2dd976
BLAKE2b-256 | ed3fa5c8da0bb0405a7b80eb2203ef90fa1baa59fdd92e8686ce444247abd241