Parse robots.txt files and find indexed URLs.
# robotsparser

A Python library that parses robots.txt files.
## Functionalities

- Automatically discover all sitemap files
- Unzip gzipped sitemap files
- Fetch all URLs from sitemaps
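Sitemaps referenced from robots.txt are frequently served gzip-compressed (e.g. `sitemap.xml.gz`). A minimal standard-library sketch of the decompress-and-extract step that the second and third features imply; the sample XML and the `extract_urls` helper are illustrative, not robotsparser internals:

```python
import gzip
import xml.etree.ElementTree as ET

# Illustrative sitemap payload (in practice this comes from an HTTP response body)
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/page-1</loc></url>
  <url><loc>https://www.example.com/page-2</loc></url>
</urlset>"""

def extract_urls(raw: bytes) -> list[str]:
    """Decompress a gzipped sitemap if needed, then collect all <loc> entries."""
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.iterfind(".//sm:loc", ns)]

compressed = gzip.compress(SITEMAP_XML)
print(extract_urls(compressed))
# ['https://www.example.com/page-1', 'https://www.example.com/page-2']
```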
## Install

```shell
pip install robotsparser
```
## Usage

```python
from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)

# Initiate the crawl of sitemaps and indexed URLs.
# The sitemap_url_crawl_limit argument is optional.
rb.read(fetch_sitemap_urls=True, sitemap_url_crawl_limit=5)

# Show information
rb.get_sitemap_indexes()  # returns sitemap indexes
rb.get_sitemaps()         # returns sitemaps
rb.get_urls()             # returns a list of all URLs
```
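The getters above return plain Python lists, so crawl results can be post-processed with the standard library alone. A sketch that groups URLs by their first path segment; the hard-coded `urls` list stands in for a real `rb.get_urls()` result:

```python
from collections import Counter
from urllib.parse import urlparse

# Sample data standing in for rb.get_urls(); a real crawl returns a flat URL list
urls = [
    "https://www.example.com/blog/post-1",
    "https://www.example.com/blog/post-2",
    "https://www.example.com/shop/item-9",
]

def count_by_section(urls: list[str]) -> Counter:
    """Tally URLs by their first path segment, e.g. 'blog' or 'shop'."""
    sections = (urlparse(u).path.strip("/").split("/")[0] for u in urls)
    return Counter(sections)

print(count_by_section(urls))  # Counter({'blog': 2, 'shop': 1})
```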