Parse robots.txt files and find indexed URLs
Project description
robotsparser
Python library that parses robots.txt files
Functionalities
- Automatically discover all sitemap files
- Decompress gzipped sitemap files
- Fetch all URLs from sitemaps
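To make the three features above concrete, here is a minimal standard-library sketch of the underlying ideas: reading `Sitemap:` lines from robots.txt text, decompressing a gzipped payload, and pulling `<loc>` entries out of sitemap XML. It is not robotsparser's actual implementation; the function names `sitemaps_from_robots`, `maybe_gunzip`, and `locs_from_sitemap` are hypothetical and exist only for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_text):
    """Return the sitemap URLs declared by 'Sitemap:' lines in robots.txt text."""
    found = []
    for line in robots_text.splitlines():
        # Drop trailing comments, then split on the first colon only,
        # because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def maybe_gunzip(payload):
    """Decompress bytes that start with the gzip magic number; pass others through."""
    if payload[:2] == b"\x1f\x8b":
        return gzip.decompress(payload)
    return payload


def locs_from_sitemap(xml_bytes):
    """Return (kind, locs); kind is the root tag, e.g. 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_bytes)
    kind = root.tag.replace(SITEMAP_NS, "")
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs
```

A crawler built on these pieces would presumably follow `sitemapindex` entries recursively and collect the `urlset` entries as page URLs; fetching over HTTP is left out so the sketch stays self-contained.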
Install
pip install robotsparser
Usage
from robotsparser.parser import Robotparser
robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
rb.read()  # Start crawling the sitemaps and the URLs they list
# Show information
rb.get_urls()  # Returns a list of all URLs
rb.get_sitemaps()  # Returns all sitemap locations
rb.get_sitemap_entries()  # Returns all sitemap indexes that contain URLs
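Once a list of URLs is available, a common next step is to summarize it. The sketch below assumes only that the input is a plain list of URL strings, as `rb.get_urls()` is described to return; the helper `count_by_section` is hypothetical and not part of robotsparser.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_section(urls):
    """Count URLs by their first path segment ('/' for the site root)."""
    counts = Counter()
    for url in urls:
        # Keep only non-empty path segments, e.g. "/blog/post-1" -> ["blog", "post-1"]
        segments = [s for s in urlsplit(url).path.split("/") if s]
        counts[segments[0] if segments else "/"] += 1
    return counts


# Sample data standing in for the result of rb.get_urls()
sample_urls = [
    "https://www.example.com/",
    "https://www.example.com/blog/post-1",
    "https://www.example.com/blog/post-2",
    "https://www.example.com/about",
]
section_counts = count_by_section(sample_urls)
print(section_counts.most_common())
```

Grouping by the first path segment is only one possible choice; any other key derived from `urlsplit` (host, query, depth) would slot into the same loop.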
Download files
Source Distribution
robotsparser-0.0.1.tar.gz (3.5 kB)
Built Distribution
Hashes for robotsparser-0.0.1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | cf75c9b88a29a5c2a732ce811031151cc16f238e69ead9e7e0fe89c72c2388ad |
MD5 | 08236bfca34ef8311d3c5f9014de76f0 |
BLAKE2b-256 | d70dd699a9a21b3fe1685a603b60e272381f7b02b2253026007c5b2c9ac73ae7 |