robotsparser
Python library that parses robots.txt files, discovers sitemaps, and finds indexed URLs.
Features
- Automatically discover all sitemap files
- Decompress gzipped sitemap files
- Fetch all URLs from sitemaps
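Sitemap discovery boils down to scanning robots.txt for `Sitemap:` lines. As an illustration of that general technique (a stdlib sketch, not the library's internal implementation), it can be done with a single regular expression:

```python
import re

def find_sitemaps(robots_txt: str) -> list[str]:
    """Extract sitemap URLs from robots.txt content.

    robots.txt lists each sitemap on its own line as
    'Sitemap: <url>'; the key is case-insensitive.
    """
    return re.findall(r"(?im)^sitemap:\s*(\S+)", robots_txt)

robots = """User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
sitemap: https://www.example.com/sitemap-news.xml.gz
"""
print(find_sitemaps(robots))
# → ['https://www.example.com/sitemap.xml', 'https://www.example.com/sitemap-news.xml.gz']
```

The `(?im)` flags make the match case-insensitive and anchor `^` to each line, so both spellings above are picked up.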
Install
pip install robotsparser
Usage
from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
rb.read()  # Crawl the sitemaps and collect indexed URLs

# Inspect the results
rb.get_urls()  # Returns a list of all URLs
rb.get_sitemaps()  # Returns all sitemap locations
rb.get_sitemap_entries()  # Returns all sitemap indexes that contain URLs
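Under the hood, fetching URLs from a sitemap means reading its `<loc>` elements, decompressing gzipped files first. A minimal stdlib sketch of that step (illustrative only; not the library's actual code):

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_locs(xml_bytes: bytes) -> list[str]:
    """Return all <loc> values from sitemap XML, decompressing gzip if needed."""
    if xml_bytes[:2] == b"\x1f\x8b":  # gzip magic number
        xml_bytes = gzip.decompress(xml_bytes)
    root = ET.fromstring(xml_bytes)
    # Works for both <urlset> (page URLs) and <sitemapindex> (nested sitemaps)
    return [loc.text for loc in root.iter(NS + "loc")]

sitemap = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/page1</loc></url>
  <url><loc>https://www.example.com/page2</loc></url>
</urlset>"""
print(extract_locs(gzip.compress(sitemap)))
# → ['https://www.example.com/page1', 'https://www.example.com/page2']
```

Because `<sitemapindex>` files use the same `<loc>` tag to point at child sitemaps, the same extraction applies recursively when crawling an index.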