
Parse robots.txt files and find indexed URLs


robotsparser

A Python library that parses robots.txt files and crawls the sitemaps they reference to collect indexed URLs.

Functionalities

  • Automatically discover all sitemap files
  • Decompress gzipped sitemap files
  • Fetch all URLs from sitemaps

Install

pip install robotsparser

Usage

from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
rb.read()  # crawl the sitemaps and collect indexed URLs

# Inspect the results
rb.get_urls()  # returns a list of all URLs found in the sitemaps
rb.get_sitemaps()  # returns all sitemap locations
rb.get_sitemap_entries()  # returns all sitemap indexes that contain URLs
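
A common follow-up is to persist the collected URLs for later processing. The following is a minimal sketch, assuming get_urls() returns plain URL strings as shown above; the output filename is illustrative.

from robotsparser.parser import Robotparser

rb = Robotparser(url="https://www.example.com/robots.txt", verbose=False)
rb.read()

# Write every discovered URL to a text file, one per line
# (indexed_urls.txt is a hypothetical output path)
with open("indexed_urls.txt", "w") as f:
    for url in rb.get_urls():
        f.write(url + "\n")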



Download files

Source distribution: robotsparser-0.0.2.tar.gz (3.7 kB)

Built distribution: robotsparser-0.0.2-py3-none-any.whl (3.8 kB, Python 3)
