Skip to main content

Parse robots.txt files and find indexed urls

Project description

robotsparser

Python library that parses robots.txt files

Functionalities

  • Automatically discover all sitemap files
  • Unzip gziped files
  • Fetch all URLs from sitemaps

Install

pip install robotsparser

Usage

from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
# To initiate the crawl of sitemaps and indexed urls. sitemap_crawl_limit argument is optional
rb.read(fetch_sitemap_urls=True, sitemap_url_crawl_limit=5)

# Show information
rb.get_sitemap_indexes() # returns sitemap indexes
rb.get_sitemaps() # returns sitemaps
rb.get_urls() # returns a list of all urls

Crawl in the background using thread

Crawl in the background and output new entries to file

This is useful for sites where sitemaps are heavily nested and take a long time to crawl

from robotsparser.parser import Robotparser
import threading

if __name__ == '__main__':
    robots_url = "https://www.example.com/robots.txt"
    rb = Robotparser(url=robots_url, verbose=False)

    sitemap_crawl_proc = threading.Thread(target = rb.read, kwargs = {'fetch_sitemap_urls': False}, daemon=True)
    sitemap_crawl_proc.start()

    while sitemap_crawl_proc.is_alive():
        time.sleep(1)
        print(f"entries_count: {len(rb.get_sitemap_entries())}, indexes: {len(rb.get_sitemap_indexes())}")
        # any logic here to get object data

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

robotsparser-0.0.12.tar.gz (5.6 kB view hashes)

Uploaded Source

Built Distribution

robotsparser-0.0.12-py3-none-any.whl (5.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page