Parse robots.txt files and find indexed URLs

Project description

robotsparser

A Python library that parses robots.txt files and fetches the URLs listed in their sitemaps

Functionalities

  • Automatically discover all sitemap files
  • Decompress gzipped sitemap files
  • Fetch all URLs from sitemaps

Install

pip install robotsparser

Usage

from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
# Initiate the crawl of sitemaps and indexed URLs (the sitemap_url_crawl_limit argument is optional)
rb.read(fetch_sitemap_urls=True, sitemap_url_crawl_limit=5)

# Inspect the results
rb.get_sitemap_indexes() # returns the sitemap index URLs
rb.get_sitemaps() # returns the sitemap URLs
rb.get_urls() # returns a list of all crawled URLs
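
The accessors above can be combined with ordinary Python code. As a minimal sketch (assuming get_urls() returns plain URL strings; the output filename is illustrative), the crawled URLs can be written to a file:

# Write all crawled URLs to a file, one per line
with open("urls.txt", "w") as f:
    for url in rb.get_urls():
        f.write(url + "\n")

print(f"sitemaps: {len(rb.get_sitemaps())}, urls: {len(rb.get_urls())}")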

Crawl in the background using a thread

Crawl in the background and output new entries to a file

This is useful for sites whose sitemaps are heavily nested and take a long time to crawl.

from robotsparser.parser import Robotparser
import threading
import time

if __name__ == '__main__':
    robots_url = "https://www.example.com/robots.txt"
    rb = Robotparser(url=robots_url, verbose=False)

    sitemap_crawl_proc = threading.Thread(target=rb.read, kwargs={'fetch_sitemap_urls': False}, daemon=True)
    sitemap_crawl_proc.start()

    while sitemap_crawl_proc.is_alive():
        time.sleep(1)
        print(f"entries_count: {len(rb.get_sitemap_entries())}, indexes: {len(rb.get_sitemap_indexes())}")
        # any logic here to get object data
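
To actually write new entries to a file as they are discovered, the polling loop can track what it has already written. A minimal sketch, replacing the loop above; it assumes get_sitemap_entries() returns URL strings, and the output filename is illustrative:

    # Append newly discovered sitemap entries to a file while the crawl runs
    seen = set()
    with open("sitemap_entries.txt", "a") as out:
        while sitemap_crawl_proc.is_alive():
            time.sleep(1)
            new_entries = set(rb.get_sitemap_entries()) - seen
            for entry in new_entries:
                out.write(f"{entry}\n")
            out.flush()
            seen.update(new_entries)
            print(f"new: {len(new_entries)}, total entries: {len(seen)}")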

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

robotsparser-0.0.12.tar.gz (5.6 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

robotsparser-0.0.12-py3-none-any.whl (5.7 kB)


File details

Details for the file robotsparser-0.0.12.tar.gz.

File metadata

  • Download URL: robotsparser-0.0.12.tar.gz
  • Upload date:
  • Size: 5.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for robotsparser-0.0.12.tar.gz

  • SHA256: 8d6ad16e41f6d13d31d7d29c2fb9f58a7c612299bb3d2e40d697f082af545023
  • MD5: c421d9eebaf2a1e2fb4f7ef94a826bbb
  • BLAKE2b-256: c0fea88d6490d381cbd95c49e88551b97d64fd6714fd454044988afb6527113f

See more details on using hashes here.
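
To check a downloaded archive against the published digest, the SHA256 can be recomputed locally. A minimal sketch (the local file path is illustrative):

import hashlib

# Published SHA256 digest for robotsparser-0.0.12.tar.gz (listed above)
expected = "8d6ad16e41f6d13d31d7d29c2fb9f58a7c612299bb3d2e40d697f082af545023"

with open("robotsparser-0.0.12.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected else "MISMATCH")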

File details

Details for the file robotsparser-0.0.12-py3-none-any.whl.

File metadata

  • Download URL: robotsparser-0.0.12-py3-none-any.whl
  • Upload date:
  • Size: 5.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for robotsparser-0.0.12-py3-none-any.whl

  • SHA256: 00400bfa27e1d0f7d7e9bd6461ba145b7252405d9e6a595ff1930f9ca754b3c3
  • MD5: 0ad2800a218fc426793ce88ad28369a1
  • BLAKE2b-256: ce4b82dc9b5f373f2a11a71c23716fec6635d4ab03ab8f8db8953535a1c6b72a

See more details on using hashes here.
