The official Python SDK for the Website Crawler API

Project description

website_crawler_sdk

A Python SDK for WebsiteCrawler.org that simplifies crawling tasks via the API. Submit URLs, monitor crawl status, and retrieve structured data in JSON format, ready for ingestion into LLMs.

To use the API, get a free API key from the WebsiteCrawler.org settings page (log in, then open the settings page).
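
Rather than hard-coding the key in a script, one option is to read it from an environment variable. The sketch below assumes a variable named WEBSITECRAWLER_API_KEY; that name and the load_api_key helper are illustrative choices and are not part of the SDK.

```python
import os


def load_api_key(env_var: str = "WEBSITECRAWLER_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical convention, not something the SDK defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key
```

The returned string can then be passed to WebsiteCrawlerConfig in place of a hard-coded key.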

Features

  • Trigger crawl jobs remotely
  • Monitor crawling status in real time
  • Access current URLs being crawled
  • Fetch crawl output as raw JSON
  • Respect API wait times dynamically
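
Since the crawl output arrives as JSON, a small helper can store it on disk for later processing. The sketch below assumes the data may be either a JSON string or an already parsed object, because this page does not say which form the SDK returns; normalize_crawl_data and save_crawl_data are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any


def normalize_crawl_data(data: Any) -> Any:
    """Parse `data` if it is a JSON string; otherwise return it unchanged."""
    if isinstance(data, (str, bytes, bytearray)):
        return json.loads(data)
    return data


def save_crawl_data(data: Any, path: str) -> Path:
    """Write crawl data to `path` as indented UTF-8 JSON and return the path."""
    out = Path(path)
    out.write_text(
        json.dumps(normalize_crawl_data(data), indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return out
```

Parsing before writing means a malformed response raises an error immediately instead of being saved unnoticed.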

Installation

Install the package from PyPI with pip:

pip install website_crawler_sdk
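
To confirm that the installation succeeded, the installed version can be looked up with the standard library. This is a minimal sketch; installed_version is a hypothetical helper, and it assumes the distribution is registered under the name used in the pip command.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "website_crawler_sdk") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_version() returns a version string when the package is present and None otherwise.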

Demo script

The following script submits a URL to WebsiteCrawler.org via its API, prints the URLs being processed by the crawler in real time, and retrieves the JSON data for the crawled website.

import time
from website_crawler_sdk import WebsiteCrawlerConfig, WebsiteCrawlerClient

"""
Author: Pramod Choudhary (websitecrawler.org)
Version: 1.1
Date: July 10, 2025
"""

# Replace with your actual API key, target URL, and limit
YOUR_API_KEY = "YOUR_API_KEY"  # Your API key goes here
URL = "YOUR_URL"  # Enter a non-redirecting URL/domain, including https:// or http://
LIMIT = 10  # Placeholder value; replace with your own crawl limit

def main():
    cfg = WebsiteCrawlerConfig(YOUR_API_KEY)
    client = WebsiteCrawlerClient(cfg)

    # Submit the URL and limit to WebsiteCrawler.org for crawling
    client.submit_url_to_website_crawler(URL, LIMIT)

    while True:
        task_status = client.get_task_status()  # Start retrieving data if task_status is true
        print(f"{task_status} << task status")
        time.sleep(2)  # Wait 2 seconds between polls

        if not task_status:
            break

        status = client.get_crawl_status()  # Gets the crawl status
        current_url = client.get_current_url()  # Gets the URL currently being crawled
        data = client.get_crawl_data()  # Gets the structured data once crawling has completed

        if status:
            print(f"Crawl status:: {status}")

        if status == "Crawling":  # "Crawling" is one of the status values
            print(f"Current URL:: {current_url}")

        if status == "Completed!":  # "Completed!" (with exclamation mark) is another status value
            print("Task has been completed... closing the loop and getting the data...")
            if data:
                print(f"JSON Data:: {data}")
                time.sleep(20)  # Give extra time for large JSON response
                break

    print("Job over")

if __name__ == "__main__":
    main()
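
The loop in the demo script keeps polling until the service reports completion or the task status turns false. A variation with an upper time bound might look like the following sketch. It relies only on the client methods used in the demo (get_task_status, get_crawl_status, get_crawl_data); the wait_for_crawl_data helper, its parameters, and the default timeout are assumptions and not part of the SDK.

```python
import time
from typing import Any, Optional


def wait_for_crawl_data(
    client: Any,
    poll_interval: float = 2.0,
    timeout: float = 600.0,
) -> Optional[Any]:
    """Poll `client` until crawl data is available, the task stops, or time runs out.

    `client` is expected to expose get_task_status(), get_crawl_status() and
    get_crawl_data(), as used in the demo script. Returns the crawl data, or
    None if the task ended or the timeout elapsed first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not client.get_task_status():
            return None  # Task is no longer active
        if client.get_crawl_status() == "Completed!":
            data = client.get_crawl_data()
            if data:
                return data
        time.sleep(poll_interval)
    return None  # Timed out before data became available
```

Because the helper only calls methods on the object it receives, it can be exercised with a stand-in client in tests before being pointed at a real WebsiteCrawlerClient.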

Download files

Source Distribution

website_crawler_sdk-0.1.1.tar.gz (3.7 kB)

Uploaded Source

Built Distribution

website_crawler_sdk-0.1.1-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file website_crawler_sdk-0.1.1.tar.gz.

File metadata

  • Download URL: website_crawler_sdk-0.1.1.tar.gz
  • Upload date:
  • Size: 3.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for website_crawler_sdk-0.1.1.tar.gz
  • SHA256: 05def3dfc94e36b06f5dd6460bb5cdfa239cf077aab9c39a9eda0e7a30054f8d
  • MD5: a8b3601199d48c655a98ae8fdcb40eb5
  • BLAKE2b-256: 7c3f2c4d5c5ee46af96f1e6c949e5967e4c9a428a3d6f1dc3fb9b48224016ba6

File details

Details for the file website_crawler_sdk-0.1.1-py3-none-any.whl.

File hashes

Hashes for website_crawler_sdk-0.1.1-py3-none-any.whl
  • SHA256: 173704894260c934aef90c1501fcf0eba91ba433fa0cf715d58aca969833cb1d
  • MD5: fcf11010c7985acd5bdc7e6bcda3d6b7
  • BLAKE2b-256: 1d76c173fb80beed7c5000bd78b323bf3f5682b4270171946a45f81db68e55ca
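
A downloaded file can be checked against the SHA256 digests listed for release 0.1.1 using Python's hashlib module. The following is a minimal sketch; sha256_of and matches_published_hash are hypothetical helper names, and the path passed in is wherever the file was saved locally.

```python
import hashlib

# SHA256 digests as listed on this page for release 0.1.1
EXPECTED_SHA256 = {
    "website_crawler_sdk-0.1.1.tar.gz":
        "05def3dfc94e36b06f5dd6460bb5cdfa239cf077aab9c39a9eda0e7a30054f8d",
    "website_crawler_sdk-0.1.1-py3-none-any.whl":
        "173704894260c934aef90c1501fcf0eba91ba433fa0cf715d58aca969833cb1d",
}


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, filename: str) -> bool:
    """Compare a local file's SHA256 digest with the published one for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

A False result means the local file differs from the published one and should not be installed.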
