SDK for interacting with WebsiteCrawler.org

Project description

website_crawler_sdk

A Python SDK for interacting with WebsiteCrawler.org, designed to simplify crawling tasks via API. Submit URLs, monitor crawling status, and retrieve structured data with ease.

To use the API, first get an API key from WebsiteCrawler.org.
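
Once you have a key, one common way to keep it out of source code is to read it from an environment variable. The following is a minimal sketch, not part of the SDK: the variable name `WEBSITE_CRAWLER_API_KEY` and the helper `load_api_key` are assumptions made for this example.

```python
import os

# Hypothetical variable name chosen for this example; the SDK does not define it.
API_KEY_ENV_VAR = "WEBSITE_CRAWLER_API_KEY"


def load_api_key(env=None):
    """Return the API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {API_KEY_ENV_VAR} to your WebsiteCrawler.org API key"
        )
    return key
```

The returned string can then be passed to `WebsiteCrawlerConfig` in place of a hard-coded key.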

🔧 Features

  • Trigger crawl jobs remotely
  • Monitor crawling status in real time
  • Access current URLs being crawled
  • Fetch crawl output as raw JSON
  • Respect API wait times dynamically

📦 Installation

Install the package from PyPI with pip:

pip install website_crawler_sdk

Demo

import time
from website_crawler_sdk import WebsiteCrawlerConfig, WebsiteCrawlerClient

# Replace these placeholders with your actual API key, target URL, and limit
YOUR_API_KEY = "YOUR_API_KEY"  # Your API key goes here
URL = "URL"  # Enter a non-redirecting URL/domain, including https:// or http://
LIMIT = 10  # Example value only; change it to the limit you want

def main():
    cfg = WebsiteCrawlerConfig(YOUR_API_KEY)
    client = WebsiteCrawlerClient(cfg)

    # Submit the URL and limit to WebsiteCrawler.org for crawling
    client.submit_url_to_website_crawler(URL, LIMIT)

    while True:
        task_status = client.get_task_status()  # Data can be retrieved once task_status is truthy
        print(f"{task_status} << task status")
        time.sleep(2)  # Wait for 2 seconds between polls

        if task_status:
            status = client.get_crawl_status()  # Current crawl status
            currenturl = client.get_current_url()  # URL currently being crawled
            data = client.get_crawl_data()  # Structured data, available once crawling has completed

            print("Crawl status::")
            if status:
                print(status)

            if status == "Crawling":  # "Crawling" is one of the possible statuses
                print(f"Current URL:: {currenturl}")

            if status == "Completed!":  # "Completed!" (with the exclamation mark) is another status
                print("Task has been completed... closing the loop")
                if data:
                    print(f"JSON Data:: {data}")
                    time.sleep(20)  # Give extra time for large JSON response
                    break

    print("Job over")

if __name__ == "__main__":
    main()
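
The demo loop above runs until the crawl reports "Completed!" and has no upper time bound. A bounded variant can be sketched as a small generic helper. This is an illustration under stated assumptions, not part of the SDK: the name `wait_for_completion`, its parameters, and the default timeout are all hypothetical.

```python
import time


def wait_for_completion(get_status, done_status="Completed!", interval=2.0,
                        timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns done_status or the timeout elapses.

    Returns True if done_status was seen, False if the timeout expired first.
    The sleep and clock parameters exist so the helper can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if get_status() == done_status:
            return True
        sleep(interval)  # Pause between polls, as the demo does
    return False
```

With the `client` from the demo, something like `wait_for_completion(client.get_crawl_status)` could stand in for the `while True` loop. That assumes `get_crawl_status()` can safely be called before the task has started; the demo itself only calls it once `get_task_status()` is truthy.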

Download files

Source Distribution

website_crawler_sdk-0.1.0.tar.gz (3.6 kB view details)

Uploaded Source

Built Distribution

website_crawler_sdk-0.1.0-py3-none-any.whl (4.3 kB view details)

Uploaded Python 3

File details

Details for the file website_crawler_sdk-0.1.0.tar.gz.

File metadata

  • Download URL: website_crawler_sdk-0.1.0.tar.gz
  • Size: 3.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for website_crawler_sdk-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a79085bed988b0f692c9dfd3853ab643fcb0472a2604fabb167e6e3193038b0a
MD5 2ccd8349b4c38f263969a3d8fe8eecac
BLAKE2b-256 34765cbd27e69b76fdeb50ee0bf81ae5d4667db4dcefc1515d18f898cb36dee7
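
A downloaded file can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are made up for this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("website_crawler_sdk-0.1.0.tar.gz", "<SHA256 from the table>")` should return True for an intact download, assuming the file sits in the current directory.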

File details

Details for the file website_crawler_sdk-0.1.0-py3-none-any.whl.

File hashes

Hashes for website_crawler_sdk-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2ecf3f24a14516fc7d62556439a3d4170330eddf7b6b76e8661802cdf417e97f
MD5 3c603fa77dec8d20b2a3f02fd2f3ef4c
BLAKE2b-256 ace939701ce3dd6092e0c166909a98835b9a4798ec4fa81123a29039d0decf10