Skip to main content

Python SDK for WebCrawler API

Project description

WebCrawler API Python SDK

A Python SDK for interacting with the WebCrawlerAPI.

In order to us API you have to get an API key from WebCrawlerAPI

Installation

pip install webcrawlerapi

Usage

Crawling

from webcrawlerapi import WebCrawlerAPI

# Initialize the client
crawler = WebCrawlerAPI(api_key="your_api_key")

# Synchronous crawling (blocks until completion)
job = crawler.crawl(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False,
    max_polls=100  # Optional: maximum number of status checks. Use higher for bigger websites
)
print(f"Job completed with status: {job.status}")

# Access job items and their content
for item in job.job_items:
    print(f"Page title: {item.title}")
    print(f"Original URL: {item.original_url}")
    print(f"Item status: {item.status}")
    
    # Get the content based on job's scrape_type
    # Returns None if item is not in "done" status
    content = item.content
    if content:
        print(f"Content length: {len(content)}")
        print(f"Content preview: {content[:200]}...")
    else:
        print("Content not available or item not done")

# Access job items and their parent job
for item in job.job_items:
    print(f"Item URL: {item.original_url}")
    print(f"Parent job status: {item.job.status}")
    print(f"Parent job URL: {item.job.url}")

# Or use asynchronous crawling
response = crawler.crawl_async(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False
)

# Get the job ID from the response
job_id = response.id
print(f"Crawling job started with ID: {job_id}")

# Check job status and get results
job = crawler.get_job(job_id)
print(f"Job status: {job.status}")

# Access job details
print(f"Crawled URL: {job.url}")
print(f"Created at: {job.created_at}")
print(f"Number of items: {len(job.job_items)}")

# Cancel a running job if needed
cancel_response = crawler.cancel_job(job_id)
print(f"Cancellation response: {cancel_response['message']}")

Scraping

# Returns structured data directly
response = crawler.scrape(
    "url": "https://webcrawlerapi.com"
)
if response.success:
    print(response.markdown)
else:
    print(f"Code: {response.error_code} Error: {response.error_message}")

API Methods

crawl()

Starts a new crawling job and waits for its completion. This method will continuously poll the job status until:

  • The job reaches a terminal state (done, error, or cancelled)
  • The maximum number of polls is reached (default: 100)
  • The polling interval is determined by the server's recommended_pull_delay_ms or defaults to 5 seconds

crawl_async()

Starts a new crawling job and returns immediately with a job ID. Use this when you want to handle polling and status checks yourself, or when using webhooks.

get_job()

Retrieves the current status and details of a specific job.

cancel_job()

Cancels a running job. Any items that are not in progress or already completed will be marked as canceled and will not be charged.

scrape()

Scrapes a single URL and returns the markdown, cleaned or raw content, page status code and page title.

Scrape Params

Read more in API Docs

  • url (required): The URL to scrape.
  • output_format (required): The format of the output. Can be "markdown", "cleaned" or "raw"

Parameters

Crawl Methods (crawl and crawl_async)

  • url (required): The seed URL where the crawler starts. Can be any valid URL.
  • scrape_type (default: "html"): The type of scraping you want to perform. Can be "html", "cleaned", or "markdown".
  • items_limit (default: 10): Crawler will stop when it reaches this limit of pages for this job.
  • webhook_url (optional): The URL where the server will send a POST request once the task is completed.
  • allow_subdomains (default: False): If True, the crawler will also crawl subdomains.
  • whitelist_regexp (optional): A regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
  • blacklist_regexp (optional): A regular expression to blacklist URLs. URLs that match the pattern will be skipped.
  • max_polls (optional, crawl only): Maximum number of status checks before returning (default: 100)

Responses

CrawlAsync Response

The crawl_async() method returns a CrawlResponse object with:

  • id: The unique identifier of the created job

Job Response

The Job object contains detailed information about the crawling job:

  • id: The unique identifier of the job
  • org_id: Your organization identifier
  • url: The seed URL where the crawler started
  • status: The status of the job (new, in_progress, done, error)
  • scrape_type: The type of scraping performed
  • created_at: The date when the job was created
  • finished_at: The date when the job was finished (if completed)
  • webhook_url: The webhook URL for notifications
  • webhook_status: The status of the webhook request
  • webhook_error: Any error message if the webhook request failed
  • job_items: List of JobItem objects representing crawled pages
  • recommended_pull_delay_ms: Server-recommended delay between status checks

JobItem Properties

Each JobItem object represents a crawled page and contains:

  • id: The unique identifier of the item
  • job_id: The parent job identifier
  • job: Reference to the parent Job object
  • original_url: The URL of the page
  • page_status_code: The HTTP status code of the page request
  • status: The status of the item (new, in_progress, done, error)
  • title: The page title
  • created_at: The date when the item was created
  • cost: The cost of the item in $
  • referred_url: The URL where the page was referred from
  • last_error: Any error message if the item failed
  • error_code: The error code if the item failed (if available)
  • content: The page content based on the job's scrape_type (html, cleaned, or markdown). Returns None if the item's status is not "done" or if content is not available. Content is automatically fetched and cached when accessed.
  • raw_content_url: URL to the raw content (if available)
  • cleaned_content_url: URL to the cleaned content (if scrape_type is "cleaned")
  • markdown_content_url: URL to the markdown content (if scrape_type is "markdown")

Requirements

  • Python 3.6+
  • requests>=2.25.0

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webcrawlerapi-2.0.3.tar.gz (10.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

webcrawlerapi-2.0.3-py3-none-any.whl (9.4 kB view details)

Uploaded Python 3

File details

Details for the file webcrawlerapi-2.0.3.tar.gz.

File metadata

  • Download URL: webcrawlerapi-2.0.3.tar.gz
  • Upload date:
  • Size: 10.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for webcrawlerapi-2.0.3.tar.gz
Algorithm Hash digest
SHA256 cec51fd8018e020a79b5f6eb93f4c0dc2519dce6d82678437b0adfd02c416ca6
MD5 3b62db7b4af4afa912dcf3706645bc8b
BLAKE2b-256 4c2f1db3da9e722fb7b6c4cabd6eb88a7dbe4e2836ec3d01d2938244732f4ddd

See more details on using hashes here.

File details

Details for the file webcrawlerapi-2.0.3-py3-none-any.whl.

File metadata

  • Download URL: webcrawlerapi-2.0.3-py3-none-any.whl
  • Upload date:
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for webcrawlerapi-2.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 e61b1c46011da3fed066b82bf644e6f85465398e5b5c0011a613a4da037ff659
MD5 8842f8cfde10bacd50fe647a6089c00d
BLAKE2b-256 69d09e1817bb06620fb68469594f98eddb83c1870b35a95148a24c9a191de4a4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page