Python SDK for WebCrawler API

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

WebCrawler API Python SDK

A Python SDK for interacting with the WebCrawlerAPI.

To use the API, you need an API key from WebCrawlerAPI.

Installation

pip install webcrawlerapi

Usage

from webcrawlerapi import WebCrawlerAPI

Crawling

# Initialize the client
crawler = WebCrawlerAPI(api_key="your_api_key")

# Synchronous crawling (blocks until completion)
job = crawler.crawl(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False,
    max_polls=100  # Optional: maximum number of status checks. Use higher for bigger websites
)
print(f"Job completed with status: {job.status}")

# Access job items and their content
for item in job.job_items:
    print(f"Page title: {item.title}")
    print(f"Original URL: {item.original_url}")
    print(f"Item status: {item.status}")
    
    # Get the content based on job's scrape_type
    # Returns None if item is not in "done" status
    content = item.content
    if content:
        print(f"Content length: {len(content)}")
        print(f"Content preview: {content[:200]}...")
    else:
        print("Content not available or item not done")

# Access job items and their parent job
for item in job.job_items:
    print(f"Item URL: {item.original_url}")
    print(f"Parent job status: {item.job.status}")
    print(f"Parent job URL: {item.job.url}")

# Or use asynchronous crawling
response = crawler.crawl_async(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False
)

# Get the job ID from the response
job_id = response.id
print(f"Crawling job started with ID: {job_id}")

# Check job status and get results
job = crawler.get_job(job_id)
print(f"Job status: {job.status}")

# Access job details
print(f"Crawled URL: {job.url}")
print(f"Created at: {job.created_at}")
print(f"Number of items: {len(job.job_items)}")

# Cancel a running job if needed
cancel_response = crawler.cancel_job(job_id)
print(f"Cancellation response: {cancel_response['message']}")

Scraping

The SDK provides both synchronous and asynchronous methods for single-page scraping using custom scrapers.

# Synchronous scraping - returns structured data directly
structured_data = crawler.scrape(
    crawler_id="webcrawler/url-to-md",  # ID of the custom scraper
    input_data={
        "url": "https://example.com"  # Scraper-specific input parameters
    },
    webhook_url="https://yourserver.com/webhook",  # Optional webhook
    max_polls=20  # Optional: maximum number of status checks
)
print(structured_data)  # Direct access to scraped data

API Methods

crawl()

Starts a new crawling job and waits for it to complete. This method polls the job status until either:

The job reaches a terminal state (done, error, or cancelled)
The maximum number of polls is reached (default: 100)

The polling interval is taken from the server's recommended_pull_delay_ms, falling back to a default of 5 seconds.
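The polling behaviour described for crawl() can be sketched roughly as follows. This is not the SDK's implementation: poll_until_terminal and fetch_job are hypothetical names, and the fake job source exists only so the loop can run without an API key.

```python
import time
from types import SimpleNamespace

# Terminal states as listed in the crawl() description.
TERMINAL_STATUSES = {"done", "error", "cancelled"}
DEFAULT_DELAY_SECONDS = 5.0


def poll_until_terminal(fetch_job, max_polls=100, sleep=time.sleep):
    """Call fetch_job() until the job is terminal or max_polls is used up.

    fetch_job is any zero-argument callable returning an object with a
    `status` attribute and, optionally, `recommended_pull_delay_ms`.
    (Hypothetical helper, not part of the SDK.)
    """
    job = None
    for _ in range(max_polls):
        job = fetch_job()
        if job.status in TERMINAL_STATUSES:
            break
        delay_ms = getattr(job, "recommended_pull_delay_ms", None)
        # Fall back to 5 seconds when the server gives no recommendation.
        sleep(delay_ms / 1000 if delay_ms else DEFAULT_DELAY_SECONDS)
    return job


# Demo with a fake job source that finishes on the third status check.
_states = iter(["new", "in_progress", "done"])


def fake_fetch():
    return SimpleNamespace(status=next(_states), recommended_pull_delay_ms=10)


final = poll_until_terminal(fake_fetch, max_polls=5, sleep=lambda seconds: None)
print(final.status)  # prints "done"
```

If the poll limit runs out first, the helper returns the last job it saw, so callers should still check the status before reading results.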

crawl_async()

Starts a new crawling job and returns immediately with a job ID. Use this when you want to handle polling and status checks yourself, or when using webhooks.
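When crawl_async() is combined with webhook_url, something has to receive the POST request. Below is a minimal receiver sketch using only the standard library. The payload format is not documented here, so the code assumes (without confirmation) that the body is JSON and falls back to the raw text otherwise. parse_webhook_body, WebhookHandler and serve are illustrative names, not part of the SDK.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook_body(raw: bytes) -> dict:
    """Decode a webhook body, assuming (not confirmed) that it is JSON."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Keep whatever arrived so it can still be inspected.
        return {"raw": raw.decode("utf-8", errors="replace")}
    return payload if isinstance(payload, dict) else {"value": payload}


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_webhook_body(self.rfile.read(length))
        print(f"Webhook received: {payload}")
        # Acknowledge quickly; fetch job details elsewhere, e.g. via get_job().
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted (blocking call)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling serve() starts the listener. In production this would usually sit behind a proper web framework and verify that the request really comes from the expected sender.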

get_job()

Retrieves the current status and details of a specific job.

cancel_job()

Cancels a running job. Any items that are not in progress or already completed will be marked as canceled and will not be charged.

scrape()

Starts a new scraping job and waits for its completion. Returns the structured data directly when the scraping is done. This method will continuously poll the status until:

The scraping is completed (status: "done")
The scraping fails (status: "error")
The maximum number of polls is reached (default: 100)

get_scrape()

Retrieves the current status, metadata and results of a specific scraping job. Returns a ScrapeResult object containing both status information and structured data.

Parameters

Crawl Methods (crawl and crawl_async)

url (required): The seed URL where the crawler starts. Can be any valid URL.
scrape_type (default: "html"): The type of scraping you want to perform. Can be "html", "cleaned", or "markdown".
items_limit (default: 10): The crawler stops once it has reached this number of pages for the job.
webhook_url (optional): The URL where the server will send a POST request once the task is completed.
allow_subdomains (default: False): If True, the crawler will also crawl subdomains.
whitelist_regexp (optional): A regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
blacklist_regexp (optional): A regular expression to blacklist URLs. URLs that match the pattern will be skipped.
max_polls (optional, crawl only): Maximum number of status checks before returning (default: 100)
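To test whitelist_regexp and blacklist_regexp patterns before starting a job, a local approximation can help. The sketch below assumes the patterns are matched anywhere in the URL (re.search semantics) and that the blacklist takes precedence. Both are assumptions, and the server's actual matching rules may differ. should_crawl is a hypothetical helper.

```python
import re
from typing import Optional


def should_crawl(url: str,
                 whitelist_regexp: Optional[str] = None,
                 blacklist_regexp: Optional[str] = None) -> bool:
    """Approximate the URL filtering locally (assumed semantics)."""
    # Assumption: a blacklist match always skips the URL.
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    # Assumption: when a whitelist is set, only matching URLs are crawled.
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/?page=2",
    "https://example.com/about",
]
kept = [
    u for u in urls
    if should_crawl(u, whitelist_regexp=r"/blog/", blacklist_regexp=r"\?page=")
]
print(kept)  # prints ['https://example.com/blog/post-1']
```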

Scrape Methods (scrape and scrape_async)

crawler_id (required): The ID of the custom scraper.
input_data (required): The input data for the scraper.
webhook_url (optional): The URL where the server will send a POST request once the task is completed.
max_polls (optional, scrape only): Maximum number of status checks before returning (default: 100)

Responses

CrawlAsync Response

The crawl_async() method returns a CrawlResponse object with:

id: The unique identifier of the created job

Job Response

The Job object contains detailed information about the crawling job:

id: The unique identifier of the job
org_id: Your organization identifier
url: The seed URL where the crawler started
status: The status of the job (new, in_progress, done, error)
scrape_type: The type of scraping performed
created_at: The date when the job was created
finished_at: The date when the job was finished (if completed)
webhook_url: The webhook URL for notifications
webhook_status: The status of the webhook request
webhook_error: Any error message if the webhook request failed
job_items: List of JobItem objects representing crawled pages
recommended_pull_delay_ms: Server-recommended delay between status checks
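As an illustration of working with these fields, the helper below summarizes a job by counting its items per status. It relies only on the status and job_items attributes listed above, so it runs here against stand-in objects. summarize_job is a hypothetical name, not an SDK function.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_job(job):
    """Return a small summary dict for any object shaped like a Job."""
    counts = Counter(item.status for item in job.job_items)
    return {
        "job_status": job.status,
        "items_total": len(job.job_items),
        "items_by_status": dict(counts),
    }


# Stand-in for a Job returned by crawl() or get_job().
demo_job = SimpleNamespace(
    status="done",
    job_items=[
        SimpleNamespace(status="done"),
        SimpleNamespace(status="done"),
        SimpleNamespace(status="error"),
    ],
)
summary = summarize_job(demo_job)
print(summary)
```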

JobItem Properties

Each JobItem object represents a crawled page and contains:

id: The unique identifier of the item
job_id: The parent job identifier
job: Reference to the parent Job object
original_url: The URL of the page
page_status_code: The HTTP status code of the page request
status: The status of the item (new, in_progress, done, error)
title: The page title
created_at: The date when the item was created
cost: The cost of the item in $
referred_url: The URL where the page was referred from
last_error: Any error message if the item failed
content: The page content based on the job's scrape_type (html, cleaned, or markdown). Returns None if the item's status is not "done" or if content is not available. Content is automatically fetched and cached when accessed.
raw_content_url: URL to the raw content (if available)
cleaned_content_url: URL to the cleaned content (if scrape_type is "cleaned")
markdown_content_url: URL to the markdown content (if scrape_type is "markdown")
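The fetch-and-cache behaviour described for content can be pictured with a small stand-in class. This is a sketch, not the SDK's code: LazyItem and its fetch callable are hypothetical, and mapping the "html" scrape type to raw_content_url is an assumption.

```python
class LazyItem:
    """Stand-in showing lazy, cached content access (not the SDK class)."""

    # Assumption: "html" content is served from raw_content_url.
    URL_FIELD_BY_SCRAPE_TYPE = {
        "html": "raw_content_url",
        "cleaned": "cleaned_content_url",
        "markdown": "markdown_content_url",
    }

    def __init__(self, status, scrape_type, urls, fetch):
        self.status = status
        self.scrape_type = scrape_type
        self.urls = urls      # maps URL field name -> URL
        self._fetch = fetch   # callable taking a URL and returning text
        self._content = None

    @property
    def content(self):
        # No content unless the item finished successfully.
        if self.status != "done":
            return None
        if self._content is None:
            url = self.urls.get(self.URL_FIELD_BY_SCRAPE_TYPE[self.scrape_type])
            if not url:
                return None
            self._content = self._fetch(url)  # fetched once, then cached
        return self._content


fetched = []


def fake_fetch(url):
    fetched.append(url)
    return "# Example Domain"


item = LazyItem(
    status="done",
    scrape_type="markdown",
    urls={"markdown_content_url": "https://example.com/content.md"},
    fetch=fake_fetch,
)
first = item.content
second = item.content  # served from the cache, no second fetch
print(first, len(fetched))  # prints "# Example Domain 1"
```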

Requirements

Python 3.6+
requests>=2.25.0

License

MIT License

Release history

2.1.0: Apr 13, 2026
2.0.12: Mar 7, 2026
2.0.11: Feb 8, 2026
2.0.10: Jan 17, 2026
2.0.9: Jan 5, 2026
2.0.8: Nov 10, 2025
2.0.7: Sep 11, 2025
2.0.6: Jul 20, 2025
2.0.5: Jun 12, 2025
2.0.4: Jun 6, 2025
2.0.3: May 27, 2025
2.0.1: May 26, 2025
2.0.0: May 25, 2025
1.0.8: May 7, 2025
1.0.7: May 1, 2025
1.0.6: Apr 11, 2025
1.0.5: Jan 3, 2025
1.0.4: Jan 1, 2025 (this version)
1.0.3: Dec 31, 2024
1.0.2: Dec 31, 2024
1.0.1: Dec 30, 2024

Download files

Source Distribution

webcrawlerapi-1.0.4.tar.gz (8.9 kB)

Uploaded Jan 1, 2025 Source

Built Distribution

webcrawlerapi-1.0.4-py3-none-any.whl (8.4 kB)

Uploaded Jan 1, 2025 Python 3

File details

Details for the file webcrawlerapi-1.0.4.tar.gz.

File metadata

Download URL: webcrawlerapi-1.0.4.tar.gz
Upload date: Jan 1, 2025
Size: 8.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for webcrawlerapi-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`a4ce3a7cc5282b5e7e17b906eed97d63e02611aa86aca7b896da9ab2760acee8`
MD5	`a8761b222d60bce51ef56115709850cf`
BLAKE2b-256	`5abe82b014647c0c9d613862f7280cba0702867d5d7b4de252791794f2ecfd7b`

File details

Details for the file webcrawlerapi-1.0.4-py3-none-any.whl.

File metadata

Download URL: webcrawlerapi-1.0.4-py3-none-any.whl
Upload date: Jan 1, 2025
Size: 8.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for webcrawlerapi-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6231e689e20845848e8f6a2616d50b0c0cb8b6e8e0e5cb348f295c4ba1a048c9`
MD5	`4a294125d49c5e80065f882b5e9566c6`
BLAKE2b-256	`8b84145ad35c0b4885a902f906605cdb4b5ef2ea957005085abe1188ae7e5bc5`
