Python SDK for WebCrawler API
WebCrawler API Python SDK
A Python SDK for interacting with the WebCrawlerAPI.
To use the API, you need an API key from WebCrawlerAPI.
Installation
pip install webcrawlerapi
Usage
from webcrawlerapi import WebCrawlerAPI
# Initialize the client
crawler = WebCrawlerAPI(api_key="your_api_key")
# Synchronous crawling (blocks until completion)
job = crawler.crawl(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False,
    max_polls=100  # Optional: maximum number of status checks
)
print(f"Job completed with status: {job.status}")
# Access job items and their content
for item in job.job_items:
    print(f"Page title: {item.title}")
    print(f"Original URL: {item.original_url}")
    print(f"Item status: {item.status}")
    # Get the content based on job's scrape_type
    # Returns None if item is not in "done" status
    content = item.content
    if content:
        print(f"Content length: {len(content)}")
        print(f"Content preview: {content[:200]}...")
    else:
        print("Content not available or item not done")
# Access job items and their parent job
for item in job.job_items:
    print(f"Item URL: {item.original_url}")
    print(f"Parent job status: {item.job.status}")
    print(f"Parent job URL: {item.job.url}")
# Or use asynchronous crawling
response = crawler.crawl_async(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False
)
# Get the job ID from the response
job_id = response.id
print(f"Crawling job started with ID: {job_id}")
# Check job status and get results
job = crawler.get_job(job_id)
print(f"Job status: {job.status}")
# Access job details
print(f"Crawled URL: {job.url}")
print(f"Created at: {job.created_at}")
print(f"Number of items: {len(job.job_items)}")
# Cancel a running job if needed
cancel_response = crawler.cancel_job(job_id)
print(f"Cancellation response: {cancel_response['message']}")
API Methods
crawl()
Starts a new crawling job and waits for it to complete. The method polls the job status until one of the following happens:
- The job reaches a terminal state (done, error, or cancelled)
- The maximum number of polls is reached (default: 100)
The polling interval is taken from the server's recommended_pull_delay_ms, or defaults to 5 seconds.
crawl_async()
Starts a new crawling job and returns immediately with a job ID. Use this when you want to handle polling and status checks yourself, or when using webhooks.
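When relying on webhooks, your server needs an endpoint that accepts the POST request. The sketch below uses only the Python standard library. The payload format is not documented here, so the JSON body with an "id" field is purely an assumption, and `extract_job_id`, `WebhookHandler`, and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_job_id(raw_body: bytes):
    """Return the job ID from a webhook body, or None if it is absent.

    ASSUMPTION: the body is a JSON object with an "id" field.
    """
    try:
        payload = json.loads(raw_body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("id")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        job_id = extract_job_id(self.rfile.read(length))
        # Acknowledge quickly; fetch results (e.g. crawler.get_job) elsewhere.
        self.send_response(200 if job_id else 400)
        self.end_headers()
        if job_id:
            print(f"Job finished: {job_id}")


def make_server(port=8000):
    """Build (but do not start) a server; call .serve_forever() to run it."""
    return HTTPServer(("", port), WebhookHandler)
```

Calling `make_server().serve_forever()` starts listening; once a job ID arrives, `crawler.get_job(job_id)` can be used to retrieve the results.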
get_job()
Retrieves the current status and details of a specific job.
cancel_job()
Cancels a running job. Any items that are not in progress or already completed will be marked as canceled and will not be charged.
Parameters
Crawl Methods (crawl and crawl_async)
- url (required): The seed URL where the crawler starts. Can be any valid URL.
- scrape_type (default: "html"): The type of scraping you want to perform. Can be "html", "cleaned", or "markdown".
- items_limit (default: 10): Crawler will stop when it reaches this limit of pages for this job.
- webhook_url (optional): The URL where the server will send a POST request once the task is completed.
- allow_subdomains (default: False): If True, the crawler will also crawl subdomains.
- whitelist_regexp (optional): A regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
- blacklist_regexp (optional): A regular expression to blacklist URLs. URLs that match the pattern will be skipped.
- max_polls (optional, crawl only): Maximum number of status checks before returning (default: 100)
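To get a feel for how whitelist_regexp and blacklist_regexp interact, it can help to try patterns locally before starting a job. The sketch below is only an approximation: the filtering happens on the server, and the regular expression dialect, the use of substring matching (`re.search`), and the blacklist taking precedence are all assumptions. `should_crawl` is a hypothetical helper, not part of the SDK.

```python
import re


def should_crawl(url, whitelist_regexp=None, blacklist_regexp=None):
    """Return True if url passes the optional whitelist and blacklist patterns."""
    # ASSUMPTION: a blacklist match wins over a whitelist match.
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/python",
    "https://example.com/pricing",
]
# Keep blog pages, but skip tag listings.
kept = [u for u in urls if should_crawl(u, r"/blog/", r"/tag/")]
print(kept)
```

Here only the first URL is kept: the second matches the blacklist and the third does not match the whitelist.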
Responses
CrawlAsync Response
The crawl_async() method returns a CrawlResponse object with:
id: The unique identifier of the created job
Job Response
The Job object contains detailed information about the crawling job:
- id: The unique identifier of the job
- org_id: Your organization identifier
- url: The seed URL where the crawler started
- status: The status of the job (new, in_progress, done, error)
- scrape_type: The type of scraping performed
- created_at: The date when the job was created
- finished_at: The date when the job was finished (if completed)
- webhook_url: The webhook URL for notifications
- webhook_status: The status of the webhook request
- webhook_error: Any error message if the webhook request failed
- job_items: List of JobItem objects representing crawled pages
- recommended_pull_delay_ms: Server-recommended delay between status checks
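As a small illustration of working with these fields, the sketch below tallies item statuses for a job. `count_item_statuses` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic only the `job_items` and `status` attributes listed above.

```python
from collections import Counter
from types import SimpleNamespace


def count_item_statuses(job):
    """Return a {status: count} dict for the items attached to a job."""
    return dict(Counter(item.status for item in job.job_items))


# Stand-in for a Job returned by crawl() or get_job().
fake_job = SimpleNamespace(
    status="done",
    job_items=[
        SimpleNamespace(status="done"),
        SimpleNamespace(status="done"),
        SimpleNamespace(status="error"),
    ],
)
print(count_item_statuses(fake_job))
```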
JobItem Properties
Each JobItem object represents a crawled page and contains:
- id: The unique identifier of the item
- job_id: The parent job identifier
- job: Reference to the parent Job object
- original_url: The URL of the page
- page_status_code: The HTTP status code of the page request
- status: The status of the item (new, in_progress, done, error)
- title: The page title
- created_at: The date when the item was created
- cost: The cost of the item in $
- referred_url: The URL where the page was referred from
- last_error: Any error message if the item failed
- content: The page content based on the job's scrape_type (html, cleaned, or markdown). Returns None if the item's status is not "done" or if content is not available. Content is automatically fetched and cached when accessed.
- raw_content_url: URL to the raw content (if available)
- cleaned_content_url: URL to the cleaned content (if scrape_type is "cleaned")
- markdown_content_url: URL to the markdown content (if scrape_type is "markdown")
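The "fetched and cached when accessed" behaviour of content can be pictured with a lazy property. The class below is a sketch of that pattern under stated assumptions, not the SDK's JobItem: `LazyContentItem`, its `content_url` attribute, and the injected `fetch` callable are hypothetical.

```python
class LazyContentItem:
    """Stand-in for a crawled item whose content is fetched on first access."""

    def __init__(self, status, content_url, fetch):
        self.status = status
        self.content_url = content_url
        self._fetch = fetch  # hypothetical callable: url -> str
        self._content = None

    @property
    def content(self):
        # No content until the item is done and a URL is available.
        if self.status != "done" or not self.content_url:
            return None
        if self._content is None:
            self._content = self._fetch(self.content_url)  # first access only
        return self._content


calls = []  # records every fetch so caching can be observed


def fake_fetch(url):
    calls.append(url)
    return "# Example page"


done_item = LazyContentItem("done", "https://example.com/content.md", fake_fetch)
pending_item = LazyContentItem("in_progress", None, fake_fetch)

print(done_item.content, done_item.content, len(calls))
print(pending_item.content)
```

The second access to `done_item.content` reuses the cached value, so the fetch runs only once; the pending item returns None without fetching at all.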
Requirements
- Python 3.6+
- requests>=2.25.0
License
MIT License
File details
Details for the file webcrawlerapi-1.0.3.tar.gz.
File metadata
- Download URL: webcrawlerapi-1.0.3.tar.gz
- Upload date:
- Size: 6.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a219f2ba5afcce10eeb5aa43c8605ca0e01e03c9f25351121ebaed6a634b9c1e |
| MD5 | 9f68466b6e6167c6d3b0a41a60c1dbe0 |
| BLAKE2b-256 | 994b1180d54e78637ec3cc8484ec1086812fda7b228722e23f7db104e66078a0 |
File details
Details for the file webcrawlerapi-1.0.3-py3-none-any.whl.
File metadata
- Download URL: webcrawlerapi-1.0.3-py3-none-any.whl
- Upload date:
- Size: 6.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4734635e09c73a3759423cbc5b55ed50961abfbc05fe030840ca8281dad4739e |
| MD5 | 59bee04b07c03ac77b3b8f99e95eb176 |
| BLAKE2b-256 | 55347f09f45ca0efc9824ca94798e484489a0b67f8af27ca883398fbcbd3a2f2 |