Python SDK for DivParser API - Web scraping and HTML parsing with AI-powered extraction
DivParser Python SDK
A Python SDK for DivParser - AI-powered web scraping and HTML parsing.
Features
- Web Scraping: Extract structured data from web pages
- HTML Parsing: Parse raw HTML content directly
- Async Job Handling: Non-blocking job submission with status polling
- Pagination Support: Scrape multiple URLs in a single batch
- Simple API: Pythonic interface to the DivParser REST API
Installation
pip install divparser
Or if using uv:
uv pip install divparser
Quick Start
Setup
from divparser import DivParser
# Initialize the client with your API key
client = DivParser(api_key="your_api_key_here")
Get your API key from the DivParser Console.
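Hard-coding the key is fine for a quick test, but shared code usually reads it from the environment instead. Here is a minimal sketch, assuming an environment variable named `DIVPARSER_API_KEY`. That name and the `load_api_key` helper are hypothetical and chosen for this example; the SDK is not documented to read either.

```python
import os


def load_api_key(env=None, var_name="DIVPARSER_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is unset or blank.

    `var_name` is a hypothetical variable name used for illustration only.
    """
    env = os.environ if env is None else env
    key = (env.get(var_name) or "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (requires the divparser package):
#   from divparser import DivParser
#   client = DivParser(api_key=load_api_key())
```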
Scraping a Web Page
# Scrape a single page and wait for results
result = client.scrape_and_parse(
    url="https://example.com/products",
    schema="Extract product name, price, and rating from each item"
)
# Access the extracted data
for item in result["results"][0]["data"]:
    print(item)
Parsing HTML Content
# Parse HTML content directly
html_content = "<html><body><h1>Title</h1><p>Content</p></body></html>"
result = client.parse_and_wait(
    html=html_content,
    schema="Extract all headings and paragraphs"
)
# Get the parsed data
data = result["results"][0]["data"]
print(data)
Paginated Scraping
# Scrape multiple URLs
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3"
]
result = client.scrape_paginated(
    urls=urls,
    schema="Extract product name and price",
    wait=True
)
# Combine results from all pages
from divparser.utils import flatten_results
all_items = flatten_results(result["results"])
API Reference
Scraping
scrape(url, schema, name=None, page_type="LISTING", wait=False, timeout=300)
Create a scrape job for a single URL.
Parameters:
- url (str): Target page URL
- schema (str): Extraction instructions (plain English or Nestlang)
- name (str, optional): Friendly label for this scrape
- page_type (str): "LISTING" (default) or "DETAIL"
- wait (bool): Wait for completion before returning
- timeout (int): Max seconds to wait (only if wait=True)
Returns: Dictionary with scrapeId, jobId, and optionally results
scrape_paginated(urls, schema, name=None, page_type="LISTING", wait=False, timeout=300)
Create a scrape job for multiple URLs.
Parameters:
- urls (List[str]): List of URLs to scrape
- schema (str): Extraction instructions
- name (str, optional): Friendly label
- page_type (str): "LISTING" or "DETAIL"
- wait (bool): Wait for completion
- timeout (int): Max seconds to wait
Returns: Dictionary with scrapeId, jobId, and optionally results
list_scrapes(limit=20, cursor=None)
List all scrapes for the authenticated user.
Returns: Dictionary with list of scrapes and pagination info
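The `cursor` parameter suggests cursor-based pagination: each response carries a token for the next page. The sketch below walks every page. The response field names `scrapes` and `nextCursor` are assumptions, since the text above only says the response contains a list of scrapes and pagination info. They are exposed as parameters so they can be changed to match the real response.

```python
def iter_scrapes(client, limit=20, items_key="scrapes", cursor_key="nextCursor"):
    """Yield every scrape by following the cursor returned with each page.

    `items_key` and `cursor_key` are assumed response field names; adjust
    them to match what list_scrapes() actually returns.
    """
    cursor = None
    while True:
        page = client.list_scrapes(limit=limit, cursor=cursor)
        yield from page.get(items_key, [])
        cursor = page.get(cursor_key)
        if not cursor:  # no further pages
            break
```

Because this is a generator, pages are fetched lazily, so stopping the loop early avoids the remaining requests.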
get_scrape(scrape_id)
Retrieve a scrape and its results by ID.
Parameters:
- scrape_id (str): The scrapeId from creation
Returns: Dictionary with scrape details and results
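With `wait=False`, `scrape` returns immediately, and the results are fetched later with `get_scrape`. The sketch below chains the documented calls: submit, wait on the `jobId` with `wait_for_completion` (described under Utilities), then fetch by `scrapeId`. The helper name `scrape_then_fetch` is made up for this example.

```python
def scrape_then_fetch(client, url, schema, timeout=300):
    """Submit a scrape without blocking, wait for the job, then fetch results."""
    job = client.scrape(url=url, schema=schema, wait=False)
    # Other work could happen here while the job runs.
    client.wait_for_completion(job["jobId"], timeout=timeout)
    return client.get_scrape(job["scrapeId"])
```

Passing `wait=True` to `scrape` should give a similar outcome in one call. The split form helps when several jobs are submitted up front and collected afterwards.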
Parsing
parse(html, schema, name=None, wait=False, timeout=300)
Submit raw HTML for structured extraction.
Parameters:
- html (str): Full HTML content to parse
- schema (str): Extraction instructions
- name (str, optional): Friendly label
- wait (bool): Wait for completion
- timeout (int): Max seconds to wait
Returns: Dictionary with scrapeId, jobId, and optionally results
get_parse(parse_id)
Retrieve results for a completed parse job.
Parameters:
- parse_id (str): The scrapeId from parse creation
Returns: Dictionary with parse details and results
Utilities
check_status(job_id)
Poll the status of a job.
Parameters:
- job_id (str): The jobId returned from creation
Returns: Dictionary with completed (bool) and state (str)
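For long-running jobs, polling at a fixed short interval can generate many requests. The sketch below is a hand-rolled alternative that doubles the delay after each check. It relies only on the documented `completed` and `state` fields. The helper name, the delay values, and the injectable `sleep` and `clock` arguments are choices made for this example, not part of the SDK.

```python
import time


def poll_until_done(client, job_id, timeout=300.0, initial_delay=1.0,
                    max_delay=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() with exponentially growing delays.

    Returns the final status dictionary, or raises TimeoutError.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = client.check_status(job_id)
        if status.get("completed"):
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                f"Job {job_id} still {status.get('state')!r} after {timeout}s"
            )
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```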
wait_for_completion(job_id, timeout=300, poll_interval=1.0)
Wait for a job to complete.
Parameters:
- job_id (str): The jobId to poll
- timeout (int): Max seconds to wait
- poll_interval (float): Seconds between polls
Returns: Status dictionary when completed
Raises: TimeoutError if the job doesn't complete within timeout seconds
Utility Functions
The divparser.utils module provides helper functions for working with results:
from divparser.utils import (
    extract_data_from_results,
    flatten_results,
    filter_results_by_status,
    get_results_by_url,
    get_result_stats
)
# Flatten nested results
all_items = flatten_results(results)
# Get statistics
stats = get_result_stats(results)
print(f"Success rate: {stats['success_rate']:.1f}%")
# Group by URL
by_url = get_results_by_url(results)
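These helpers operate on the `results` list returned by a scrape or parse. Judging by the Quick Start, each entry holds a `data` list of extracted items. As a rough mental model of what flattening means here, the stand-in below concatenates those lists. It is an assumption about the data shape, not the library's actual implementation, and `flatten_sketch` is a made-up name.

```python
def flatten_sketch(results):
    """Concatenate the `data` list of every page result into one list.

    Illustrative stand-in only; the real flatten_results may behave differently.
    """
    items = []
    for page in results:
        data = page.get("data") or []  # tolerate missing or null data
        items.extend(data)
    return items
```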
Examples
Example 1: Extract Job Listings
from divparser import DivParser
client = DivParser(api_key="your_api_key")
result = client.scrape_and_parse(
    url="https://example-jobs.com/listings",
    schema="""
    Extract the following for each job:
    - job title
    - company name
    - location
    - salary range (if available)
    """,
    name="Job Listings Scrape"
)
for job in result["results"][0]["data"]:
    print(f"{job['title']} at {job['company']} in {job['location']}")
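Extraction output depends on the page, so a field such as the salary range may be absent, and indexing with `job['title']` would then raise `KeyError`. Here is a defensive sketch using `dict.get`. The `salary` key name is an assumption, since the loop above only shows `title`, `company`, and `location`.

```python
def format_job(job):
    """Build a one-line summary, tolerating fields the extraction left out."""
    title = job.get("title") or "Untitled"
    company = job.get("company") or "unknown company"
    location = job.get("location") or "unspecified location"
    line = f"{title} at {company} in {location}"
    salary = job.get("salary")  # assumed key name for the optional salary range
    if salary:
        line += f" ({salary})"
    return line
```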
Example 2: Parse Product Information from HTML
html_content = """
<html>
<body>
<div class="product">
<h2>Widget Pro</h2>
<p class="price">$49.99</p>
<p class="rating">4.5 stars</p>
</div>
<div class="product">
<h2>Widget Lite</h2>
<p class="price">$19.99</p>
<p class="rating">4.2 stars</p>
</div>
</body>
</html>
"""
result = client.parse_and_wait(
    html=html_content,
    schema="Extract product name, price, and rating"
)
for product in result["results"][0]["data"]:
    print(f"{product['name']}: {product['price']} ({product['rating']})")
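Extracted values will often arrive as display strings such as "$49.99" or "4.5 stars". If numbers are needed for sorting or arithmetic, a small post-processing step can convert them. The sketch below assumes the `name`, `price`, and `rating` keys used in the loop above and that each value contains at most one number of interest. Both helper names are hypothetical.

```python
import re
from decimal import Decimal

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def first_number(text):
    """Return the first decimal number found in `text` as a Decimal, or None."""
    match = _NUMBER.search(str(text).replace(",", ""))
    return Decimal(match.group()) if match else None


def normalize_product(product):
    """Convert display strings such as "$49.99" and "4.5 stars" to numbers."""
    return {
        "name": product.get("name"),
        "price": first_number(product.get("price", "")),
        "rating": first_number(product.get("rating", "")),
    }
```

`Decimal` is used rather than `float` so that prices keep their exact decimal value.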
Example 3: Batch Scraping Multiple Pages
from divparser.utils import flatten_results
pages = [f"https://example.com/products?page={i}" for i in range(1, 4)]
result = client.scrape_paginated(
    urls=pages,
    schema="Extract product ID, name, and price",
    wait=True  # block until results are available
)
# Get all products from all pages
all_products = flatten_results(result["results"])
print(f"Total products: {len(all_products)}")
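For a long URL list, one option is to split it into smaller batches and merge the page results. The sketch below does that with the documented `scrape_paginated` call. No batch-size limit is documented here, so the default of 50 is arbitrary, and both helper names are made up for this example.

```python
def chunked(urls, size):
    """Split `urls` into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def scrape_in_batches(client, urls, schema, batch_size=50):
    """Run one scrape_paginated() call per batch and merge the page results."""
    merged = []
    for batch in chunked(urls, batch_size):
        result = client.scrape_paginated(urls=batch, schema=schema, wait=True)
        merged.extend(result.get("results", []))
    return merged
```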
Error Handling
from divparser import DivParser
import requests
client = DivParser(api_key="your_api_key")
try:
    result = client.scrape_and_parse(
        url="https://example.com",
        schema="Extract content"
    )
except requests.exceptions.HTTPError as e:
    print(f"API Error: {e}")
except TimeoutError as e:
    print(f"Job timed out: {e}")
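Transient failures such as a timeout can sometimes be handled by retrying with a growing delay. The wrapper below is a generic sketch, not an SDK feature. It retries only the exception types passed in `retry_on`, which defaults to the built-in `TimeoutError`. A `requests` exception class can be passed there too if retrying network errors is wanted. Retrying a call that creates a job may submit that job more than once.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0,
                 retry_on=(TimeoutError,), sleep=time.sleep):
    """Run `call()`; on a listed exception, wait and retry with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Usage (assumes an existing `client`):
#   result = with_retries(lambda: client.scrape_and_parse(
#       url="https://example.com", schema="Extract content"))
```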
Best Practices
- Use Descriptive Schemas: Clear instructions in your schema lead to better extraction
- Set Appropriate Timeouts: Complex extractions may need longer timeouts
- Batch Operations: Use scrape_paginated for multiple URLs instead of individual requests
- Handle Errors: Always catch exceptions in production code
- Reuse Clients: Create one client instance and reuse it
API Documentation
For more detailed information, see the DivParser API Reference.
License
MIT
Support
For issues, questions, or feature requests, visit DivParser Support.