Python SDK for DivParser API - Web scraping and HTML parsing with AI-powered extraction

DivParser Python SDK

A Python SDK for DivParser - AI-powered web scraping and HTML parsing.

Features

Web Scraping: Extract structured data from web pages
HTML Parsing: Parse raw HTML content directly
Async Job Handling: Non-blocking job submission with status polling
Pagination Support: Scrape multiple URLs in a single batch
Simple API: Pythonic interface to the DivParser REST API

Installation

pip install divparser

Or, if you use uv:

uv pip install divparser

Quick Start

Setup

from divparser import DivParser

# Initialize the client with your API key
client = DivParser(api_key="your_api_key_here")

Get your API key from the DivParser Console.
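To keep the key out of source control, one option is to read it from an environment variable before creating the client. The sketch below is an illustration only: the variable name DIVPARSER_API_KEY and the helper load_api_key are hypothetical, and nothing here implies the SDK reads that variable on its own.

```python
import os


def load_api_key(env=os.environ, var_name="DIVPARSER_API_KEY"):
    """Return the API key stored in an environment variable.

    DIVPARSER_API_KEY is a hypothetical name chosen for this sketch;
    the SDK itself is not known to read it.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before creating the client")
    return key


# Usage (requires the divparser package and a real key):
# from divparser import DivParser
# client = DivParser(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first API call.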

Scraping a Web Page

# Scrape a single page and wait for results
result = client.scrape_and_parse(
    url="https://example.com/products",
    schema="Extract product name, price, and rating from each item"
)

# Access the extracted data
for item in result["results"][0]["data"]:
    print(item)

Parsing HTML Content

# Parse HTML content directly
html_content = "<html><body><h1>Title</h1><p>Content</p></body></html>"

result = client.parse_and_wait(
    html=html_content,
    schema="Extract all headings and paragraphs"
)

# Get the parsed data
data = result["results"][0]["data"]
print(data)

Paginated Scraping

# Scrape multiple URLs
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3"
]

result = client.scrape_paginated(
    urls=urls,
    schema="Extract product name and price",
    wait=True
)

# Combine results from all pages
from divparser.utils import flatten_results
all_items = flatten_results(result["results"])

API Reference

Scraping

`scrape(url, schema, name=None, page_type="LISTING", wait=False, timeout=300)`

Create a scrape job for a single URL.

Parameters:

url (str): Target page URL
schema (str): Extraction instructions (plain English or Nestlang)
name (str, optional): Friendly label for this scrape
page_type (str): "LISTING" (default) or "DETAIL"
wait (bool): Wait for completion before returning
timeout (int): Max seconds to wait (only if wait=True)

Returns: Dictionary with scrapeId, jobId, and, when wait=True, results

`scrape_paginated(urls, schema, name=None, page_type="LISTING", wait=False, timeout=300)`

Create a scrape job for multiple URLs.

Parameters:

urls (List[str]): Array of URLs to scrape
schema (str): Extraction instructions
name (str, optional): Friendly label
page_type (str): "LISTING" or "DETAIL"
wait (bool): Wait for completion
timeout (int): Max seconds to wait

Returns: Dictionary with scrapeId, jobId, and, when wait=True, results

`list_scrapes(limit=20, cursor=None)`

List all scrapes for the authenticated user.

Returns: Dictionary with list of scrapes and pagination info
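Because list_scrapes accepts a cursor, walking the full history means calling it repeatedly and passing along the cursor from each response. The sketch below shows that loop under stated assumptions: the response keys "scrapes" and "nextCursor" are guesses, not documented names, and iter_scrapes is a hypothetical helper that is not part of the SDK.

```python
def iter_scrapes(list_scrapes, limit=20,
                 items_key="scrapes", cursor_key="nextCursor"):
    """Yield every scrape by following the pagination cursor.

    list_scrapes is a callable such as client.list_scrapes. The response
    keys "scrapes" and "nextCursor" are assumptions; inspect a real
    response and adjust items_key / cursor_key to match.
    """
    cursor = None
    while True:
        page = list_scrapes(limit=limit, cursor=cursor)
        yield from page.get(items_key, [])
        cursor = page.get(cursor_key)
        if not cursor:
            break  # no further pages
```

With a real client this could be used as `for scrape in iter_scrapes(client.list_scrapes): ...`, once the key names have been checked against an actual response.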

`get_scrape(scrape_id)`

Retrieve a scrape and its results by ID.

Parameters:

scrape_id (str): The scrapeId returned when the scrape was created

Returns: Dictionary with scrape details and results

Parsing

`parse(html, schema, name=None, wait=False, timeout=300)`

Submit raw HTML for structured extraction.

Parameters:

html (str): Full HTML content to parse
schema (str): Extraction instructions
name (str, optional): Friendly label
wait (bool): Wait for completion
timeout (int): Max seconds to wait

Returns: Dictionary with scrapeId, jobId, and, when wait=True, results

`get_parse(parse_id)`

Retrieve results for a completed parse job.

Parameters:

parse_id (str): The scrapeId returned when the parse job was created

Returns: Dictionary with parse details and results

Utilities

`check_status(job_id)`

Poll the status of a job.

Parameters:

job_id (str): The jobId returned from creation

Returns: Dictionary with completed (bool) and state (str)
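The completed and state fields are enough to drive a hand-written polling loop when more control is needed than a single blocking call offers. The following is a minimal sketch, not the SDK's implementation: poll_until_complete is a hypothetical helper that accepts any callable with the same return shape as check_status.

```python
import time


def poll_until_complete(check_status, job_id, timeout=300, poll_interval=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call check_status(job_id) until it reports completed=True.

    check_status is any callable returning a dict with "completed" and
    "state" keys, such as client.check_status.
    """
    deadline = clock() + timeout
    while True:
        status = check_status(job_id)
        if status.get("completed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Job {job_id} still in state {status.get('state')!r} "
                f"after {timeout} seconds"
            )
        sleep(poll_interval)
```

Using a monotonic clock keeps the deadline stable if the system time changes, and the injectable sleep and clock arguments make the loop easy to test without waiting.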

`wait_for_completion(job_id, timeout=300, poll_interval=1.0)`

Wait for a job to complete.

Parameters:

job_id (str): The jobId to poll
timeout (int): Max seconds to wait
poll_interval (float): Seconds between polls

Returns: Status dictionary when completed

Raises: TimeoutError if the job doesn't complete within timeout seconds
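Taken together, the methods in this reference suggest a non-blocking workflow: submit with wait=False, wait on the jobId, then fetch results by scrapeId. The sketch below strings those calls together. scrape_async is a hypothetical helper, and the exact response shapes should be confirmed against the live API.

```python
def scrape_async(client, url, schema, timeout=300):
    """Submit a scrape without blocking, wait for it, then fetch results.

    client is expected to expose scrape, wait_for_completion, and
    get_scrape as described in this reference.
    """
    # Returns immediately with identifiers for the new job.
    job = client.scrape(url=url, schema=schema, wait=False)

    # Other work can happen here while the job runs.

    # Blocks until the job completes or raises TimeoutError.
    client.wait_for_completion(job["jobId"], timeout=timeout)

    # Fetch the finished scrape, including its results.
    return client.get_scrape(job["scrapeId"])
```

Splitting submission from retrieval like this makes it possible to start several jobs first and collect their results afterwards.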

Utility Functions

The divparser.utils module provides helper functions for working with results:

from divparser.utils import (
    extract_data_from_results,
    flatten_results,
    filter_results_by_status,
    get_results_by_url,
    get_result_stats
)

# Flatten nested results
all_items = flatten_results(results)

# Get statistics
stats = get_result_stats(results)
print(f"Success rate: {stats['success_rate']:.1f}%")

# Group by URL
by_url = get_results_by_url(results)
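For readers wondering what flattening means here, the sketch below shows one plausible reading. It is not the library's implementation, and the behavior of divparser.utils.flatten_results may differ. It assumes each result entry carries its items under a "data" key, as in the Quick Start, and the "url" field in the sample is likewise an assumption.

```python
def flatten(results):
    """Merge the "data" lists of every result entry into one list.

    Assumes each entry looks like {"data": [...]}, matching the
    result["results"][0]["data"] access used in the Quick Start.
    Entries without a list under "data" are skipped.
    """
    items = []
    for entry in results:
        data = entry.get("data")
        if isinstance(data, list):
            items.extend(data)
    return items


# Hypothetical sample shaped like a three-page scrape.
sample = [
    {"url": "https://example.com/page/1", "data": [{"name": "A"}, {"name": "B"}]},
    {"url": "https://example.com/page/2", "data": [{"name": "C"}]},
    {"url": "https://example.com/page/3", "data": None},  # e.g. a failed page
]
print(flatten(sample))
```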

Examples

Example 1: Extract Job Listings

from divparser import DivParser

client = DivParser(api_key="your_api_key")

result = client.scrape_and_parse(
    url="https://example-jobs.com/listings",
    schema="""
    Extract the following for each job:
    - job title
    - company name
    - location
    - salary range (if available)
    """,
    name="Job Listings Scrape"
)

for job in result["results"][0]["data"]:
    print(f"{job['title']} at {job['company']} in {job['location']}")

Example 2: Parse Product Information from HTML

html_content = """
<html>
<body>
    <div class="product">
        <h2>Widget Pro</h2>
        <p class="price">$49.99</p>
        <p class="rating">4.5 stars</p>
    </div>
    <div class="product">
        <h2>Widget Lite</h2>
        <p class="price">$19.99</p>
        <p class="rating">4.2 stars</p>
    </div>
</body>
</html>
"""

result = client.parse_and_wait(
    html=html_content,
    schema="Extract product name, price, and rating"
)

for product in result["results"][0]["data"]:
    print(f"{product['name']}: {product['price']} ({product['rating']})")

Example 3: Batch Scraping Multiple Pages

from divparser.utils import flatten_results

pages = [f"https://example.com/products?page={i}" for i in range(1, 4)]

result = client.scrape_paginated(
    urls=pages,
    schema="Extract product ID, name, and price",
    wait=True  # block until the job finishes so "results" is populated
)

# Get all products from all pages
all_products = flatten_results(result["results"])
print(f"Total products: {len(all_products)}")

Error Handling

from divparser import DivParser
import requests

client = DivParser(api_key="your_api_key")

try:
    result = client.scrape_and_parse(
        url="https://example.com",
        schema="Extract content"
    )
except requests.exceptions.HTTPError as e:
    print(f"API Error: {e}")
except TimeoutError as e:
    print(f"Job timed out: {e}")
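When a timeout is likely to be transient, the try/except pattern above can be wrapped in a small retry helper. This is a generic sketch with exponential backoff, assuming that retrying is safe for your workload. call_with_retries is a hypothetical helper and not part of the SDK.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Call fn() and retry on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `call_with_retries(lambda: client.scrape_and_parse(url="https://example.com", schema="Extract content"))` would then retry only on TimeoutError. HTTP errors are left out of retry_on by default, since many of them (such as an invalid API key) will not succeed on a second attempt.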

Best Practices

Use Descriptive Schemas: Clear instructions in your schema lead to better extraction
Set Appropriate Timeouts: Complex extractions may need longer timeouts
Batch Operations: Use scrape_paginated for multiple URLs instead of individual requests
Handle Errors: Always catch exceptions in production code
Reuse Clients: Create one client instance and reuse it

API Documentation

For more detailed information, visit DivParser API Reference.

License

MIT

Support

For issues, questions, or feature requests, visit DivParser Support.

Project details

This version

0.1.0

Jun 2, 2026

Download files

Source Distribution

divparser-0.1.0.tar.gz (6.1 kB view details)

Uploaded Jun 2, 2026 Source

Built Distribution

divparser-0.1.0-py3-none-any.whl (8.2 kB view details)

Uploaded Jun 2, 2026 Python 3

File details

Details for the file divparser-0.1.0.tar.gz.

File metadata

Download URL: divparser-0.1.0.tar.gz
Upload date: Jun 2, 2026
Size: 6.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for divparser-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8bc1d06536a099c6b080292653e2af44e78d3c5918ca6420203d92f93bb8c7ac`
MD5	`dc461b102f2140fa7d2bd385d85d68f5`
BLAKE2b-256	`c85e76409ce44775b54096f401ce3542896cea251dea96cd3d31067323d18c24`

File details

Details for the file divparser-0.1.0-py3-none-any.whl.

File metadata

Download URL: divparser-0.1.0-py3-none-any.whl
Upload date: Jun 2, 2026
Size: 8.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for divparser-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0e68646135648cbecd24bb7e7dfcb35f2ed691914d833570c1d04b08aa794b8c`
MD5	`107a560b57aff0a1d0c5b8b8b72d1c3c`
BLAKE2b-256	`1ad6d5c9fb4ab0c0306e4dac4b2b8aa59a1e92535e232a4d75dab35e69952a1d`
