Python SDK for DivParser API - Web scraping and HTML parsing with AI-powered extraction

DivParser Python SDK

A Python SDK for DivParser - AI-powered web scraping and HTML parsing.

Features

  • Web Scraping: Extract structured data from web pages
  • HTML Parsing: Parse raw HTML content directly
  • Async Job Handling: Non-blocking job submission with status polling
  • Pagination Support: Scrape multiple URLs in a single batch
  • Simple API: Pythonic interface to the DivParser REST API

Installation

pip install divparser

Or, if you use uv:

uv pip install divparser

Quick Start

Setup

from divparser import DivParser

# Initialize the client with your API key
client = DivParser(api_key="your_api_key_here")

Get your API key from the DivParser Console.

Scraping a Web Page

# Scrape a single page and wait for results
result = client.scrape_and_parse(
    url="https://example.com/products",
    schema="Extract product name, price, and rating from each item"
)

# Access the extracted data
for item in result["results"][0]["data"]:
    print(item)

Parsing HTML Content

# Parse HTML content directly
html_content = "<html><body><h1>Title</h1><p>Content</p></body></html>"

result = client.parse_and_wait(
    html=html_content,
    schema="Extract all headings and paragraphs"
)

# Get the parsed data
data = result["results"][0]["data"]
print(data)

Paginated Scraping

# Scrape multiple URLs
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3"
]

result = client.scrape_paginated(
    urls=urls,
    schema="Extract product name and price",
    wait=True
)

# Combine results from all pages
from divparser.utils import flatten_results
all_items = flatten_results(result["results"])

API Reference

Scraping

scrape(url, schema, name=None, page_type="LISTING", wait=False, timeout=300)

Create a scrape job for a single URL.

Parameters:

  • url (str): Target page URL
  • schema (str): Extraction instructions (plain English or Nestlang)
  • name (str, optional): Friendly label for this scrape
  • page_type (str): "LISTING" (default) or "DETAIL"
  • wait (bool): Wait for completion before returning
  • timeout (int): Max seconds to wait (only if wait=True)

Returns: Dictionary with scrapeId, jobId, and optionally results
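
With wait=False, scrape returns as soon as the job is queued, and the results have to be fetched in a separate step. The sketch below shows one way to chain the documented methods for that flow. The helper name submit_and_fetch is hypothetical and not part of the SDK, and client is assumed to be a DivParser instance.

```python
def submit_and_fetch(client, url, schema, timeout=300, poll_interval=1.0):
    """Submit a scrape without blocking, wait for the job, then fetch results.

    Illustrative helper, not part of the SDK. `client` is assumed to be a
    DivParser instance exposing scrape, wait_for_completion, and get_scrape.
    """
    # With wait=False the call returns once the job has been created.
    job = client.scrape(url=url, schema=schema, wait=False)

    # Other work could run here before blocking on the job.
    client.wait_for_completion(
        job["jobId"], timeout=timeout, poll_interval=poll_interval
    )

    # Status is tracked by jobId; results are retrieved by scrapeId.
    return client.get_scrape(job["scrapeId"])
```

Keeping both IDs from the creation response matters here, because the status methods take the jobId while get_scrape takes the scrapeId.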

scrape_paginated(urls, schema, name=None, page_type="LISTING", wait=False, timeout=300)

Create a scrape job for multiple URLs.

Parameters:

  • urls (List[str]): List of URLs to scrape
  • schema (str): Extraction instructions
  • name (str, optional): Friendly label
  • page_type (str): "LISTING" or "DETAIL"
  • wait (bool): Wait for completion
  • timeout (int): Max seconds to wait

Returns: Dictionary with scrapeId, jobId, and optionally results

list_scrapes(limit=20, cursor=None)

List all scrapes for the authenticated user.

Returns: Dictionary with list of scrapes and pagination info
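
The limit and cursor parameters suggest cursor-based pagination. The following is a rough sketch of walking every page. The response key names "scrapes" and "nextCursor" are assumptions, since this reference does not name them, so they are exposed as parameters. The helper iter_scrapes is hypothetical.

```python
def iter_scrapes(client, page_size=20, items_key="scrapes", cursor_key="nextCursor"):
    """Yield every scrape by following cursors from page to page.

    ASSUMPTION: the default key names are guesses. Inspect a real
    list_scrapes response and pass the actual keys.
    """
    cursor = None
    while True:
        page = client.list_scrapes(limit=page_size, cursor=cursor)
        yield from page.get(items_key, [])
        cursor = page.get(cursor_key)
        if not cursor:  # no cursor means this was the last page
            break
```

Because it is a generator, callers can stop early without fetching the remaining pages.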

get_scrape(scrape_id)

Retrieve a scrape and its results by ID.

Parameters:

  • scrape_id (str): The scrapeId from creation

Returns: Dictionary with scrape details and results

Parsing

parse(html, schema, name=None, wait=False, timeout=300)

Submit raw HTML for structured extraction.

Parameters:

  • html (str): Full HTML content to parse
  • schema (str): Extraction instructions
  • name (str, optional): Friendly label
  • wait (bool): Wait for completion
  • timeout (int): Max seconds to wait

Returns: Dictionary with scrapeId, jobId, and optionally results

get_parse(parse_id)

Retrieve results for a completed parse job.

Parameters:

  • parse_id (str): The scrapeId from parse creation

Returns: Dictionary with parse details and results

Utilities

check_status(job_id)

Poll the status of a job.

Parameters:

  • job_id (str): The jobId returned from creation

Returns: Dictionary with completed (bool) and state (str)

wait_for_completion(job_id, timeout=300, poll_interval=1.0)

Wait for a job to complete.

Parameters:

  • job_id (str): The jobId to poll
  • timeout (int): Max seconds to wait
  • poll_interval (float): Seconds between polls

Returns: Status dictionary when completed

Raises: TimeoutError if the job does not complete within timeout seconds
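
To make the polling behaviour concrete, here is a minimal sketch of a loop built on a check_status-style callable. It illustrates the general technique and may differ from how the SDK implements wait_for_completion. The function name poll_until_complete is hypothetical.

```python
import time


def poll_until_complete(check_status, job_id, timeout=300, poll_interval=1.0):
    """Call check_status(job_id) until it reports completion or time runs out.

    Illustrative only; `check_status` is any callable returning a dict
    with a boolean "completed" key, such as client.check_status.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = check_status(job_id)
        if status["completed"]:
            return status
        # Stop if the next sleep would run past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")
        time.sleep(poll_interval)
```

time.monotonic is used for the deadline so that system clock adjustments do not shorten or extend the wait.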

Utility Functions

The divparser.utils module provides helper functions for working with results:

from divparser.utils import (
    extract_data_from_results,
    flatten_results,
    filter_results_by_status,
    get_results_by_url,
    get_result_stats
)

# `results` is the list stored under result["results"] in a completed job's response

# Flatten nested results
all_items = flatten_results(results)

# Get statistics
stats = get_result_stats(results)
print(f"Success rate: {stats['success_rate']:.1f}%")

# Group by URL
by_url = get_results_by_url(results)
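
For intuition about what flattening means here, the sketch below concatenates the "data" list of each result entry, following the result["results"][0]["data"] shape used in the Quick Start. It is a stand-in written for illustration, not the library's flatten_results, and skipping entries without a "data" list is an assumption about how incomplete entries should be handled.

```python
def flatten(results):
    """Concatenate the "data" lists of all result entries into one list.

    Illustrative stand-in, not divparser.utils.flatten_results.
    ASSUMPTION: entries whose "data" is missing or not a list are skipped.
    """
    items = []
    for entry in results:
        data = entry.get("data")
        if isinstance(data, list):
            items.extend(data)
    return items


# Sample data invented for this example.
results = [
    {"data": [{"name": "Widget Pro"}, {"name": "Widget Lite"}]},
    {"data": [{"name": "Widget Max"}]},
    {"data": None},
]
print(len(flatten(results)))  # prints 3
```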

Examples

Example 1: Extract Job Listings

from divparser import DivParser

client = DivParser(api_key="your_api_key")

result = client.scrape_and_parse(
    url="https://example-jobs.com/listings",
    schema="""
    Extract the following for each job:
    - job title
    - company name
    - location
    - salary range (if available)
    """,
    name="Job Listings Scrape"
)

# Key names depend on how the extraction labels each field; adjust them to match your output
for job in result["results"][0]["data"]:
    print(f"{job['title']} at {job['company']} in {job['location']}")

Example 2: Parse Product Information from HTML

html_content = """
<html>
<body>
    <div class="product">
        <h2>Widget Pro</h2>
        <p class="price">$49.99</p>
        <p class="rating">4.5 stars</p>
    </div>
    <div class="product">
        <h2>Widget Lite</h2>
        <p class="price">$19.99</p>
        <p class="rating">4.2 stars</p>
    </div>
</body>
</html>
"""

result = client.parse_and_wait(
    html=html_content,
    schema="Extract product name, price, and rating"
)

for product in result["results"][0]["data"]:
    print(f"{product['name']}: {product['price']} ({product['rating']})")

Example 3: Batch Scraping Multiple Pages

from divparser.utils import flatten_results

pages = [f"https://example.com/products?page={i}" for i in range(1, 4)]

result = client.scrape_paginated(
    urls=pages,
    schema="Extract product ID, name, and price",
    wait=True  # results are only included once the job has completed
)

# Get all products from all pages
all_products = flatten_results(result["results"])
print(f"Total products: {len(all_products)}")

Error Handling

from divparser import DivParser
import requests

client = DivParser(api_key="your_api_key")

try:
    result = client.scrape_and_parse(
        url="https://example.com",
        schema="Extract content"
    )
except requests.exceptions.HTTPError as e:
    print(f"API Error: {e}")
except TimeoutError as e:
    print(f"Job timed out: {e}")

Best Practices

  1. Use Descriptive Schemas: Clear instructions in your schema lead to better extraction
  2. Set Appropriate Timeouts: Complex extractions may need longer timeouts
  3. Batch Operations: Use scrape_paginated for multiple URLs instead of individual requests
  4. Handle Errors: Always catch exceptions in production code
  5. Reuse Clients: Create one client instance and reuse it
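
As an illustration of the batching advice, the sketch below splits a long URL list into fixed-size batches and sends each one through scrape_paginated. The helper scrape_in_batches is hypothetical, and the default batch size of 50 is arbitrary: this document does not state a per-request URL limit.

```python
def scrape_in_batches(client, urls, schema, batch_size=50, timeout=300):
    """Scrape urls in fixed-size batches and collect every result entry.

    Illustrative helper, not part of the SDK.
    ASSUMPTION: batch_size=50 is arbitrary; no documented limit is known.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    all_results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # wait=True so each response already contains its results.
        response = client.scrape_paginated(
            urls=batch, schema=schema, wait=True, timeout=timeout
        )
        all_results.extend(response["results"])
    return all_results
```

The same client instance is passed in and reused for every batch, in line with the last item above.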

API Documentation

For more detailed information, visit DivParser API Reference.

License

MIT

Support

For issues, questions, or feature requests, visit DivParser Support.
