Python SDK for the Geonode Scraper API

Geonode Scraper SDK

Python SDK for the Geonode Scraper API. Supports single-URL extraction, batch extraction, site crawling, URL mapping, job polling, and usage statistics.

Requirements

  • Python 3.10+

Installation

pip install geonode-scraper-sdk

Configuration and Authentication

Create a client configuration with your API base URL and API key.

from geonode_scraper_sdk import Configuration

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

If you do not set host, the generated client defaults to http://localhost.
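
To keep the API key out of source code, the configuration values can be read from the environment. The following is a minimal sketch, not part of the SDK: the variable names GEONODE_API_HOST and GEONODE_API_KEY and the helper load_settings are hypothetical, and the helper only builds the keyword arguments shown above.

```python
import os
from typing import Mapping

# Hypothetical variable names; the SDK itself does not read these.
HOST_VAR = "GEONODE_API_HOST"
KEY_VAR = "GEONODE_API_KEY"


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for Configuration from environment variables."""
    api_key = env.get(KEY_VAR)
    if not api_key:
        raise RuntimeError(f"{KEY_VAR} is not set")
    settings = {"api_key": {"APIKeyHeader": api_key}}
    host = env.get(HOST_VAR)
    if host:
        # Pass host only when it is provided; otherwise the client default applies.
        settings["host"] = host
    return settings
```

With the SDK installed, the result can be unpacked into the constructor: configuration = Configuration(**load_settings()).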

Quick Start

Synchronous extraction blocks until the result is ready.

from geonode_scraper_sdk import (
    ApiClient,
    ApiException,
    Configuration,
    ExtractRequest,
    ExtractionApi,
    OutputFormat,
    ProcessingMode,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    try:
        response = api.extract_v1_extract_post(
            ExtractRequest(
                url="https://example.com",
                formats=[OutputFormat.MARKDOWN],
                processing_mode=ProcessingMode.SYNC,
            )
        )
        print(response.data.markdown)
        print(response.tokens_charged)
    except ApiException as exc:
        print(exc.status)
        print(exc.body)

Async Extraction Workflow

When processing_mode=ProcessingMode.ASYNC, the extract call returns an async job response with a job ID and status URL.

from geonode_scraper_sdk import ApiClient, Configuration, ExtractRequest, ExtractionApi, OutputFormat, ProcessingMode

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    # Request markdown explicitly; formats otherwise defaults to HTML.
    submit = api.extract_v1_extract_post(
        ExtractRequest(
            url="https://example.com",
            formats=[OutputFormat.MARKDOWN],
            processing_mode=ProcessingMode.ASYNC,
        )
    )

    # Fetch the job once. It may still be in progress, so repeat this call until it finishes.
    job = api.get_job_result_v1_extract_job_id_get(submit.job_id)
    print(job.status)
    if job.data and job.data.markdown:
        print(job.data.markdown)

Use get_job_result_v1_extract_job_id_get(job_id) to poll a single job, or list_jobs_v1_extract_jobs_get(...) to inspect and filter job history.
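
Polling usually means repeating the single-job call until the status stops changing. The helper below is a sketch under stated assumptions: poll_job is a hypothetical name, the terminal status names "completed" and "failed" are guesses that should be checked against the real API, and the fetch callable stands in for the SDK call so the loop itself has no SDK dependency.

```python
import time
from typing import Callable, Collection, TypeVar

T = TypeVar("T")

# Assumed terminal status names; verify them against the API's actual values.
TERMINAL_STATUSES = frozenset({"completed", "failed"})


def poll_job(
    fetch: Callable[[], T],
    terminal: Collection[str] = TERMINAL_STATUSES,
    interval: float = 2.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until the job status is terminal or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch()
        # Accept either a plain string or an enum-like object with a .value.
        status = getattr(job.status, "value", job.status)
        if str(status).lower() in terminal:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(interval)
```

With the SDK, the call would look like poll_job(lambda: api.get_job_result_v1_extract_job_id_get(submit.job_id)).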

Batch Extraction

Submit multiple URLs in one request and poll for results.

from geonode_scraper_sdk import ApiClient, BatchApi, BatchRequest, Configuration, OutputFormat

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = BatchApi(api_client)

    accepted = api.create_batch_v1_batch_post(
        BatchRequest(
            urls=["https://example.com", "https://example.org"],
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.accepted_urls)

    status = api.get_batch_status_v1_batch_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_urls, status.total_urls)
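
The status call above returns one page at a time. A sketch of walking every page follows; iter_status_pages is a hypothetical helper, and it assumes pages are numbered from 1 (as in the example) and that total_urls gives the number of entries spread across the pages.

```python
import math
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def iter_status_pages(
    fetch_page: Callable[[int], T],
    total_items: int,
    page_size: int = 10,
) -> Iterator[T]:
    """Yield one status response per page, starting at page 1."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    page_count = math.ceil(total_items / page_size)
    for page in range(1, page_count + 1):
        yield fetch_page(page)
```

With the SDK, fetch_page could be lambda page: api.get_batch_status_v1_batch_job_id_get(job_id=accepted.job_id, page=page, page_size=10), with total_items taken from status.total_urls.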

Site Crawling

Crawl a website from a seed URL up to a configurable depth and page limit.

from geonode_scraper_sdk import ApiClient, Configuration, CrawlApi, CrawlRequest, OutputFormat

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = CrawlApi(api_client)

    accepted = api.create_crawl_v1_crawl_post(
        CrawlRequest(
            url="https://example.com",
            depth=2,
            limit=50,
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.estimated_pages)

    status = api.get_crawl_status_v1_crawl_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_pages, status.total_pages)
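
A crawl can run for a while, so it may be useful to stop it after a deadline. The sketch below is illustrative only: wait_or_cancel and pages_done are hypothetical names, and treating "completed_pages has reached total_pages" as finished is an assumption, since total_pages may grow while the crawl discovers links. Pass a different is_done predicate if the status field is a better signal.

```python
import time
from typing import Any, Callable


def pages_done(status: Any) -> bool:
    """Assumed completion check: every known page has been processed."""
    return bool(status.total_pages) and status.completed_pages >= status.total_pages


def wait_or_cancel(
    get_status: Callable[[], Any],
    cancel: Callable[[], object],
    is_done: Callable[[Any], bool] = pages_done,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_status() until is_done is true; cancel the job on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            # Stop the crawl instead of leaving it running unattended.
            cancel()
            raise TimeoutError(f"crawl not finished after {timeout} seconds")
        sleep(interval)
```

With the SDK, get_status could wrap get_crawl_status_v1_crawl_job_id_get and cancel could wrap cancel_crawl_v1_crawl_job_id_delete(accepted.job_id).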

URL Mapping

Discover all URLs under a base URL by combining sitemap parsing with HTML link extraction. Returns synchronously.

from geonode_scraper_sdk import ApiClient, Configuration, MapApi, MapRequest

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = MapApi(api_client)

    result = api.map_urls_v1_map_post(MapRequest(url="https://example.com"))
    for link in result.links:
        print(link.url, link.source)
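
Because each link carries both a URL and the source it was discovered from, the result can be grouped by source. This is a small sketch: group_urls_by_source is a hypothetical helper that only relies on the url and source attributes used above, and the actual source values returned by the API are not assumed.

```python
from collections import defaultdict
from typing import Any, Iterable


def group_urls_by_source(links: Iterable[Any]) -> dict[str, list[str]]:
    """Group link URLs by the source that discovered them."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for link in links:
        # Accept either a plain string or an enum-like object with a .value.
        source = getattr(link.source, "value", link.source)
        grouped[str(source)].append(link.url)
    return dict(grouped)
```

With the SDK, the call would be group_urls_by_source(result.links).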

Error Handling

Non-2xx responses raise ApiException or one of its subclasses. The exception includes the HTTP status, response body, and any deserialized error model in exc.data.

from geonode_scraper_sdk import ApiClient, ApiException, Configuration, ExtractionApi, ExtractRequest

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    try:
        api.extract_v1_extract_post(ExtractRequest(url="https://example.com"))
    except ApiException as exc:
        print(exc.status)
        print(exc.body)
        print(exc.data)
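
Since the exception exposes the HTTP status, transient failures can be retried with a growing delay. The following is a sketch, not SDK behavior: call_with_retries is a hypothetical helper, and the set of retryable status codes (429 and common 5xx codes) is an assumption about which errors are worth retrying. It inspects a status attribute on any exception so that it runs without the SDK; in real use, catching ApiException specifically would be tighter.

```python
import time
from typing import Callable, Collection, TypeVar

T = TypeVar("T")

# Assumed set of transient HTTP statuses; adjust to what the API actually returns.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retryable: Collection[int] = RETRYABLE_STATUSES,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying with exponential backoff on retryable statuses."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            status = getattr(exc, "status", None)
            last_attempt = attempt == attempts - 1
            if status not in retryable or last_attempt:
                raise
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With the SDK, the call could be call_with_retries(lambda: api.extract_v1_extract_post(ExtractRequest(url="https://example.com"))).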

Request Options

ExtractRequest supports the following fields:

  • formats: output formats to return; defaults to [OutputFormat.HTML]
  • render_js: use a headless browser for JavaScript-rendered pages; defaults to False
  • processing_mode: ProcessingMode.SYNC or ProcessingMode.ASYNC; defaults to ProcessingMode.SYNC
  • extract_links: extract all links found on the page; defaults to False
  • proxy: optional ProxySettings for country and proxy type selection
  • headers: optional request headers dictionary
  • wait_config: optional WaitConfig for explicit browser wait policy (wait_until, wait_for, wait_timeout)

Example with additional options:

from geonode_scraper_sdk import ExtractRequest, OutputFormat, ProcessingMode, ProxySettings, ProxyType, WaitConfig, WaitUntil

request = ExtractRequest(
    url="https://example.com",
    formats=[OutputFormat.HTML, OutputFormat.MARKDOWN],
    render_js=True,
    processing_mode=ProcessingMode.SYNC,
    extract_links=True,
    proxy=ProxySettings(country="US", type=ProxyType.RESIDENTIAL),
    headers={"User-Agent": "geonode-scraper-sdk-demo"},
    wait_config=WaitConfig(
        wait_until=WaitUntil.NETWORKIDLE,
        wait_for="#content",
        wait_timeout=2000,
    ),
)

API Reference

ExtractionApi (/v1/extract)

  • extract_v1_extract_post(extract_request)
  • get_job_result_v1_extract_job_id_get(job_id)
  • list_jobs_v1_extract_jobs_get(job_id, url, status, output, start_date, end_date, page, page_size)

BatchApi (/v1/batch)

  • create_batch_v1_batch_post(batch_request)
  • get_batch_status_v1_batch_job_id_get(job_id, page, page_size)
  • cancel_batch_v1_batch_job_id_delete(job_id)

CrawlApi (/v1/crawl)

  • create_crawl_v1_crawl_post(crawl_request)
  • get_crawl_status_v1_crawl_job_id_get(job_id, page, page_size)
  • cancel_crawl_v1_crawl_job_id_delete(job_id)

MapApi (/v1/map)

  • map_urls_v1_map_post(map_request)

StatisticsApi (/v1/statistics)

  • get_statistics_v1_statistics_get(start_date, end_date)

SystemApi (/health)

  • health_check_health_get()

WebhooksApi (/v1/webhooks)

  • list_webhooks_v1_webhooks_get(page, page_size)
  • create_webhook_v1_webhooks_post(webhook_create)
  • get_webhook_v1_webhooks_webhook_id_get(webhook_id)
  • update_webhook_v1_webhooks_webhook_id_patch(webhook_id, webhook_update)
  • delete_webhook_v1_webhooks_webhook_id_delete(webhook_id)
  • list_deliveries_v1_webhooks_webhook_id_deliveries_get(webhook_id, page, page_size, status)
  • rotate_secret_v1_webhooks_webhook_id_rotate_secret_post(webhook_id)

Advanced Usage

Each generated API method also exposes:

  • *_with_http_info() to get the deserialized payload together with status and headers
  • *_without_preload_content() to work with the raw HTTP response directly
