Geonode Scraper SDK

Python SDK for the Geonode Scraper API. Supports single-URL extraction, batch extraction, site crawling, URL mapping, job polling, and usage statistics.

Requirements

  • Python 3.10+

Installation

pip install geonode-scraper-sdk

Configuration and Authentication

Create a client configuration with your API base URL and API key.

from geonode_scraper_sdk import Configuration

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

If you do not set host, the generated client defaults to http://localhost.
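
To keep the API key out of source code, the settings can be read from the environment instead. The following is a minimal sketch, not part of the SDK: the helper load_settings and the variable names GEONODE_SCRAPER_HOST and GEONODE_SCRAPER_API_KEY are hypothetical and chosen only for illustration.

```python
import os


def load_settings(env=None):
    """Build keyword arguments for Configuration from environment variables.

    GEONODE_SCRAPER_HOST and GEONODE_SCRAPER_API_KEY are hypothetical names
    used by this sketch; the SDK itself does not read them.
    """
    env = os.environ if env is None else env
    api_key = env.get("GEONODE_SCRAPER_API_KEY")
    if not api_key:
        raise RuntimeError("GEONODE_SCRAPER_API_KEY is not set")
    settings = {"api_key": {"APIKeyHeader": api_key}}
    host = env.get("GEONODE_SCRAPER_HOST")
    if host:
        # Pass host only when it is provided; otherwise the client default applies.
        settings["host"] = host.rstrip("/")
    return settings
```

The result can then be unpacked into the client configuration, for example Configuration(**load_settings()).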

Quick Start

Synchronous extraction blocks until the result is ready.

from geonode_scraper_sdk import (
    ApiClient,
    ApiException,
    Configuration,
    ExtractRequest,
    ExtractionApi,
    OutputFormat,
    ProcessingMode,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    try:
        response = api.extract_v1_extract_post(
            ExtractRequest(
                url="https://example.com",
                formats=[OutputFormat.MARKDOWN],
                processing_mode=ProcessingMode.SYNC,
            )
        )
        print(response.data.markdown)
        print(response.tokens_charged)
    except ApiException as exc:
        print(exc.status)
        print(exc.body)

Async Extraction Workflow

When processing_mode=ProcessingMode.ASYNC, the extract call returns an async job response with a job ID and status URL.

from geonode_scraper_sdk import (
    ApiClient,
    Configuration,
    ExtractRequest,
    ExtractionApi,
    OutputFormat,
    ProcessingMode,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    # Request Markdown explicitly; formats defaults to HTML otherwise.
    submit = api.extract_v1_extract_post(
        ExtractRequest(
            url="https://example.com",
            formats=[OutputFormat.MARKDOWN],
            processing_mode=ProcessingMode.ASYNC,
        )
    )

    # A single fetch; the job may still be running right after submission.
    job = api.get_job_result_v1_extract_job_id_get(submit.job_id)
    print(job.status)
    if job.data and job.data.markdown:
        print(job.data.markdown)

Use get_job_result_v1_extract_job_id_get(job_id) to poll a single job, or list_jobs_v1_extract_jobs_get(...) to inspect and filter job history.
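
Polling usually means repeating the status call until the job reaches a final state. The helper below is a rough sketch of that loop and is not part of the SDK: wait_for_job is a hypothetical name, and the terminal status values "completed" and "failed" are assumptions that should be checked against the statuses your API actually returns.

```python
import time


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=120.0,
                 terminal=("completed", "failed"), sleep=time.sleep):
    """Call fetch_job(job_id) until the job status is terminal or time runs out.

    The default terminal status names are assumptions made for this sketch.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        # Accept both plain strings and enum members that carry a .value.
        status = getattr(job.status, "value", job.status)
        if status in terminal:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

With the client from the example above, this could be called as wait_for_job(api.get_job_result_v1_extract_job_id_get, submit.job_id). Passing the fetch function in keeps the loop independent of any particular endpoint.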

Batch Extraction

Submit multiple URLs in one request and poll for results.

from geonode_scraper_sdk import ApiClient, BatchApi, BatchRequest, Configuration, OutputFormat

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = BatchApi(api_client)

    accepted = api.create_batch_v1_batch_post(
        BatchRequest(
            urls=["https://example.com", "https://example.org"],
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.accepted_urls)

    status = api.get_batch_status_v1_batch_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_urls, status.total_urls)
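
Before building a BatchRequest, it can help to tidy the URL list on the client side. The sketch below is one possible approach and is not part of the SDK; clean_urls is a hypothetical helper, and whether the API itself rejects or deduplicates such entries is not stated here.

```python
from urllib.parse import urlsplit


def clean_urls(urls):
    """Strip whitespace, drop non-HTTP(S) entries, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        # Keep only absolute http/https URLs that include a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned
```

The cleaned list can be passed as urls=clean_urls(candidates) when constructing the request.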

Site Crawling

Crawl a website from a seed URL up to a configurable depth and page limit.

from geonode_scraper_sdk import ApiClient, Configuration, CrawlApi, CrawlRequest, OutputFormat

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = CrawlApi(api_client)

    accepted = api.create_crawl_v1_crawl_post(
        CrawlRequest(
            url="https://example.com",
            depth=2,
            limit=50,
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.estimated_pages)

    status = api.get_crawl_status_v1_crawl_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_pages, status.total_pages)
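
The status call takes page and page_size, so reading every result means requesting several pages. The sketch below computes which page numbers to ask for. It assumes pages are 1-based (as page=1 in the example suggests) and that total_pages counts crawled pages rather than result pages; result_pages is a hypothetical helper, not an SDK function.

```python
import math


def result_pages(total_items, page_size):
    """Return the 1-based page numbers needed to cover total_items results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return range(1, math.ceil(total_items / page_size) + 1)
```

For instance, iterating over result_pages(status.total_pages, 10) and passing each value as page to get_crawl_status_v1_crawl_job_id_get would walk the full result set under those assumptions.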

URL Mapping

Discover all URLs under a base URL by combining sitemap parsing with HTML link extraction. Returns synchronously.

from geonode_scraper_sdk import ApiClient, Configuration, MapApi, MapRequest

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = MapApi(api_client)

    result = api.map_urls_v1_map_post(MapRequest(url="https://example.com"))
    for link in result.links:
        print(link.url, link.source)
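
Because each link carries both url and source, the results can be grouped by how they were discovered. This is a small sketch under that assumption; group_by_source is a hypothetical helper, and the exact source values depend on what the API returns.

```python
from collections import defaultdict


def group_by_source(links):
    """Map each link.source value to the list of URLs found through it."""
    grouped = defaultdict(list)
    for link in links:
        grouped[link.source].append(link.url)
    # Return a plain dict so missing keys raise KeyError instead of being created.
    return dict(grouped)
```

Calling group_by_source(result.links) on the response above would give one list of URLs per discovery source.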

Error Handling

Non-2xx responses raise ApiException or one of its subclasses. The exception includes the HTTP status, response body, and any deserialized error model in exc.data.

from geonode_scraper_sdk import ApiClient, ApiException, Configuration, ExtractionApi, ExtractRequest

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    try:
        api.extract_v1_extract_post(ExtractRequest(url="https://example.com"))
    except ApiException as exc:
        print(exc.status)
        print(exc.body)
        print(exc.data)
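
Some failures are transient and worth retrying. The sketch below retries a call with exponential backoff based on the exception's status attribute. It is not part of the SDK: call_with_retry is a hypothetical helper, and the set of retryable status codes is an assumption. It catches Exception so that it runs without the SDK installed; in real code, catching ApiException would be more precise.

```python
import time


def call_with_retry(call, retry_statuses=(429, 500, 502, 503, 504),
                    attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); retry with exponential backoff on selected HTTP statuses."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status", None)
            last_try = attempt == attempts - 1
            # Re-raise errors that are not retryable, or when out of attempts.
            if status not in retry_statuses or last_try:
                raise
            sleep(base_delay * 2 ** attempt)
```

A call such as call_with_retry(lambda: api.extract_v1_extract_post(request)) would then wait 1 second and 2 seconds between the three attempts by default.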

Request Options

ExtractRequest supports the following fields:

  • formats: output formats to return; defaults to [OutputFormat.HTML]
  • render_js: use a headless browser for JavaScript-rendered pages; defaults to False
  • processing_mode: ProcessingMode.SYNC or ProcessingMode.ASYNC; defaults to ProcessingMode.SYNC
  • proxy: optional ProxySettings for country and proxy type selection
  • headers: optional request headers dictionary
  • wait_config: optional WaitConfig for explicit browser wait policy (wait_until, wait_for, wait_timeout)

Example with additional options:

from geonode_scraper_sdk import ExtractRequest, OutputFormat, ProcessingMode, ProxySettings, ProxyType, WaitConfig, WaitUntil

request = ExtractRequest(
    url="https://example.com",
    formats=[OutputFormat.HTML, OutputFormat.MARKDOWN],
    render_js=True,
    processing_mode=ProcessingMode.SYNC,
    proxy=ProxySettings(country="US", type=ProxyType.RESIDENTIAL),
    headers={"User-Agent": "geonode-scraper-sdk-demo"},
    wait_config=WaitConfig(
        wait_until=WaitUntil.NETWORKIDLE,
        wait_for="#content",
        wait_timeout=2000,
    ),
)

API Reference

ExtractionApi (/v1/extract)

  • extract_v1_extract_post(extract_request)
  • get_job_result_v1_extract_job_id_get(job_id)
  • list_jobs_v1_extract_jobs_get(job_id, url, status, output, start_date, end_date, page, page_size)

BatchApi (/v1/batch)

  • create_batch_v1_batch_post(batch_request)
  • get_batch_status_v1_batch_job_id_get(job_id, page, page_size)
  • cancel_batch_v1_batch_job_id_delete(job_id)
  • list_batch_jobs_v1_batch_jobs_get(status, start_date, end_date, page, page_size)

CrawlApi (/v1/crawl)

  • create_crawl_v1_crawl_post(crawl_request)
  • get_crawl_status_v1_crawl_job_id_get(job_id, page, page_size)
  • cancel_crawl_v1_crawl_job_id_delete(job_id)
  • list_crawl_jobs_v1_crawl_jobs_get(url, status, start_date, end_date, page, page_size)

MapApi (/v1/map)

  • map_urls_v1_map_post(map_request)
  • list_map_jobs_v1_map_jobs_get(url, status, start_date, end_date, page, page_size)
  • get_map_job_v1_map_job_id_get(job_id)

StatisticsApi (/v1/statistics)

  • get_statistics_v1_statistics_get(start_date, end_date)

SystemApi (/health)

  • health_check_health_get()

WebhooksApi (/v1/webhooks)

  • list_webhooks_v1_webhooks_get(page, page_size)
  • create_webhook_v1_webhooks_post(webhook_create)
  • get_webhook_v1_webhooks_webhook_id_get(webhook_id)
  • update_webhook_v1_webhooks_webhook_id_patch(webhook_id, webhook_update)
  • delete_webhook_v1_webhooks_webhook_id_delete(webhook_id)
  • list_deliveries_v1_webhooks_webhook_id_deliveries_get(webhook_id, page, page_size, status)
  • rotate_secret_v1_webhooks_webhook_id_rotate_secret_post(webhook_id)

Advanced Usage

Each generated API method also exposes:

  • *_with_http_info() to get the deserialized payload together with status and headers
  • *_without_preload_content() to work with the raw HTTP response directly
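
As a sketch of how the *_with_http_info() result might be inspected, the helper below summarizes a response object. It assumes the returned object exposes status_code, headers, and data, as recent OpenAPI Generator Python clients do; verify this against the installed SDK. describe_response is a hypothetical helper, not an SDK function.

```python
def describe_response(resp):
    """Summarize an object returned by a *_with_http_info() call.

    Assumes resp has status_code, headers, and data attributes.
    """
    # Lower-case header names so the lookup is case-insensitive.
    headers = {k.lower(): v for k, v in (resp.headers or {}).items()}
    return {
        "status_code": resp.status_code,
        "content_type": headers.get("content-type"),
        "data_type": type(resp.data).__name__,
    }
```

Under those assumptions, describe_response(api.extract_v1_extract_post_with_http_info(request)) would report the HTTP status alongside the type of the deserialized payload.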

