Python SDK for the Geonode Scraper API
Geonode Scraper SDK
Python SDK for the Geonode Scraper API. Supports single-URL extraction, batch extraction, site crawling, URL mapping, job polling, and usage statistics.
Requirements
- Python 3.10+
Installation
pip install geonode-scraper-sdk
Configuration And Authentication
Create a client configuration with your API base URL and API key.
from geonode_scraper_sdk import Configuration

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)
If you do not set host, the generated client defaults to http://localhost.
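To keep the API key out of source code, the host and key can be read from environment variables. The following is a minimal sketch, not part of the SDK: the helper name load_settings and the variable names GEONODE_SCRAPER_HOST and GEONODE_SCRAPER_API_KEY are assumptions chosen for illustration.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for Configuration from environment variables.

    The variable names are hypothetical; the SDK does not define them.
    """
    api_key = env.get("GEONODE_SCRAPER_API_KEY")
    if not api_key:
        raise RuntimeError("GEONODE_SCRAPER_API_KEY is not set")
    settings = {"api_key": {"APIKeyHeader": api_key}}
    host = env.get("GEONODE_SCRAPER_HOST")
    if host:
        # Only override host when one is provided; otherwise the
        # generated client's default applies.
        settings["host"] = host
    return settings
```

The result can then be unpacked into the constructor, for example Configuration(**load_settings()).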
Quick Start
Synchronous extraction blocks until the result is ready.
from geonode_scraper_sdk import (
    ApiClient,
    ApiException,
    Configuration,
    ExtractRequest,
    ExtractionApi,
    OutputFormat,
    ProcessingMode,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)
    try:
        response = api.extract_v1_extract_post(
            ExtractRequest(
                url="https://example.com",
                formats=[OutputFormat.MARKDOWN],
                processing_mode=ProcessingMode.SYNC,
            )
        )
        print(response.data.markdown)
        print(response.tokens_charged)
    except ApiException as exc:
        print(exc.status)
        print(exc.body)
Async Extraction Workflow
When processing_mode=ProcessingMode.ASYNC, the extract call returns an async
job response with a job ID and status URL.
from geonode_scraper_sdk import ApiClient, Configuration, ExtractRequest, ExtractionApi, ProcessingMode

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)
    submit = api.extract_v1_extract_post(
        ExtractRequest(
            url="https://example.com",
            processing_mode=ProcessingMode.ASYNC,
        )
    )
    job = api.get_job_result_v1_extract_job_id_get(submit.job_id)
    print(job.status)
    if job.data and job.data.markdown:
        print(job.data.markdown)
Use get_job_result_v1_extract_job_id_get(job_id) to poll a single job, or
list_jobs_v1_extract_jobs_get(...) to inspect and filter job history.
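A job fetched right after submission may not be finished yet, so polling usually means repeating the call until the status is final. The helper below is a sketch under assumptions: poll_job is a hypothetical name, it only relies on the response having a status attribute, and the set of final status values must be supplied by the caller because this README does not list them.

```python
import time
from typing import Callable, Collection, TypeVar

T = TypeVar("T")


def poll_job(
    fetch: Callable[[], T],
    terminal_statuses: Collection[str],
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until the job status is terminal or the timeout elapses.

    terminal_statuses must contain lowercase strings.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch()
        # Accept plain strings as well as enum-like values with a .value.
        status = str(getattr(job.status, "value", job.status)).lower()
        if status in terminal_statuses:
            return job
        # Give up if waiting one more interval would pass the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(interval)
```

With the client from the example above, a call might look like poll_job(lambda: api.get_job_result_v1_extract_job_id_get(submit.job_id), terminal_statuses={"completed", "failed"}), where "completed" and "failed" are placeholder values to be replaced with the statuses the API actually reports.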
Batch Extraction
Submit multiple URLs in one request and poll for results.
from geonode_scraper_sdk import ApiClient, BatchApi, BatchRequest, Configuration, OutputFormat

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = BatchApi(api_client)
    accepted = api.create_batch_v1_batch_post(
        BatchRequest(
            urls=["https://example.com", "https://example.org"],
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.accepted_urls)
    status = api.get_batch_status_v1_batch_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_urls, status.total_urls)
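Because the status call takes page and page_size, a batch larger than one page needs several calls. The sketch below walks every page; iter_status_pages is a hypothetical helper, and it assumes that pagination covers total_urls items at page_size items per page, which this README does not confirm.

```python
import math
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def iter_status_pages(
    fetch_page: Callable[[int, int], T],
    page_size: int = 10,
) -> Iterator[T]:
    """Yield one status response per page, starting at page 1."""
    first = fetch_page(1, page_size)
    yield first
    # Assumption: the number of pages follows from total_urls and page_size.
    last_page = max(1, math.ceil(first.total_urls / page_size))
    for page in range(2, last_page + 1):
        yield fetch_page(page, page_size)
```

With the client from the example above, fetch_page could be lambda page, size: api.get_batch_status_v1_batch_job_id_get(job_id=accepted.job_id, page=page, page_size=size).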
Site Crawling
Crawl a website from a seed URL up to a configurable depth and page limit.
from geonode_scraper_sdk import ApiClient, Configuration, CrawlApi, CrawlRequest, OutputFormat

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = CrawlApi(api_client)
    accepted = api.create_crawl_v1_crawl_post(
        CrawlRequest(
            url="https://example.com",
            depth=2,
            limit=50,
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.estimated_pages)
    status = api.get_crawl_status_v1_crawl_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_pages, status.total_pages)
URL Mapping
Discover all URLs under a base URL by combining sitemap parsing with HTML link extraction. Returns synchronously.
from geonode_scraper_sdk import ApiClient, Configuration, MapApi, MapRequest

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = MapApi(api_client)
    result = api.map_urls_v1_map_post(MapRequest(url="https://example.com"))
    for link in result.links:
        print(link.url, link.source)
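Since each link carries both a url and a source, the results can be grouped to see which discovery method found which URLs. This is a small sketch with a hypothetical helper name, group_by_source; it relies only on the url and source attributes used above and makes no assumption about which source values the API returns.

```python
from collections import defaultdict
from typing import Iterable


def group_by_source(links: Iterable) -> dict[str, list[str]]:
    """Map each source label to the list of URLs discovered through it."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for link in links:
        # Accept plain strings as well as enum-like values with a .value.
        source = str(getattr(link.source, "value", link.source))
        grouped[source].append(link.url)
    return dict(grouped)
```

Calling group_by_source(result.links) on the response above returns a plain dictionary, so the per-source counts are available as len(urls) for each entry.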
Error Handling
Non-2xx responses raise ApiException or one of its subclasses.
The exception includes the HTTP status, response body, and any deserialized
error model in exc.data.
from geonode_scraper_sdk import ApiClient, ApiException, Configuration, ExtractionApi, ExtractRequest

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)
    try:
        api.extract_v1_extract_post(ExtractRequest(url="https://example.com"))
    except ApiException as exc:
        print(exc.status)
        print(exc.body)
        print(exc.data)
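Because the exception exposes the HTTP status, transient failures can be retried while other errors propagate immediately. The sketch below is not part of the SDK: call_with_retries is a hypothetical helper, and the set of retryable status codes is an assumption about which failures are transient rather than documented API behavior.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these statuses are worth retrying; adjust to the API's behavior.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


def call_with_retries(
    call: Callable[[], T],
    retry_on: type[Exception],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff on retryable statuses."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on as exc:
            status = getattr(exc, "status", None)
            out_of_attempts = attempt == attempts - 1
            if status not in RETRYABLE_STATUSES or out_of_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

With the client from the example above, a call might look like call_with_retries(lambda: api.extract_v1_extract_post(ExtractRequest(url="https://example.com")), ApiException).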
Request Options
ExtractRequest supports the following fields:
- formats: output formats to return; defaults to [OutputFormat.HTML]
- render_js: use a headless browser for JavaScript-rendered pages; defaults to False
- processing_mode: ProcessingMode.SYNC or ProcessingMode.ASYNC; defaults to sync
- proxy: optional ProxySettings for country and proxy type selection
- headers: optional request headers dictionary
- wait_config: optional WaitConfig for explicit browser wait policy (wait_until, wait_for, wait_timeout)
Example with additional options:
from geonode_scraper_sdk import (
    ExtractRequest,
    OutputFormat,
    ProcessingMode,
    ProxySettings,
    ProxyType,
    WaitConfig,
    WaitUntil,
)

request = ExtractRequest(
    url="https://example.com",
    formats=[OutputFormat.HTML, OutputFormat.MARKDOWN],
    render_js=True,
    processing_mode=ProcessingMode.SYNC,
    proxy=ProxySettings(country="US", type=ProxyType.RESIDENTIAL),
    headers={"User-Agent": "geonode-scraper-sdk-demo"},
    wait_config=WaitConfig(
        wait_until=WaitUntil.NETWORKIDLE,
        wait_for="#content",
        wait_timeout=2000,
    ),
)
API Reference
ExtractionApi (/v1/extract)
- extract_v1_extract_post(extract_request)
- get_job_result_v1_extract_job_id_get(job_id)
- list_jobs_v1_extract_jobs_get(job_id, url, status, output, start_date, end_date, page, page_size)
BatchApi (/v1/batch)
- create_batch_v1_batch_post(batch_request)
- get_batch_status_v1_batch_job_id_get(job_id, page, page_size)
- cancel_batch_v1_batch_job_id_delete(job_id)
- list_batch_jobs_v1_batch_jobs_get(status, start_date, end_date, page, page_size)
CrawlApi (/v1/crawl)
- create_crawl_v1_crawl_post(crawl_request)
- get_crawl_status_v1_crawl_job_id_get(job_id, page, page_size)
- cancel_crawl_v1_crawl_job_id_delete(job_id)
- list_crawl_jobs_v1_crawl_jobs_get(url, status, start_date, end_date, page, page_size)
MapApi (/v1/map)
- map_urls_v1_map_post(map_request)
- list_map_jobs_v1_map_jobs_get(url, status, start_date, end_date, page, page_size)
- get_map_job_v1_map_job_id_get(job_id)
StatisticsApi (/v1/statistics)
- get_statistics_v1_statistics_get(start_date, end_date)
SystemApi (/health)
- health_check_health_get()
WebhooksApi (/v1/webhooks)
- list_webhooks_v1_webhooks_get(page, page_size)
- create_webhook_v1_webhooks_post(webhook_create)
- get_webhook_v1_webhooks_webhook_id_get(webhook_id)
- update_webhook_v1_webhooks_webhook_id_patch(webhook_id, webhook_update)
- delete_webhook_v1_webhooks_webhook_id_delete(webhook_id)
- list_deliveries_v1_webhooks_webhook_id_deliveries_get(webhook_id, page, page_size, status)
- rotate_secret_v1_webhooks_webhook_id_rotate_secret_post(webhook_id)
Advanced Usage
Each generated API method also exposes:
- *_with_http_info() to get the deserialized payload together with status and headers
- *_without_preload_content() to work with the raw HTTP response directly