
Spidra Python SDK

The official Python SDK for Spidra that allows you to scrape pages, run browser actions, batch-process URLs, and crawl entire sites. All results come back as structured data ready to feed into your LLM pipelines or store directly.

Installation

pip install spidra

Get your API key at app.spidra.io under Settings > API Keys.

Quick start

In a Python script

import asyncio
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

async def main():
    spidra = SpidraClient(api_key="spd_YOUR_API_KEY")

    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://news.ycombinator.com")],
        prompt="List the top 5 stories with title, points, and comment count",
        output="json",
    ))

    print(job.result.content)

asyncio.run(main())

In a Jupyter Notebook

Jupyter already runs its own event loop, so you cannot use asyncio.run(). Instead, await directly in a cell:

from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")

job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="List the top 5 stories with title, points, and comment count",
    output="json",
))

print(job.result.content)

Why? Jupyter and IPython run inside an existing asyncio event loop. Calling asyncio.run() tries to start a second loop while one is already running, which raises a RuntimeError. Bare await works because IPython compiles cells to allow top-level await and schedules them on the already-running loop.
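You can check this condition at runtime. The helper below is our own (not part of the SDK); it reports whether a loop is already running, which is exactly what makes asyncio.run() fail:

```python
import asyncio

def has_running_loop() -> bool:
    # asyncio.get_running_loop() raises RuntimeError when no loop is active.
    # In a plain script this returns False (use asyncio.run); in Jupyter it
    # returns True (use bare await instead).
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False
```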

Synchronous usage (no async/await)

Every async method has a _sync counterpart that works anywhere — normal scripts, Jupyter notebooks, Django views, Flask routes, etc.

from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")

# No async, no await — just call it
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="List the top 5 stories with title, points, and comment count",
    output="json",
))

print(job.result.content)

All sync methods:

| Resource | Sync methods |
| --- | --- |
| spidra.scrape | submit_sync(), get_sync(), run_sync() |
| spidra.batch | list_sync(), submit_sync(), get_sync(), retry_sync(), cancel_sync(), run_sync() |
| spidra.crawl | history_sync(), stats_sync(), submit_sync(), get_sync(), pages_sync(), extract_sync(), run_sync() |
| spidra.logs | list_sync(), get_sync() |
| spidra.usage | get_sync() |


Scraping

All scrape jobs run asynchronously. The run() method submits a job and polls until it finishes. If you need more control, use submit() and get() directly.

Up to 3 URLs can be passed per request and they are processed in parallel.

Basic scrape

import asyncio
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

async def main():
    spidra = SpidraClient(api_key="spd_YOUR_API_KEY")

    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com/pricing")],
        prompt="Extract all pricing plans with name, price, and included features",
        output="json",
    ))

    print(job.result.content)
    # { "plans": [{ "name": "Starter", "price": "$9/mo", "features": [...] }, ...] }

asyncio.run(main())

Structured output with JSON schema

When you need a guaranteed shape, pass a schema. The API will enforce the structure and return None for any missing fields rather than hallucinating values.

job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "company", "remote"],
        "properties": {
            "title":      { "type": "string" },
            "company":    { "type": "string" },
            "remote":     { "type": ["boolean", "null"] },
            "salary_min": { "type": ["number", "null"] },
            "salary_max": { "type": ["number", "null"] },
            "skills":     { "type": "array", "items": { "type": "string" } },
        },
    },
))
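Since nullable fields come back as None under a schema like this, downstream code should guard for them before formatting. A minimal sketch over a hand-written sample result (the dict below is illustrative, not real API output):

```python
# Illustrative result shaped like the schema above; the salary fields were
# missing on the page, so the API returned None for them.
listing = {
    "title": "Senior Engineer",
    "company": "Example Corp",
    "remote": True,
    "salary_min": None,
    "salary_max": None,
    "skills": ["python", "asyncio"],
}

# Guard nullable fields before using them.
if listing["salary_min"] is not None and listing["salary_max"] is not None:
    salary = f"{listing['salary_min']}-{listing['salary_max']}"
else:
    salary = "not listed"
```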

Geo-targeted scraping

Pass use_proxy=True and a proxy_country code to route the request through a specific country. Useful for geo-restricted content or localized pricing.

job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://www.amazon.de/gp/bestsellers")],
    prompt="List the top 10 products with name and price",
    use_proxy=True,
    proxy_country="de",
))

Supported country codes include: us, gb, de, fr, jp, au, ca, br, in, nl, sg, es, it, mx, and 40+ more. Use "global" or "eu" for regional routing.

Authenticated pages

Pass cookies as a string to scrape pages that require a login session.

job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://app.example.com/dashboard")],
    prompt="Extract the monthly revenue and active user count",
    cookies="session=abc123; auth_token=xyz789",
))
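If your session lives in a dict (exported from a browser extension or a requests.Session, say), a small helper can produce the expected "name=value; name=value" string. Note that cookies_from_dict is our own helper, not an SDK function:

```python
def cookies_from_dict(cookies: dict) -> str:
    # Join cookies into the "name=value; name=value" form that the
    # cookies parameter expects.
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For example, `cookies_from_dict({"session": "abc123", "auth_token": "xyz789"})` produces the string used in the example above.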

Browser actions

Actions let you interact with the page before the scrape runs. They execute in order, and the scrape happens after all actions complete.

from spidra import BrowserAction

job = await spidra.scrape.run(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://example.com/products",
            actions=[
                BrowserAction(type="click", selector="#accept-cookies"),
                BrowserAction(type="wait", duration=1000),
                BrowserAction(type="scroll", to="80%"),
            ],
        ),
    ],
    prompt="Extract all product names and prices",
))

Available actions:

| Action | Required fields | Description |
| --- | --- | --- |
| click | selector or value | Click a button, link, or any element |
| type | selector, value | Type text into an input or textarea |
| check | selector or value | Check a checkbox |
| uncheck | selector or value | Uncheck a checkbox |
| wait | duration (ms) | Pause execution for a set number of milliseconds |
| scroll | to (0–100%) | Scroll the page to a percentage of its height |
| forEach | observe | Loop over every matched element and process each one |

For selector, use a CSS selector or XPath. For value, use a plain English description and Spidra will locate the element using AI.

# CSS selector
BrowserAction(type="click", selector="button[data-testid='submit']")

# Plain English
BrowserAction(type="click", value="Accept all cookies button")

# Type into a field
BrowserAction(type="type", selector="input[name='q']", value="wireless headphones")

# Wait for content to load
BrowserAction(type="wait", duration=2000)

# Scroll to bottom
BrowserAction(type="scroll", to="100%")

forEach: process every element on a page

forEach finds a set of elements on the page and processes each one individually. It is the right tool when you need to collect data from a list of items, paginate through multiple pages, or click into each item's detail page.

You don't need forEach if the data fits on a single page and is short — a plain prompt is simpler and works just as well.

Use forEach when:

  • The list spans multiple pages and you need pagination
  • You need to click into each item's detail page (navigate mode)
  • You have 20+ items and want per-item AI extraction to stay consistent (item_prompt)

inline mode

Read each element's content directly without navigating. Best for product cards, search results, table rows.

from spidra import BrowserAction

job = await spidra.scrape.run(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
            actions=[
                BrowserAction(
                    type="forEach",
                    observe="Find all book cards in the product grid",
                    mode="inline",
                    capture_selector="article.product_pod",
                    max_items=20,
                    item_prompt="Extract title, price, and star rating. Return as JSON: {title, price, star_rating}",
                ),
            ],
        ),
    ],
    prompt="Return a clean JSON array of all books",
    output="json",
))

navigate mode

Follow each element's link to its destination page and capture content there. Best for product listings where the full detail is only on the individual page.

BrowserAction(
    type="forEach",
    observe="Find all book title links in the product grid",
    mode="navigate",
    capture_selector="article.product_page",
    max_items=10,
    wait_after_click=800,
    item_prompt="Extract title, price, star rating, and availability. Return as JSON.",
)

click mode

Click each element, capture the content that appears (a modal, drawer, or expanded section), then move on. Best for hotel room cards, FAQ accordions, or any UI where clicking reveals hidden content.

BrowserAction(
    type="forEach",
    observe="Find all room type cards",
    mode="click",
    capture_selector="[role='dialog']",
    max_items=8,
    wait_after_click=1200,
    item_prompt="Extract room name, bed type, price per night, and amenities. Return as JSON.",
)

Pagination

After processing all elements on the current page, follow the next-page link and continue collecting.

from spidra import BrowserActionPagination

BrowserAction(
    type="forEach",
    observe="Find all book title links",
    mode="navigate",
    max_items=40,
    pagination=BrowserActionPagination(
        next_selector="li.next > a",
        max_pages=3,  # 3 additional pages beyond the first
    ),
)

max_items applies across all pages combined. The loop stops when you hit max_items, run out of pages, or reach max_pages.
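The interaction between max_items and max_pages can be sketched as a simple budget calculation. This is a model of the stopping rule described above, not SDK code:

```python
def collection_budget(items_per_page: int, max_items: int, max_pages: int) -> int:
    # max_pages counts pages *beyond* the first, so at most max_pages + 1
    # pages are visited; max_items caps the total across all of them.
    reachable = items_per_page * (max_pages + 1)
    return min(max_items, reachable)
```

With 20 items per page, max_items=40, and max_pages=3, the loop stops at 40 items partway through the second page; with only 5 items per page it runs out of pages at 20.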

Per-element actions

Run additional browser actions on each item after navigating or clicking into it, before the content is captured.

BrowserAction(
    type="forEach",
    observe="Find all book title links",
    mode="navigate",
    capture_selector="article.product_page",
    max_items=5,
    wait_after_click=1000,
    actions=[
        BrowserAction(type="scroll", to="50%"),
    ],
    item_prompt="Extract title, price, and full description. Return as JSON.",
)

item_prompt vs top-level prompt

Both are optional and serve different purposes.

| | item_prompt | prompt |
| --- | --- | --- |
| When it runs | During scraping, once per item | After all items are collected |
| What it sees | One item's content | All items combined |
| Output location | result.data[].markdown_content | result.content |

Manual job control

Use submit() and get() when you want to manage polling yourself, or fire-and-forget and check back later.

# Submit a job and get the job_id immediately
queued = await spidra.scrape.submit(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))

# Check status at any point
status = await spidra.scrape.get(queued.job_id)

if status.status == "completed":
    print(status.result.content)
elif status.status == "failed":
    print(status.error)

Job statuses: waiting, active, completed, failed.
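On top of submit() and get(), a hand-rolled polling loop looks like the sketch below. The interval and timeout values are our own choices, and wait_for_completion is not an SDK method:

```python
import asyncio

async def wait_for_completion(spidra, job_id, interval=2.0, timeout=120.0):
    # Poll scrape.get() until the job reaches a terminal status.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await spidra.scrape.get(job_id)
        if status.status in ("completed", "failed"):
            return status
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_id} still {status.status} after {timeout}s")
        await asyncio.sleep(interval)
```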

Poll options

scrape.run(), batch.run(), and crawl.run() accept an optional PollOptions argument to control polling behavior.

from spidra import PollOptions

job = await spidra.scrape.run(
    params,
    PollOptions(poll_interval=3.0, timeout=120.0),
)

Batch scraping

Submit up to 50 URLs in a single request. All URLs are processed in parallel. Each URL is a plain string.

from spidra import BatchScrapeParams

batch = await spidra.batch.run(BatchScrapeParams(
    urls=[
        "https://shop.example.com/product/1",
        "https://shop.example.com/product/2",
        "https://shop.example.com/product/3",
    ],
    prompt="Extract product name, price, and availability",
    output="json",
    use_proxy=True,
))

for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result)
    elif item.status == "failed":
        print(item.url, item.error)

Item statuses: pending, running, completed, failed.
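A quick way to see how a batch went is to tally items by status. This helper is our own, operating on the batch.items list shown above:

```python
from collections import Counter

def status_counts(items) -> dict:
    # Tally batch items by status, e.g. {"completed": 48, "failed": 2}.
    return dict(Counter(item.status for item in items))
```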

Retry failed items:

queued = await spidra.batch.submit(BatchScrapeParams(
    urls=["https://example.com/1", "https://example.com/2"],
    prompt="Extract the page title",
))

# Later, after checking status
result = await spidra.batch.get(queued.batch_id)
if result.failed_count > 0:
    await spidra.batch.retry(queued.batch_id)

Cancel a running batch:

response = await spidra.batch.cancel(batch_id)
print(f"Cancelled {response.cancelled_items} items, refunded {response.credits_refunded} credits")

List past batches:

from spidra import BatchListParams

response = await spidra.batch.list(BatchListParams(page=1, limit=20))

for job in response.jobs:
    print(job.uuid, job.status, f"{job.completed_count}/{job.total_urls}")

Crawling

Given a starting URL, Spidra discovers pages automatically according to your instruction and extracts structured data from each one.

from spidra import CrawlParams

job = await spidra.crawl.run(CrawlParams(
    base_url="https://competitor.com/blog",
    crawl_instruction="Find all blog posts published in 2024",
    transform_instruction="Extract the title, author, publish date, and a one-sentence summary",
    max_pages=30,
    use_proxy=True,
))

for page in job.result:
    print(page.url, page.data)

Submit without waiting:

queued = await spidra.crawl.submit(CrawlParams(
    base_url="https://example.com/docs",
    crawl_instruction="Find all documentation pages",
    transform_instruction="Extract the page title and main content summary",
    max_pages=50,
))

# Check status later
status = await spidra.crawl.get(queued.job_id)

Get signed download URLs for all crawled pages:

Each page includes html_url and markdown_url pointing to S3-signed URLs that expire after 1 hour.

response = await spidra.crawl.pages(job_id)

for page in response.pages:
    print(page.url, page.status)
    # Download raw HTML: page.html_url
    # Download markdown: page.markdown_url

Re-extract with a new instruction:

Runs a new AI transformation over an existing completed crawl without re-crawling any pages. Charges credits for the transformation only.

queued = await spidra.crawl.extract(source_job_id, "Extract only the product SKUs and prices as a CSV")

# Poll the new job manually
result = await spidra.crawl.get(queued.job_id)

Crawl history and stats:

from spidra import CrawlHistoryParams

response = await spidra.crawl.history(CrawlHistoryParams(page=1, limit=10))
stats = await spidra.crawl.stats()
print(f"Total crawls: {stats.total}")

Logs

Scrape logs are stored for every job that runs through the API.

from spidra import ScrapeLogsParams

# List logs with optional filters
response = await spidra.logs.list(ScrapeLogsParams(
    status="failed",
    search_term="amazon.com",
    channel="api",
    date_start="2024-01-01",
    date_end="2024-12-31",
    page=1,
    limit=20,
))

for log in response.logs:
    print(log.urls[0].get("url"), log.status, log.credits_used)

Get a single log with full extraction result:

log = await spidra.logs.get("log-uuid")
print(log.result_data)  # the full AI output for that job

Usage statistics

Returns credit and request usage broken down by day or week.

# Range options: "7d" | "30d" | "weekly"
rows = await spidra.usage.get("30d")

for row in rows:
    print(row.date, row.requests, row.credits, row.tokens)
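To get a single number out of the per-period rows, sum them client-side. total_credits is our own helper, not an SDK call:

```python
def total_credits(rows) -> int:
    # Sum per-period credit usage across all rows returned by usage.get().
    return sum(row.credits for row in rows)
```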

Error handling

Every API error raises a typed exception. Catch the specific class you care about or fall back to the base SpidraError.

from spidra import (
    SpidraClient,
    SpidraError,
    SpidraAuthenticationError,
    SpidraInsufficientCreditsError,
    SpidraRateLimitError,
    SpidraServerError,
    ScrapeParams,
    ScrapeUrl,
)

try:
    await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="...",
    ))
except SpidraAuthenticationError:
    # 401: API key is missing or invalid
    print("Check your API key")
except SpidraInsufficientCreditsError:
    # 403: Monthly credit limit reached
    print("Out of credits")
except SpidraRateLimitError:
    # 429: Too many requests
    print("Rate limited, back off and retry")
except SpidraServerError:
    # 500: Something went wrong on Spidra's side
    print("Server error, try again")
except SpidraError as e:
    # Any other API error
    print(f"{e.status}: {e.message}")

All error classes expose err.status (HTTP status code) and err.message.
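SpidraRateLimitError is a natural candidate for retry with exponential backoff. The sketch below is generic: retry counts and delays are our own choices, and the helper takes the exception class as a parameter so it is not tied to the SDK.

```python
import asyncio

async def with_backoff(make_call, retry_on=Exception, retries=4, base_delay=1.0):
    # Retry an async call with exponential backoff: base_delay, 2x, 4x, ...
    for attempt in range(retries):
        try:
            return await make_call()
        except retry_on:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))
```

Usage would look like `await with_backoff(lambda: spidra.scrape.run(params), retry_on=SpidraRateLimitError)`.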

Debugging

Enable debug logging to see every HTTP request, response, and retry attempt:

import logging

logging.basicConfig()  # attach a handler so DEBUG records are actually printed
logging.getLogger("spidra").setLevel(logging.DEBUG)

Sample output:

DEBUG:spidra:POST /scrape (attempt 1/4)
DEBUG:spidra:Response 200 in 1.23s

Context manager

Use SpidraClient as an async context manager to ensure the HTTP connection pool is properly closed.

async with SpidraClient(api_key="spd_YOUR_API_KEY") as spidra:
    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract the page title",
    ))
    print(job.result.content)

Requirements

License

MIT
