Retab official Python library

Retab Python SDK

Official Python SDK for Retab document extraction.

Installation

pip install retab

The client reads RETAB_API_KEY from the environment by default.
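
A minimal sketch of failing fast when the key is missing, before any client is constructed. The helper name `require_api_key` is hypothetical and not part of the SDK; only the variable name RETAB_API_KEY comes from the text above.

```python
import os


def require_api_key(env=None):
    """Return the Retab API key from the environment, or raise a clear error."""
    # `env` defaults to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("RETAB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "RETAB_API_KEY is not set; export it before creating the client."
        )
    return key
```

Calling this once at startup turns a missing key into an explicit error message instead of a failure on the first request.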

Quick Start

import os

from retab import Retab

client = Retab(api_key=os.environ["RETAB_API_KEY"])

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

result = client.documents.extract(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
)

print(result.data)
print(result.text)
print(result.likelihoods)
print(result.extraction_id)

documents.extract(...) returns a RetabParsedChatCompletion.

  • result.data is the parsed structured output
  • result.text is the raw JSON string
  • result.likelihoods mirrors the extracted structure with confidence signals
  • result.extraction_id can be used with the extractions API later
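
One way to use the confidence signals is to flag fields that need human review. The sketch below assumes result.likelihoods behaves like a nested dict or list whose leaves are numeric scores between 0 and 1; that shape is an assumption, and `low_confidence_paths` is a hypothetical helper, not an SDK function.

```python
def low_confidence_paths(likelihoods, threshold=0.8, prefix=""):
    """Return the paths of all leaves whose score is below `threshold`."""
    flagged = []
    if isinstance(likelihoods, dict):
        for key, value in likelihoods.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flagged.extend(low_confidence_paths(value, threshold, path))
    elif isinstance(likelihoods, list):
        for index, value in enumerate(likelihoods):
            flagged.extend(
                low_confidence_paths(value, threshold, f"{prefix}[{index}]")
            )
    elif isinstance(likelihoods, (int, float)) and not isinstance(likelihoods, bool):
        if likelihoods < threshold:
            flagged.append(prefix)
    return flagged


# Stand-in scores shaped like the Quick Start schema; not real SDK output.
sample_likelihoods = {
    "invoice_number": 0.99,
    "invoice_date": 0.42,
    "total_amount": 0.91,
}
print(low_confidence_paths(sample_likelihoods))  # ['invoice_date']
```

The threshold of 0.8 is arbitrary; a sensible value depends on the documents and on how costly a wrong field is.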

What extract Accepts

json_schema can be:

  • a Python dict
  • a path to a JSON schema file
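
To move from the dict form to the file form, the schema can be written to disk once and referenced by path afterwards. This is a small sketch using only the standard library; `save_schema` and the file name are illustrative and not part of the SDK.

```python
import json
from pathlib import Path


def save_schema(schema, path):
    """Write a JSON schema dict to `path` and return the path as a Path object."""
    path = Path(path)
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return path


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Example (writes into the current directory):
# schema_path = save_schema(invoice_schema, "invoice_schema.json")
```

The resulting file path can then be supplied as json_schema in place of the dict.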

document can be:

  • a local file path
  • a file-like object
  • a URL
  • MIMEData

Useful extraction options:

  • n_consensus: run multiple passes and reconcile the result
  • image_resolution_dpi: control image rendering quality for vision models
  • metadata: attach your own tags for later filtering
  • additional_messages: add extra instructions or context after the document content
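
As a rough illustration, the options above can be collected in one dict and passed as keyword arguments. The option names come from the list above; the values (3 passes, 150 DPI, the message text) are assumptions chosen for the example, not recommended defaults.

```python
# Option names are taken from the list above; every value here is illustrative.
extraction_options = {
    "n_consensus": 3,
    "image_resolution_dpi": 150,
    "metadata": {"batch_id": "march-2026"},
    "additional_messages": [
        {"role": "user", "content": "Amounts are in euros."},
    ],
}

# With a configured client, the dict would be unpacked into the call:
# result = client.documents.extract(
#     json_schema=invoice_schema,
#     document="invoice.pdf",
#     model="retab-micro",
#     **extraction_options,
# )
```

Keeping the options in one place makes it easy to reuse the same settings across a batch of documents.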

Async Extraction

import asyncio
import os

from retab import AsyncRetab


async def main() -> None:
    client = AsyncRetab(api_key=os.environ["RETAB_API_KEY"])

    # The client is an async context manager; it is closed on exit.
    async with client:
        result = await client.documents.extract(
            json_schema={
                "type": "object",
                "properties": {
                    "booking_reference": {"type": "string"},
                    "guest_name": {"type": "string"},
                },
            },
            document="booking-confirmation.pdf",
            model="retab-micro",
        )

    print(result.data)


if __name__ == "__main__":
    asyncio.run(main())

Streaming Extraction

extract_stream(...) yields partial RetabParsedChatCompletion objects as the JSON fills in.

from retab import Retab

client = Retab()

with client.documents.extract_stream(
    json_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
        },
    },
    document="invoice.pdf",
    model="retab-micro",
) as stream:
    for partial in stream:
        print(partial.data)

For async code, use an AsyncRetab client inside a coroutine:

async with client.documents.extract_stream(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
) as stream:
    async for partial in stream:
        print(partial.data)
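
When only the finished result matters but progress should still be shown, the loop can keep the most recent partial. The sketch below assumes the last partial yielded is the complete result, which the description above suggests but does not state. `consume_stream` is a hypothetical helper, and plain dicts stand in for the partial.data values.

```python
def consume_stream(partials, on_update=None):
    """Iterate over partial results and return the last one seen (or None)."""
    final = None
    for partial in partials:
        if on_update is not None:
            on_update(partial)  # e.g. refresh a progress display
        final = partial
    return final


# Stand-in dicts playing the role of successive partial.data values.
fake_partials = [
    {"invoice_number": None, "total_amount": None},
    {"invoice_number": "INV-001", "total_amount": None},
    {"invoice_number": "INV-001", "total_amount": 1250.0},
]
print(consume_stream(fake_partials))
```

With the synchronous client, the `stream` object from the `with` block would take the place of `fake_partials`.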

Adding Context with additional_messages

additional_messages takes chat-style message dicts, the same structure used in the SDK's tests: plain text messages, system or developer guidance, and multipart content.

result = client.documents.extract(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
    additional_messages=[
        {
            "role": "developer",
            "content": "Extract values exactly as written. Do not normalize vendor names.",
        },
        {
            "role": "user",
            "content": "Focus on invoice number, invoice date, and total amount due.",
        },
    ],
)

Working with Stored Extractions

Every extraction can be retrieved later through client.extractions.

result = client.documents.extract(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
    metadata={"batch_id": "march-2026"},
)

stored = client.extractions.get(result.extraction_id)
print(stored.predictions)

page_sources = client.extractions.sources(result.extraction_id)
print(page_sources.sources)

recent = client.extractions.list(limit=20, metadata={"batch_id": "march-2026"})
for item in recent.items:
    print(item.id, item.file.filename)

client.extractions.download(...) returns a pre-signed download URL for jsonl, csv, or xlsx exports.
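
Once the pre-signed URL is in hand as a plain string, the export can be saved with the standard library alone. This sketch does not call the SDK. `save_export` is a hypothetical helper, and how the URL is read from the download(...) return value is left open because it is not specified above.

```python
import shutil
import urllib.request
from pathlib import Path


def save_export(download_url, destination):
    """Stream the file at `download_url` to `destination` and return the path."""
    destination = Path(destination)
    # Copy in chunks so large exports are never held fully in memory.
    with urllib.request.urlopen(download_url) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

Pre-signed URLs typically expire, so the download is best done soon after the URL is issued.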

Workflows

The Python SDK also supports workflow discovery, execution, and step inspection.

from pathlib import Path

from retab import Retab

client = Retab()

workflow = client.workflows.get_entities("wf_abc123")
document_start_id = workflow.start_nodes[0].id

run = client.workflows.runs.create(
    workflow_id=workflow.workflow.id,
    documents={document_start_id: Path("invoice.pdf")},
)

run = client.workflows.runs.wait_for_completion(run.id, poll_interval_seconds=1.0)
run.raise_for_status()

print(run.output)

step = client.workflows.runs.steps.get(run.id, "extract-node-id")
print(step.extracted_data)

Useful workflow helpers:

  • client.workflows.get_entities(workflow_id) returns the workflow graph and exposes .start_nodes and .start_json_nodes
  • client.workflows.runs.wait_for_completion(run.id) polls until the run reaches completed, error, or cancelled
  • client.workflows.runs.steps.get(run.id, node_id) returns typed handle inputs and outputs
  • client.workflows.runs.steps.get_all(run) fetches step outputs for every node in one call
  • client.workflows.blocks.* and client.workflows.edges.* let you create or update workflow graphs from code
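
Conceptually, waiting for completion is a polling loop over the run status. The sketch below shows that pattern in isolation. It is not the SDK's implementation: the terminal statuses come from the list above, while the "pending" and "running" names, the timeout, and `wait_until_terminal` itself are assumptions made for illustration.

```python
import time

# Terminal statuses as listed above.
TERMINAL_STATUSES = {"completed", "error", "cancelled"}


def wait_until_terminal(fetch_status, poll_interval_seconds=1.0, timeout_seconds=60.0):
    """Call fetch_status() repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_interval_seconds)


# Simulated status source standing in for an API call.
statuses = iter(["pending", "running", "completed"])
print(wait_until_terminal(lambda: next(statuses), poll_interval_seconds=0.01))
```

A shorter poll interval detects completion sooner at the cost of more requests.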

Notes

  • n_consensus=1 is the fastest option
  • higher n_consensus usually improves robustness on noisy or ambiguous documents
  • if schema validation fails, result.choices[0].message.parsed may be None
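
Because the parsed value may be None, downstream code should guard for it. A minimal sketch, assuming the raw JSON string is still available when validation fails; `parsed_or_fallback` is a hypothetical helper, not an SDK function.

```python
import json


def parsed_or_fallback(parsed, raw_text):
    """Prefer the validated object; otherwise try to decode the raw JSON text."""
    if parsed is not None:
        return parsed
    try:
        return json.loads(raw_text)
    except (TypeError, json.JSONDecodeError):
        # Nothing usable: the caller decides whether to retry or skip the document.
        return None
```

It could be called as parsed_or_fallback(result.choices[0].message.parsed, result.text). Data recovered through the fallback has not passed schema validation and should be treated as unverified.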
