scanforge

Official Python SDK for the scan-forge OCR service — an on-premise, AI-powered drop-in replacement for ABBYY Recognition Server.

Installation

pip install scanforge

Requires Python 3.11+.

Quick Start

from scanforge import Client

client = Client(api_key="sf_live_...")

# Extract text from a PDF
result = client.ocr("invoice.pdf")
print(result.text)

# Detect barcodes
barcodes = client.barcodes("document.pdf")
for b in barcodes:
    print(b.value, b.type)

# Convert a scan to DOCX
client.convert("scan.png", output="result.docx")

API Reference

Client(api_key, base_url=...)

Creates a new client instance.

Parameter  Type  Required  Default
api_key    str   Yes
base_url   str   No        https://api.scanforge.tech

client = Client(
    api_key="sf_live_...",
    base_url="https://ocr.your-server.com",  # for self-hosted deployments
)

client.ocr(file_path, *, language=None, page_number=None, page_range=None, separate_pages=False, poll_interval=1.5, timeout=600)

Extracts text from a PDF or image file.

Internally, this submits an asynchronous OCR job and polls it until completion, so the call blocks until the result is ready or until timeout seconds elapse, at which point ScanForgeError is raised. For full control over the job lifecycle, use the low-level submit_ocr / get_ocr_job methods instead.

Parameters

Parameter       Type          Default  Description
file_path       str                    Path to input file (PDF, PNG, JPG, TIFF)
language        str | None    None     OCR language code; auto-detected server-side when omitted
page_number     int | None    None     Process a single page (0-indexed)
page_range      str | None    None     1-indexed inclusive page range, e.g. "3" or "1-5". Takes precedence over page_number
separate_pages  bool          False    Return each page separated by form-feed in text
poll_interval   float         1.5      Seconds between job-status polls
timeout         float         600      Max seconds to wait for the job before raising ScanForgeError
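
The two page selectors use different bases: page_number is 0-indexed, while page_range is 1-indexed and inclusive. The following is a minimal sketch of helpers that build a page_range string. The names format_page_range and page_number_to_range are hypothetical and not part of the SDK.

```python
from __future__ import annotations


def format_page_range(first: int, last: int | None = None) -> str:
    """Build a page_range string such as "3" or "1-5" (1-indexed, inclusive).

    Hypothetical helper; the SDK itself only takes the finished string.
    """
    if first < 1:
        raise ValueError("page_range is 1-indexed; first page must be >= 1")
    if last is None or last == first:
        return str(first)
    if last < first:
        raise ValueError("last page must not precede first page")
    return f"{first}-{last}"


def page_number_to_range(page_number: int) -> str:
    """Convert a 0-indexed page_number into the equivalent 1-indexed page_range."""
    if page_number < 0:
        raise ValueError("page_number must be >= 0")
    return str(page_number + 1)
```

The result can then be passed straight through, for example client.ocr("invoice.pdf", page_range=format_page_range(1, 5)).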

Returns OcrResult

@dataclass
class OcrResult:
    text: str
    pages: int
    metadata: dict[str, Any]

Example

result = client.ocr("invoice.pdf", language="eng")
print(result.text)    # extracted text
print(result.pages)   # number of pages processed
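
With separate_pages=True, the pages in text are separated by a form-feed character. A small sketch of splitting that text back into per-page strings follows. split_pages is a hypothetical helper, and it assumes that a single trailing form-feed (if the server emits one) does not denote an extra empty page; the source does not say either way.

```python
from __future__ import annotations

FORM_FEED = "\f"  # U+000C, the form-feed character


def split_pages(text: str) -> list[str]:
    """Split OCR text produced with separate_pages=True into one string per page.

    Assumption: one trailing form-feed is a terminator, not an empty last page.
    """
    pages = text.split(FORM_FEED)
    if len(pages) > 1 and pages[-1] == "":
        pages.pop()
    return pages
```

Typical use would be split_pages(client.ocr("invoice.pdf", separate_pages=True).text).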

client.barcodes(file_path, *, page_number=0)

Detects and decodes barcodes (1D and 2D) in a document.

Parameters

Parameter    Type  Default  Description
file_path    str            Path to input file
page_number  int   0        Page to scan (0 = all pages)

Returns list[BarcodeResult]

@dataclass
class BarcodeResult:
    value: str   # decoded barcode content
    type: str    # symbology e.g. 'EAN-13', 'QR-Code', 'CODE-128'
    page: int    # 1-indexed page number

Example

barcodes = client.barcodes("shipment.pdf")
for b in barcodes:
    print(b.value, b.type, b.page)
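
Because each BarcodeResult carries its page and symbology, the returned list is easy to post-process. The sketch below groups results by page and filters by type. Both function names are hypothetical, and the functions rely only on the value, type, and page attributes documented above.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Iterable


def group_by_page(barcodes: Iterable[Any]) -> dict[int, list[Any]]:
    """Group barcode results by their 1-indexed page attribute."""
    grouped: dict[int, list[Any]] = defaultdict(list)
    for barcode in barcodes:
        grouped[barcode.page].append(barcode)
    return dict(grouped)


def values_of_type(barcodes: Iterable[Any], symbology: str) -> list[str]:
    """Return the decoded values of all barcodes with the given symbology."""
    return [b.value for b in barcodes if b.type == symbology]
```

For example, values_of_type(client.barcodes("shipment.pdf"), "QR-Code") would keep only the QR code contents.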

client.convert(file_path, *, output)

Converts a PDF or image to an editable document format. The output format is determined by the extension of output (.docx → DOCX, .xlsx → XLSX).

Parameters

Parameter  Type  Default  Description
file_path  str            Path to input file
output     str            Destination path (.docx or .xlsx)

Returns None — the converted file is downloaded and written to output locally.

Example

# Convert to Word document
client.convert("scan.pdf", output="result.docx")

# Convert to Excel spreadsheet (preserves table structure)
client.convert("table.pdf", output="data.xlsx")
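
Since the output format is chosen from the file extension, it can be worth validating the destination path before uploading anything. The following is a sketch of such a check, not the SDK's own logic. conversion_format is a hypothetical name, and treating the extension case-insensitively is an assumption.

```python
from pathlib import Path

# Extension-to-format mapping as described for convert().
_FORMATS = {".docx": "DOCX", ".xlsx": "XLSX"}


def conversion_format(output: str) -> str:
    """Return "DOCX" or "XLSX" for a destination path, or raise ValueError.

    Assumption: extensions are matched case-insensitively.
    """
    suffix = Path(output).suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(
            f"unsupported output extension {suffix!r}; expected .docx or .xlsx"
        )
    return _FORMATS[suffix]
```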

Low-level asynchronous API

ocr() and convert() run on top of the asynchronous OCR backend: they submit a job and poll until it finishes. If you want to drive that lifecycle yourself — e.g. submit many files and poll later, or integrate with your own task queue — use the two low-level methods directly.

client.submit_ocr(file_path, *, fmt="TextUnicodeDefaults", language=None, page_number=None, page_range=None, separate_pages=False)

Uploads the file and enqueues an OCR job. Returns the raw response dict {"job_id": str, "status": "queued"}. Pass fmt="DOCX" or fmt="XLSX" for a conversion job.

client.get_ocr_job(job_id)

Fetches the current job state. Returns the raw job document:

{
    "job_id": str,
    "status": "queued" | "running" | "succeeded" | "failed",
    "created_at": str,
    "updated_at": str,
    "result": {...},   # present only when status == "succeeded"
    "error": str,      # present only when status == "failed"
}

Example

job = client.submit_ocr("invoice.pdf", page_range="1-5")
print(job["job_id"], job["status"])  # 'a1b2c3' 'queued'

# ...poll on your own schedule...
state = client.get_ocr_job(job["job_id"])
if state["status"] == "succeeded":
    print(state["result"]["text"])
elif state["status"] == "failed":
    print("failed:", state["error"])
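
To submit many files first and poll later, the two low-level methods can be wrapped roughly as follows. This is a sketch under stated assumptions: submit_many and wait_for_job are hypothetical helpers, client is any object exposing submit_ocr and get_ocr_job as documented above, and the built-in TimeoutError is raised here (rather than ScanForgeError) to keep the example free of SDK imports.

```python
from __future__ import annotations

import time
from typing import Any, Iterable

TERMINAL_STATES = ("succeeded", "failed")


def submit_many(client: Any, paths: Iterable[str], **options: Any) -> dict[str, str]:
    """Enqueue one OCR job per file and return a mapping of path -> job_id."""
    return {path: client.submit_ocr(path, **options)["job_id"] for path in paths}


def wait_for_job(
    client: Any,
    job_id: str,
    *,
    poll_interval: float = 1.5,
    timeout: float = 600,
) -> dict[str, Any]:
    """Poll a job until it succeeds or fails, then return the raw job document."""
    deadline = time.monotonic() + timeout
    while True:
        state = client.get_ocr_job(job_id)
        if state["status"] in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {state['status']!r} after {timeout} seconds"
            )
        time.sleep(poll_interval)
```

A caller would typically run submit_many once, then call wait_for_job for each returned job_id and inspect state["result"] or state["error"].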

Error Handling

All methods raise ScanForgeError on failure.

from scanforge import Client, ScanForgeError

client = Client(api_key="sf_live_...")

try:
    result = client.ocr("document.pdf")
except ScanForgeError as e:
    print(e)              # human-readable message
    print(e.status_code)  # HTTP status code (int or None for network errors)
    print(e.body)         # raw response body from the server

Error condition               status_code
Invalid API key               401
Unsupported file type         422
Server error                  5xx
Network / connection failure  None
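
The status codes above suggest a simple split: 5xx responses and network failures may be transient, whereas 401 and 422 will not succeed on retry. One possible retry wrapper is sketched below. The retry policy is an illustration rather than SDK behavior, is_retryable and with_retries are hypothetical names, and the exception class is passed in (for example error_type=ScanForgeError) so the block stays self-contained.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def is_retryable(status_code: int | None) -> bool:
    """True for network failures (None) and 5xx server errors."""
    return status_code is None or 500 <= status_code <= 599


def with_retries(
    call: Callable[[], T],
    *,
    error_type: type[Exception],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run call(), retrying transient failures with exponential backoff.

    error_type is expected to expose a status_code attribute.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except error_type as exc:
            out_of_attempts = attempt == attempts - 1
            if out_of_attempts or not is_retryable(exc.status_code):
                raise
            time.sleep(base_delay * 2**attempt)  # 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")
```

Usage might look like with_retries(lambda: client.ocr("document.pdf"), error_type=ScanForgeError).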

Configuration

Self-hosted deployment

Point the client at your own scan-forge server:

client = Client(
    api_key="sf_live_...",
    base_url="https://ocr.internal.example.com",
)

Environment variables (recommended)

import os
from scanforge import Client

client = Client(
    api_key=os.environ["SCANFORGE_API_KEY"],
    base_url=os.environ.get("SCANFORGE_URL", "http://localhost:8000"),
)
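
If the API key is missing, os.environ["SCANFORGE_API_KEY"] fails with a bare KeyError. A small loader can give a clearer message and keep the settings testable. This is a sketch: client_settings is a hypothetical helper, and the fallback URL used here is the documented base_url default rather than the localhost address from the example above.

```python
from __future__ import annotations

import os
from typing import Mapping

DEFAULT_BASE_URL = "https://api.scanforge.tech"  # documented Client default


def client_settings(env: Mapping[str, str] | None = None) -> dict[str, str]:
    """Read api_key and base_url from environment variables.

    Raises RuntimeError with a readable message when the key is absent.
    """
    env = os.environ if env is None else env
    api_key = env.get("SCANFORGE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("SCANFORGE_API_KEY is not set")
    base_url = env.get("SCANFORGE_URL", DEFAULT_BASE_URL)
    return {"api_key": api_key, "base_url": base_url}
```

The returned dictionary matches the constructor's keyword arguments, so Client(**client_settings()) would build the client.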

Requirements

Python 3.11+

License

MIT © Moonforge
