scanforge
Official Python SDK for the scan-forge OCR service — an on-premise, AI-powered drop-in replacement for ABBYY Recognition Server.
Installation
pip install scanforge
Requires Python 3.11+.
Quick Start
from scanforge import Client
client = Client(api_key="sf_live_...")
# Extract text from a PDF
result = client.ocr("faktura.pdf")
print(result.text)
# Detect barcodes
barcodes = client.barcodes("dokument.pdf")
for b in barcodes:
    print(b.value, b.type)
# Convert a scan to DOCX
client.convert("skan.png", output="wynik.docx")
API Reference
Client(api_key, base_url=...)
Creates a new client instance.
| Parameter | Type | Required | Default |
|---|---|---|---|
| api_key | str | Yes | — |
| base_url | str | No | https://api.scanforge.tech |
client = Client(
    api_key="sf_live_...",
    base_url="https://ocr.your-server.com",  # for self-hosted deployments
)
client.ocr(file_path, *, language=None, page_number=None, page_range=None, separate_pages=False, poll_interval=1.5, timeout=600)
Extracts text from a PDF or image file.
Internally this submits an asynchronous OCR job and polls it to completion, so the call blocks until the result is ready (or timeout seconds elapse). For full control over the lifecycle use the low-level submit_ocr / get_ocr_job methods instead.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str | — | Path to input file (PDF, PNG, JPG, TIFF) |
| language | str \| None | None | OCR language code; auto-detected server-side when omitted |
| page_number | int \| None | None | Process a single page (0-indexed) |
| page_range | str \| None | None | 1-indexed inclusive page range, e.g. "3" or "1-5". Takes precedence over page_number |
| separate_pages | bool | False | Return each page separated by form-feed in text |
| poll_interval | float | 1.5 | Seconds between job-status polls |
| timeout | float | 600 | Max seconds to wait for the job before raising ScanForgeError |
Returns OcrResult
@dataclass
class OcrResult:
    text: str
    pages: int
    metadata: dict[str, Any]
Example
result = client.ocr("invoice.pdf", language="eng")
print(result.text) # extracted text
print(result.pages) # number of pages processed
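When separate_pages=True, the pages in text are described as separated by a form feed. The following sketch splits such a string back into per-page chunks. split_pages is a hypothetical helper, not an SDK function, and it assumes the separator is the single character "\f"; whether the server also appends a trailing form feed is not stated, so the helper tolerates both cases.

```python
def split_pages(text: str) -> list[str]:
    """Split OCR text into pages on the form-feed character (assumed separator)."""
    pages = text.split("\f")
    # Drop one empty trailing chunk in case the text ends with a form feed.
    if len(pages) > 1 and pages[-1] == "":
        pages.pop()
    return pages


# Stand-in for result.text from client.ocr(..., separate_pages=True)
sample = "page one\fpage two\fpage three"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, page)
```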
client.barcodes(file_path, *, page_number=0)
Detects and decodes barcodes (1D and 2D) in a document.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str | — | Path to input file |
| page_number | int | 0 | Page to scan (0 = all pages) |
Returns list[BarcodeResult]
@dataclass
class BarcodeResult:
    value: str  # decoded barcode content
    type: str   # symbology e.g. 'EAN-13', 'QR-Code', 'CODE-128'
    page: int   # 1-indexed page number
Example
barcodes = client.barcodes("shipment.pdf")
for b in barcodes:
    print(b.value, b.type, b.page)
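Since every BarcodeResult carries its page, results from a multi-page scan can be grouped per page. The sketch below uses a local stand-in dataclass with the documented fields so that it runs without a server; group_by_page is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BarcodeResult:  # local stand-in mirroring the documented fields
    value: str
    type: str
    page: int


def group_by_page(barcodes):
    """Return {page: [values, ...]} with pages in ascending order."""
    grouped = defaultdict(list)
    for b in barcodes:
        grouped[b.page].append(b.value)
    return dict(sorted(grouped.items()))


# Illustrative sample data, not real service output.
sample = [
    BarcodeResult("5901234123457", "EAN-13", 2),
    BarcodeResult("https://example.com", "QR-Code", 1),
    BarcodeResult("ABC-123", "CODE-128", 2),
]
print(group_by_page(sample))
```

In real use, the list returned by client.barcodes(...) would take the place of sample.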
client.convert(file_path, *, output)
Converts a PDF or image to an editable document format. The output format is determined by the extension of output (.docx → DOCX, .xlsx → XLSX).
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str | — | Path to input file |
| output | str | — | Destination path (.docx or .xlsx) |
Returns None — the converted file is downloaded and written to output locally.
Example
# Convert to Word document
client.convert("scan.pdf", output="result.docx")
# Convert to Excel spreadsheet (preserves table structure)
client.convert("table.pdf", output="data.xlsx")
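Because the output format is derived from the file extension, a typo in output only surfaces once the call is made. A small pre-flight check can catch it earlier. output_format below is a hypothetical helper, not part of the SDK. It matches extensions case-insensitively, which is an assumption; the SDK's own handling of uppercase extensions is not documented here.

```python
from pathlib import Path

# Extensions named in the description above; anything else is rejected by this sketch.
SUPPORTED_OUTPUTS = {".docx": "DOCX", ".xlsx": "XLSX"}


def output_format(output: str) -> str:
    """Return the format implied by the output path, or raise ValueError."""
    suffix = Path(output).suffix.lower()
    try:
        return SUPPORTED_OUTPUTS[suffix]
    except KeyError:
        raise ValueError(f"unsupported output extension: {suffix!r}") from None


print(output_format("result.docx"))
print(output_format("data.XLSX"))
```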
Low-level asynchronous API
ocr() and convert() run on top of the asynchronous OCR backend: they submit a job and poll until it finishes. If you want to drive that lifecycle yourself — e.g. submit many files and poll later, or integrate with your own task queue — use the two low-level methods directly.
client.submit_ocr(file_path, *, fmt="TextUnicodeDefaults", language=None, page_number=None, page_range=None, separate_pages=False)
Uploads the file and enqueues an OCR job. Returns the raw response dict {"job_id": str, "status": "queued"}. Pass fmt="DOCX" or fmt="XLSX" for a conversion job.
client.get_ocr_job(job_id)
Fetches the current job state. Returns the raw job document:
{
    "job_id": str,
    "status": "queued" | "running" | "succeeded" | "failed",
    "created_at": str,
    "updated_at": str,
    "result": {...},  # present only when status == "succeeded"
    "error": str,     # present only when status == "failed"
}
Example
job = client.submit_ocr("invoice.pdf", page_range="1-5")
print(job["job_id"], job["status"]) # 'a1b2c3' 'queued'
# ...poll on your own schedule...
state = client.get_ocr_job(job["job_id"])
if state["status"] == "succeeded":
    print(state["result"]["text"])
elif state["status"] == "failed":
    print("failed:", state["error"])
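If you poll on your own schedule, a loop with a deadline is one way to do it. The sketch below is not the SDK's internal implementation: wait_for_job and FakeClient are hypothetical names, the loop relies only on the documented get_ocr_job return shape, and it raises TimeoutError rather than ScanForgeError. The fake client stands in for a real Client so the example runs without a server.

```python
import time


def wait_for_job(client, job_id, *, poll_interval=1.5, timeout=600.0, sleep=time.sleep):
    """Poll client.get_ocr_job(job_id) until the job succeeds or fails.

    Returns the final job document; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = client.get_ocr_job(job_id)
        if state["status"] in ("succeeded", "failed"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state['status']} after {timeout}s")
        sleep(poll_interval)


class FakeClient:
    """Stand-in for Client that replays a fixed sequence of statuses."""

    def __init__(self, states):
        self._states = iter(states)

    def get_ocr_job(self, job_id):
        return {"job_id": job_id, "status": next(self._states)}


fake = FakeClient(["queued", "running", "succeeded"])
final = wait_for_job(fake, "a1b2c3", poll_interval=0, sleep=lambda s: None)
print(final["status"])
```

To process a batch, one option is to call submit_ocr for every file first, collect the job IDs, and then run wait_for_job over that list.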
Error Handling
All methods raise ScanForgeError on failure.
from scanforge import Client, ScanForgeError
client = Client(api_key="sf_live_...")
try:
    result = client.ocr("document.pdf")
except ScanForgeError as e:
    print(e)              # human-readable message
    print(e.status_code)  # HTTP status code (int or None for network errors)
    print(e.body)         # raw response body from the server
| Error condition | status_code |
|---|---|
| Invalid API key | 401 |
| Unsupported file type | 422 |
| Server error | 5xx |
| Network / connection failure | None |
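The status codes in this table suggest a simple split between errors worth retrying (server errors and network failures) and errors that will not go away on their own (401, 422). The sketch below is one possible retry wrapper, not an SDK feature: is_retryable, call_with_retry and FakeError are hypothetical names, and the code assumes that repeating the failed call is safe for your workload.

```python
import time


def is_retryable(status_code):
    """Treat network failures (None) and 5xx responses as transient."""
    return status_code is None or 500 <= status_code <= 599


def call_with_retry(func, *, error_type, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); retry transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except error_type as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_retryable(getattr(exc, "status_code", None)):
                raise
            sleep(base_delay * 2 ** attempt)


class FakeError(Exception):
    """Stand-in for ScanForgeError so the sketch runs without the SDK."""

    def __init__(self, status_code):
        super().__init__(f"status {status_code}")
        self.status_code = status_code


calls = []


def flaky():
    # Fails twice with a 503, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise FakeError(503)
    return "ok"


print(call_with_retry(flaky, error_type=FakeError, sleep=lambda s: None))
```

With the real SDK this would look like call_with_retry(lambda: client.ocr("document.pdf"), error_type=ScanForgeError).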
Configuration
Self-hosted deployment
Point the client at your own scan-forge server:
client = Client(
    api_key="sf_live_...",
    base_url="https://ocr.internal.example.com",
)
Environment variables (recommended)
import os
from scanforge import Client
client = Client(
    api_key=os.environ["SCANFORGE_API_KEY"],
    base_url=os.environ.get("SCANFORGE_URL", "http://localhost:8000"),
)
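With os.environ["SCANFORGE_API_KEY"], a missing variable surfaces as a bare KeyError. The sketch below reads the same two variables but fails with a clearer message. load_settings is a hypothetical helper, not part of the SDK, and the key in the demo call is a placeholder.

```python
import os


def load_settings(env=os.environ):
    """Return (api_key, base_url) from the environment, failing early if the key is missing."""
    api_key = env.get("SCANFORGE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("SCANFORGE_API_KEY is not set")
    # Same fallback URL as in the snippet above.
    base_url = env.get("SCANFORGE_URL", "http://localhost:8000")
    return api_key, base_url


# Demo with an explicit mapping instead of the real environment.
print(load_settings({"SCANFORGE_API_KEY": "sf_live_example"}))
```

The returned pair can be passed straight to the constructor: Client(api_key=api_key, base_url=base_url).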
Requirements
- Python 3.11+
- A running scan-forge server — see deployment docs
License
MIT © Moonforge