Extract structured data from PDFs and images by passing a schema.

Datalab

State of the Art models for Document Intelligence


lift

lift extracts structured JSON from PDFs and images: you pass a JSON schema, and the 9B vision model returns a JSON object that matches it. Schema-constrained decoding guarantees valid output.

[Figure: lift extracting schema-aligned JSON from an invoice]

[Figure: extraction accuracy benchmark]

Try lift on Datalab

Our managed platform runs improved extraction with higher accuracy than the open weights, plus per-field verification, citations, and confidence scores.

If you have high-volume workloads, we offer a batch processing service that has processed 1B+ pages per week.

Get started with $20 in free credits per month: sign up (it takes under 30 seconds) or try lift in our public playground.

Commercial self-hosting requires a license — see Commercial usage. For on-prem licensing, contact us.

Features

  • Extract structured data from documents
  • Pass any JSON schema
  • Handles multi-page documents in a single pass, including values that span pages
  • Two inference modes: local (HuggingFace) and remote (vLLM server)
  • CLI for single files, inline schemas, or whole directories
  • Schema Studio: a Streamlit app to build, save, and test schemas against your documents

Quickstart

The easiest way to start is with the CLI tools:

pip install lift-pdf

# With vLLM (recommended, lightweight install)
lift_vllm
lift_extract input.pdf ./output --schema schema.json

# With HuggingFace (requires torch)
pip install "lift-pdf[hf]"  # quoted so shells such as zsh don't expand the brackets
lift_extract input.pdf ./output --schema schema.json --method hf

Benchmarks

Evaluated on a 225-document extraction benchmark (6–64 pages per document, ~11,000 scored fields) with adversarial cases planted throughout: cross-page values, exhaustive lists, fields that must be left null, near-miss distractors, multi-source aggregation. Scoring is deterministic exact-match against ground truth (numeric tolerance, normalized strings).
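
The scoring rule described above (exact match, with numeric tolerance and normalized strings) can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: the function names, the tolerance values, and the normalization (case folding plus whitespace collapsing) are all assumptions.

```python
import math
import re


def normalize_text(value: str) -> str:
    """Case-fold and collapse whitespace (a hypothetical normalization)."""
    return re.sub(r"\s+", " ", value).strip().casefold()


def field_matches(predicted, expected, rel_tol: float = 1e-6) -> bool:
    """Return True when a predicted field value counts as correct."""
    # A field that must be left null is only correct if the prediction is null too.
    if expected is None or predicted is None:
        return predicted is None and expected is None
    # Booleans are compared exactly; checked before numbers because bool subclasses int.
    if isinstance(expected, bool) or isinstance(predicted, bool):
        return predicted is expected
    # Numbers match within a small tolerance (the tolerance here is assumed).
    if isinstance(expected, (int, float)) and isinstance(predicted, (int, float)):
        return math.isclose(predicted, expected, rel_tol=rel_tol, abs_tol=1e-9)
    # Strings match after normalization.
    if isinstance(expected, str) and isinstance(predicted, str):
        return normalize_text(predicted) == normalize_text(expected)
    return predicted == expected
```

Because every branch is a pure comparison, the same prediction always receives the same score, which is what makes this style of scoring deterministic.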

All models receive the same rendered page images, and extract each document in a single pass.

Model                        Size  Field accuracy  Full-document accuracy  Median latency*  Features
Datalab API                  -     95.9%           44.4%                   30.8s            Citations + Verification
Gemini Flash 3.5             -     91.3%           40.0%                   28.1s            -
lift                         9B    90.2%           20.9%                   9.5s             -
Azure Content Understanding  -     83.4%           22.2%                   73.7s            Citations
NuExtract3                   4B    81.5%           8.4%                    8.3s             -
Qwen3.5-9B                   9B    76.32%          24.0%                   16.8s            -

* Per document, 8 concurrent requests. Local models (lift, Qwen3.5-9B, NuExtract3) served with vLLM on a single GPU; Gemini, Datalab, and Azure via API. Latency varies with hardware and load - treat as relative, not absolute.

[Figure: latency benchmark]

  • Field accuracy — fraction of individual schema fields extracted correctly.
  • Full-document accuracy — fraction of documents where every field is correct.
  • All models served with default/recommended settings from GitHub or HuggingFace.
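
The two accuracy metrics defined above can be computed from per-field correctness flags. The sketch below assumes field accuracy is pooled over all fields of all documents rather than averaged per document; the function name and the toy data are made up for illustration.

```python
def accuracy_metrics(documents: list[list[bool]]) -> tuple[float, float]:
    """documents[i][j] is True when field j of document i was extracted correctly."""
    total_fields = sum(len(doc) for doc in documents)
    correct_fields = sum(sum(doc) for doc in documents)
    # A document only counts when every one of its fields is correct.
    perfect_docs = sum(1 for doc in documents if all(doc))
    field_accuracy = correct_fields / total_fields if total_fields else 0.0
    full_document_accuracy = perfect_docs / len(documents) if documents else 0.0
    return field_accuracy, full_document_accuracy


# Toy data (made up): three documents, ten fields in total.
toy = [
    [True, True, True, True],
    [True, True, False],
    [True, False, True],
]
field_acc, doc_acc = accuracy_metrics(toy)
print(f"field accuracy: {field_acc:.1%}, full-document accuracy: {doc_acc:.1%}")
# field accuracy: 80.0%, full-document accuracy: 33.3%
```

A single wrong field fails the whole document, which is why full-document accuracy sits far below field accuracy for documents with many fields.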

Hosted models with verification, citations, and confidence scores are available via the Datalab API - test in the playground.

Installation

Package

# Base install (for vLLM backend)
pip install lift-pdf

# Extras are quoted so shells such as zsh don't expand the brackets.

# With HuggingFace backend (includes torch, transformers)
pip install "lift-pdf[hf]"

# With the Schema Studio app
pip install "lift-pdf[app]"

# With all extras
pip install "lift-pdf[all]"

If you're using the HuggingFace method, we also recommend installing flash attention for better performance.

From Source

git clone https://github.com/datalab-to/lift.git
cd lift
uv sync
source .venv/bin/activate

Usage

Schemas

A schema is standard JSON Schema. Keep it simple — string, number, integer, boolean, arrays of those, arrays of objects, and nested objects are all supported. Avoid enum, anyOf/oneOf, $ref, and additionalProperties; the schema-constrained decoder skips schemas it can't compile, which weakens the output guarantee.

{
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string", "description": "Invoice identifier"},
    "total": {"type": "number", "description": "Total amount due"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  },
  "required": ["invoice_number", "total"]
}

Write a description for any field whose name isn't self-explanatory. Mark a field required only when it must appear; fields genuinely absent from a document come back null.
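
Before running an extraction, it can help to check a schema for the keywords listed above as ones to avoid. The following is a minimal sketch of such a check; find_discouraged is a hypothetical helper and not part of lift, and it only walks properties and items.

```python
DISCOURAGED = {"enum", "anyOf", "oneOf", "$ref", "additionalProperties"}


def find_discouraged(schema: dict, path: str = "$") -> list[str]:
    """Return the paths of discouraged keywords found in a JSON Schema dict."""
    found = [f"{path}.{key}" for key in schema if key in DISCOURAGED]
    # Recurse into object properties; property names are data, not keywords.
    for name, sub in schema.get("properties", {}).items():
        if isinstance(sub, dict):
            found += find_discouraged(sub, f"{path}.properties.{name}")
    # Recurse into array items.
    items = schema.get("items")
    if isinstance(items, dict):
        found += find_discouraged(items, f"{path}.items")
    return found


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "status": {"type": "string", "enum": ["paid", "open"]},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "additionalProperties": False,
                      "properties": {"amount": {"type": "number"}}},
        },
    },
}
print(find_discouraged(schema))
# ['$.properties.status.enum', '$.properties.line_items.items.additionalProperties']
```

A field that happens to be named enum is not flagged, because only keyword positions are inspected.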

CLI

Process single files or entire directories:

# Single file, with the vLLM server (see below for how to launch it)
lift_extract input.pdf ./output --schema schema.json

# Inline JSON schema
lift_extract scans/ ./output --schema '{"type": "object", "properties": {...}}'

# A schema saved by name in the schemas/ directory, limited to some pages
lift_extract input.pdf ./output --schema invoice --page-range 0-5,8

# Process a whole directory with the local HuggingFace model
lift_extract ./documents ./output --schema schema.json --method hf
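
When calling the CLI from a script, building the argument list in Python avoids shell quoting problems with inline JSON schemas. This is a sketch only: build_lift_extract_args is a hypothetical helper, and it uses just the flags shown in this section.

```python
import json


def build_lift_extract_args(input_path, output_dir, schema, method=None, page_range=None):
    """Build an argv list for lift_extract using only the flags documented here."""
    # A dict is serialized to an inline JSON string; a str is passed through
    # unchanged (a schema file path or a saved schema name).
    schema_arg = json.dumps(schema) if isinstance(schema, dict) else str(schema)
    args = ["lift_extract", str(input_path), str(output_dir), "--schema", schema_arg]
    if method is not None:
        args += ["--method", method]
    if page_range is not None:
        args += ["--page-range", page_range]
    return args


args = build_lift_extract_args(
    "input.pdf", "./output",
    {"type": "object", "properties": {"total": {"type": "number"}}},
    page_range="0-5,8",
)
# With lift-pdf installed, run it with: subprocess.run(args, check=True)
print(args)
```

Passing the list to subprocess.run without shell=True hands each argument to the process as-is, so the JSON needs no escaping.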

CLI Options:

  • --schema TEXT (required): a path to a JSON schema file, an inline JSON string, or the name of a saved schema in the schema library.
  • --method [hf|vllm]: inference method (default: vllm).
  • --page-range TEXT: page range for PDFs, e.g. "0-5,7,9-12" (PDFs only).
  • --max-output-tokens INTEGER: maximum number of output tokens.
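
To make the --page-range format concrete, the sketch below expands a spec such as "0-5,7,9-12" into individual page indices. It assumes pages are zero-based and ranges are inclusive at both ends, which the examples suggest but do not state; parse_page_range is a hypothetical helper, not lift's own parser.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0-5,7,9-12" into a sorted list of page indices."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # assumed inclusive on both ends
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0-5,7,9-12"))
# [0, 1, 2, 3, 4, 5, 7, 9, 10, 11, 12]
```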

Output Structure:

For each processed file, lift_extract writes to the output directory:

  • <filename>.json — the extraction matching your schema
  • <filename>_metadata.json — page count, token count, and error info (with the raw model output when extraction fails, for debugging)
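
The two output files can be read back with a few lines of Python. The sketch assumes <filename> is the input file's name without its extension, and it treats either file as possibly missing, since whether the extraction file is written on failure is not stated here. load_result is a hypothetical helper.

```python
import json
from pathlib import Path


def load_result(output_dir, stem):
    """Return (extraction, metadata) for one processed file; either may be None."""
    out = Path(output_dir)

    def read_json(path: Path):
        # A missing file is reported as None rather than raising.
        return json.loads(path.read_text(encoding="utf-8")) if path.is_file() else None

    extraction = read_json(out / f"{stem}.json")
    metadata = read_json(out / f"{stem}_metadata.json")
    return extraction, metadata


extraction, metadata = load_result("./output", "input")
if extraction is None:
    # The metadata keys are not listed here, so print the whole object.
    print("no extraction; metadata:", metadata)
else:
    print(extraction)
```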

Python

from lift import extract

# schema: a dict, a path to a .json file, an inline JSON string, or a library name
result = extract("document.pdf", "schema.json")
if result.extraction is not None:
    data = result.extraction  # dict matching the schema

To load weights in-process once and reuse them across calls, create a model (from lift.model import InferenceManager; model = InferenceManager(method="hf")) and pass it to extract. Pass page_range="0-5" to limit PDF pages. Set VLLM_API_BASE to target a remote server.
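
The check on result.extraction shown above extends to several documents. The sketch below takes the extract function as a parameter so it runs without lift installed; extract_many is a hypothetical helper, and it relies only on the call shape and the extraction attribute shown in this section.

```python
from pathlib import Path


def extract_many(paths, schema, extract_fn):
    """Run extract_fn(path, schema) per document and split successes from failures."""
    successes, failures = {}, []
    for path in paths:
        result = extract_fn(str(path), schema)
        # As in the snippet above, a None extraction means no usable output.
        if result.extraction is not None:
            successes[Path(path).name] = result.extraction
        else:
            failures.append(Path(path).name)
    return successes, failures


# Usage with the real function (requires lift-pdf and a backend):
#     from lift import extract
#     successes, failures = extract_many(["a.pdf", "b.pdf"], "schema.json", extract)
```

Collecting failures instead of raising keeps one bad document from stopping a long run; the failed names can then be retried or inspected.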

Schema Studio

Launch the interactive app to build, save, and test extraction schemas against your documents (requires pip install lift-pdf[app]):

lift_app

vLLM Server

For production deployments or batch processing, launch the vLLM server:

lift_vllm                # defaults to H100 settings
lift_vllm --gpu a100-80  # tune batch settings for your GPU

This launches a Docker container with optimized inference settings, automatically scaling batch size to your GPU's VRAM. Supported GPUs: h100, a100-80, a100/a100-40, l40s, a10, l4, 4090, 3090, t4.

You can also start your own vLLM server with the datalab-to/lift model.

Configuration

Settings can be configured via environment variables or a local.env file:

# Model settings
MODEL_CHECKPOINT=datalab-to/lift
MAX_OUTPUT_TOKENS=12384
TORCH_DEVICE=cuda:0     # pin the HF backend to a device

# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL_NAME=lift
VLLM_GPUS=0
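
Before pointing lift at a server, it can be useful to confirm that the address in VLLM_API_BASE responds. The sketch below assumes the server exposes vLLM's standard OpenAI-compatible API, where GET <base>/models lists the served models; the helper names are hypothetical, and the fallback URL is the example value from the settings above rather than a confirmed default.

```python
import json
import os
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL of an OpenAI-compatible server."""
    return api_base.rstrip("/") + "/models"


def list_served_models(api_base: str, timeout: float = 5.0) -> list[str]:
    """Return the model ids the server reports; raises URLError if unreachable."""
    with urllib.request.urlopen(models_url(api_base), timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload.get("data", [])]


# Fall back to the example value shown above when the variable is unset.
base = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
print("would query:", models_url(base))
# With a server running: print(list_served_models(base))
```

If the returned ids do not include the name set in VLLM_MODEL_NAME, requests are likely to be rejected by the server, so this is a quick way to catch a mismatch.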

Commercial usage

The code is licensed under Apache 2.0, and the model weights use a modified OpenRAIL-M license: free for research, personal use, and startups under $5M in funding/revenue, but not for use that competes with our API. To remove the OpenRAIL license requirements, or for broader commercial licensing, see our pricing page.

Credits

Thank you to the following open source projects:
