Extract structured data from PDFs and images by passing a schema.

These details have been verified by PyPI

Maintainers

vikp

Project description

Datalab

State of the Art models for Document Intelligence

lift

lift extracts structured JSON from PDFs and images. You pass a JSON schema, and the 9B vision model returns a JSON object matching it; schema-constrained decoding guarantees valid output.

[Figure: lift extracting schema-aligned JSON from an invoice]

[Figure: extraction accuracy benchmark]

Try lift on Datalab

Our managed platform runs improved extraction with higher accuracy than the open weights, plus per-field verification, citations, and confidence scores.

If you have high-volume workloads, we offer a batch processing service that has processed 1B+ pages per week.

Get started with $20 in free credits per month: sign up (it takes under 30 seconds) or try lift in our public playground.

Commercial self-hosting requires a license — see Commercial usage. For on-prem licensing, contact us.

Features

Extract structured data from documents
Pass any JSON schema
Handles multi-page documents in a single pass, including values that span pages
Two inference modes: local (HuggingFace) and remote (vLLM server)
CLI for single files, inline schemas, or whole directories
Schema Studio: a Streamlit app to build, save, and test schemas against your documents

Quickstart

The easiest way to start is with the CLI tools:

pip install lift-pdf

# With vLLM (recommended, lightweight install)
lift_vllm                                              # launch the vLLM server (Docker container)
lift_extract input.pdf ./output --schema schema.json   # extract using the default vllm method

# With HuggingFace (requires torch)
pip install lift-pdf[hf]
lift_extract input.pdf ./output --schema schema.json --method hf

Benchmarks

Evaluated on a 225-document extraction benchmark (6–64 pages per document, ~11,000 scored fields) with adversarial cases planted throughout: cross-page values, exhaustive lists, fields that must be left null, near-miss distractors, multi-source aggregation. Scoring is deterministic exact-match against ground truth (numeric tolerance, normalized strings).

All models receive the same rendered page images and extract each document in a single pass.

Model	Size	Field accuracy	Full-document accuracy	Median latency*	Features
Datalab API	—	95.9%	44.4%	30.8s	Citations + Verification
Gemini Flash 3.5	—	91.3%	40.0%	28.1s
lift	9B	90.2%	20.9%	9.5s
Azure Content Understanding	—	83.4%	22.2%	73.7s	Citations
NuExtract3	4B	81.5%	8.4%	8.3s
Qwen3.5-9B	9B	76.32%	24.0%	16.8s

* Per document, with 8 concurrent requests. Local models (lift, Qwen3.5-9B, NuExtract3) were served with vLLM on a single GPU; Gemini, Datalab, and Azure were called via API. Latency varies with hardware and load, so treat these figures as relative, not absolute.

[Figure: latency benchmark]

Field accuracy — fraction of individual schema fields extracted correctly.
Full-document accuracy — fraction of documents where every field is correct.
All models are served with default/recommended settings from GitHub or HuggingFace.
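
The two accuracy metrics defined above can be illustrated with a short sketch. This is not the benchmark's scoring code: the normalization rule, the numeric tolerance of 1e-6, and the helper names (normalize, fields_match, score) are assumptions made for illustration, and fields are treated as a flat dict per document.

```python
import math


def normalize(value):
    """Hypothetical string normalization: trim, collapse whitespace, lowercase."""
    if isinstance(value, str):
        return " ".join(value.split()).lower()
    return value


def fields_match(predicted, expected, tol=1e-6):
    """Exact match after normalization, with an assumed tolerance for numbers."""
    # Booleans are ints in Python, so compare them by identity first.
    if isinstance(predicted, bool) or isinstance(expected, bool):
        return predicted is expected
    if isinstance(predicted, (int, float)) and isinstance(expected, (int, float)):
        return math.isclose(predicted, expected, rel_tol=0.0, abs_tol=tol)
    return normalize(predicted) == normalize(expected)


def score(documents):
    """documents: list of (predicted, expected) pairs of flat field dicts.

    Returns (field accuracy, full-document accuracy).
    """
    correct_fields = total_fields = perfect_docs = 0
    for predicted, expected in documents:
        hits = sum(
            fields_match(predicted.get(name), value)
            for name, value in expected.items()
        )
        correct_fields += hits
        total_fields += len(expected)
        perfect_docs += hits == len(expected)  # every field must be correct
    return correct_fields / total_fields, perfect_docs / len(documents)


# Toy data: the second document has one wrong field.
docs = [
    ({"invoice_number": " INV-001 ", "total": 120.0},
     {"invoice_number": "inv-001", "total": 120}),
    ({"invoice_number": "INV-002", "total": 99.0},
     {"invoice_number": "INV-002", "total": 100}),
]
field_accuracy, document_accuracy = score(docs)
print(field_accuracy, document_accuracy)
```

A single wrong field lowers field accuracy only slightly but fails the whole document, which is why the full-document column in the table sits far below the field column.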

Hosted models with verification, citations, and confidence scores are available via the Datalab API; test them in the playground.

Installation

Package

# Base install (for vLLM backend)
pip install lift-pdf

# With HuggingFace backend (includes torch, transformers)
pip install lift-pdf[hf]

# With the Schema Studio app
pip install lift-pdf[app]

# With all extras
pip install lift-pdf[all]

If you're using the HuggingFace method, we also recommend installing flash attention for better performance.

From Source

git clone https://github.com/datalab-to/lift.git
cd lift
uv sync
source .venv/bin/activate

Usage

Schemas

A schema is standard JSON Schema. Keep it simple: string, number, integer, boolean, arrays of those, arrays of objects, and nested objects are all supported. Avoid enum, anyOf/oneOf, $ref, and additionalProperties. The schema-constrained decoder skips schemas it can't compile, so using these keywords weakens the output guarantee.

{
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string", "description": "Invoice identifier"},
    "total": {"type": "number", "description": "Total amount due"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  },
  "required": ["invoice_number", "total"]
}

Write a description for any field whose name isn't self-explanatory. Mark a field required only when it must appear; fields genuinely absent from a document come back null.
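
Before running an extraction, it may help to check a schema for the keywords listed above as ones to avoid. The following is a minimal sketch, not part of lift: find_unsupported_keywords is a hypothetical helper, and it only looks for the five keywords named in this section.

```python
# Keywords this README recommends avoiding.
UNSUPPORTED = {"enum", "anyOf", "oneOf", "$ref", "additionalProperties"}


def find_unsupported_keywords(schema, path="$"):
    """Return the paths of discouraged keywords found anywhere in a schema."""
    hits = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            here = f"{path}.{key}"
            if key in UNSUPPORTED:
                hits.append(here)
            elif key == "properties" and isinstance(value, dict):
                # Keys under "properties" are field names, not schema keywords,
                # so a field that happens to be called "enum" is not flagged.
                for name, subschema in value.items():
                    hits.extend(find_unsupported_keywords(subschema, f"{here}.{name}"))
            else:
                hits.extend(find_unsupported_keywords(value, here))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            hits.extend(find_unsupported_keywords(item, f"{path}[{index}]"))
    return hits


example = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["paid", "due"]},
        "enum": {"type": "string"},  # a field name, deliberately not flagged
    },
}
problems = find_unsupported_keywords(example)
print(problems)
```

An empty result means none of the listed keywords were found; it does not prove that the decoder can compile the schema.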

CLI

Process single files or entire directories:

# Single file, with the vLLM server (see below for how to launch it)
lift_extract input.pdf ./output --schema schema.json

# Inline JSON schema
lift_extract scans/ ./output --schema '{"type": "object", "properties": {...}}'

# A schema saved by name in the schemas/ directory, limited to some pages
lift_extract input.pdf ./output --schema invoice --page-range 0-5,8

# Process a whole directory with the local HuggingFace model
lift_extract ./documents ./output --schema schema.json --method hf

CLI Options:

--schema TEXT (required): a path to a JSON schema file, an inline JSON string, or the name of a saved schema in the schema library.
--method [hf|vllm]: inference method (default: vllm).
--page-range TEXT: page range for PDFs, e.g. "0-5,7,9-12" (PDFs only).
--max-output-tokens INTEGER: maximum number of output tokens.
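
The --page-range format can be read as a comma-separated list of single pages and start-end ranges. The sketch below shows one way to expand such a string. It is an illustration, not lift's own parser: parse_page_range is a hypothetical name, and it assumes zero-based page numbers and inclusive range ends, neither of which is stated above.

```python
def parse_page_range(spec):
    """Expand a string like "0-5,7,9-12" into a sorted list of page numbers.

    Assumes zero-based pages and inclusive range ends (not confirmed by lift).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"range end before start: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


pages = parse_page_range("0-5,7,9-12")
print(pages)
```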

Output Structure:

For each processed file, lift_extract writes to the output directory:

<filename>.json — the extraction matching your schema
<filename>_metadata.json — page count, token count, and error info (with the raw model output when extraction fails, for debugging)
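
The two output files can be read back with the standard library. This is a minimal sketch under two assumptions: that <filename> is the input file's name without its extension, and that the extraction file may be missing when extraction fails. The metadata keys are not listed here, so the sketch returns the metadata dict without interpreting it; load_result is a hypothetical helper.

```python
import json
from pathlib import Path


def load_result(output_dir, stem):
    """Load (extraction, metadata) for one processed file.

    stem is assumed to be the input file name without its extension,
    e.g. "input" for input.pdf.
    """
    out = Path(output_dir)
    metadata = json.loads((out / f"{stem}_metadata.json").read_text(encoding="utf-8"))
    extraction_path = out / f"{stem}.json"
    extraction = None
    if extraction_path.exists():  # may be absent if extraction failed (assumption)
        extraction = json.loads(extraction_path.read_text(encoding="utf-8"))
    return extraction, metadata
```

For example, after running lift_extract input.pdf ./output --schema schema.json, calling load_result("./output", "input") would return the extracted dict and its metadata, provided the naming assumption holds.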

Python

from lift import extract

# schema: a dict, a path to a .json file, an inline JSON string, or a library name
result = extract("document.pdf", "schema.json")
if result.extraction is not None:
    data = result.extraction  # dict matching the schema

To load weights in-process and reuse them across calls, pass a reused model: from lift.model import InferenceManager, then model=InferenceManager(method="hf"). Pass page_range="0-5" to limit PDF pages. Set VLLM_API_BASE to target a remote server.
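
When processing several documents from Python, reusing one model avoids reloading weights on every call. The sketch below is one way to structure that. extract_many and run_with_lift are hypothetical helpers, the keyword names model= and page_range= are taken from the description above and may not match the exact signature, and a result whose extraction is None is treated as a failure.

```python
def extract_many(paths, schema, extract_fn, **extract_kwargs):
    """Run extract_fn over each path and split results into successes and failures."""
    succeeded, failed = {}, []
    for path in paths:
        result = extract_fn(path, schema, **extract_kwargs)
        if result.extraction is not None:
            succeeded[path] = result.extraction  # dict matching the schema
        else:
            failed.append(path)
    return succeeded, failed


def run_with_lift(paths, schema_path):
    """Example wiring with lift; requires the HuggingFace extra to be installed."""
    # Imported lazily so extract_many stays usable without lift installed.
    from lift import extract
    from lift.model import InferenceManager

    model = InferenceManager(method="hf")  # load weights once, reuse for every call
    # model= and page_range= are assumed keyword names (see the text above).
    return extract_many(paths, schema_path, extract, model=model, page_range="0-5")
```

Passing the extract function in as an argument keeps the loop easy to test with a stand-in and leaves the choice of backend to the caller.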

Schema Studio

Launch the interactive app to build, save, and test extraction schemas against your documents (requires pip install lift-pdf[app]):

lift_app

vLLM Server

For production deployments or batch processing, launch the vLLM server:

lift_vllm                # defaults to H100 settings
lift_vllm --gpu a100-80  # tune batch settings for your GPU

This launches a Docker container with optimized inference settings, automatically scaling batch size to your GPU's VRAM. Supported GPUs: h100, a100-80, a100/a100-40, l40s, a10, l4, 4090, 3090, t4.

You can also start your own vLLM server with the datalab-to/lift-extract model.
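
Before sending documents to a server, it can be useful to confirm that it is reachable and see which model names it serves. vLLM's OpenAI-compatible server exposes a model listing at /v1/models, which the sketch below queries with the standard library. The helper names are hypothetical, and the default base URL is the VLLM_API_BASE value shown in the Configuration section.

```python
import json
import os
import urllib.request


def models_url(api_base):
    """Build the model-listing URL from an OpenAI-compatible base URL."""
    return api_base.rstrip("/") + "/models"


def list_served_models(api_base=None, timeout=5.0):
    """Return the model ids reported by the server; raises if it is unreachable."""
    api_base = api_base or os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
    with urllib.request.urlopen(models_url(api_base), timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload.get("data", [])]
```

If you start your own server, the ids returned here show which name to use for VLLM_MODEL_NAME.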

Configuration

Settings can be configured via environment variables or a local.env file:

# Model settings
MODEL_CHECKPOINT=datalab-to/lift-extract
MAX_OUTPUT_TOKENS=12384
TORCH_DEVICE=cuda:0     # pin the HF backend to a device

# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL_NAME=lift
VLLM_GPUS=0
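
To make the file format concrete, the sketch below parses KEY=VALUE lines like the ones above into a dict, skipping comment lines and dropping trailing comments. It is an illustration only: parse_env_text is a hypothetical helper, and lift's own settings loader may treat quoting or inline comments differently.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop a trailing comment such as "cuda:0     # pin the HF backend".
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


sample = """
# Model settings
MODEL_CHECKPOINT=datalab-to/lift-extract
TORCH_DEVICE=cuda:0     # pin the HF backend to a device

# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
"""
settings = parse_env_text(sample)
print(settings)
```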

Commercial usage

The code is licensed under Apache 2.0, and the model weights use a modified OpenRAIL-M license: free for research, personal use, and startups under $5M in funding/revenue, but not for uses that compete with our API. To remove the OpenRAIL license requirements, or for broader commercial licensing, visit our pricing page.

Credits

Thank you to the following open source projects:

Release history

0.1.1 (Jun 19, 2026)

0.1.0 (Jun 18, 2026), this version

Download files

Source Distribution

lift_pdf-0.1.0.tar.gz (1.3 MB view details)

Uploaded Jun 18, 2026 Source

Built Distribution

lift_pdf-0.1.0-py3-none-any.whl (33.2 kB view details)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file lift_pdf-0.1.0.tar.gz.

File metadata

Download URL: lift_pdf-0.1.0.tar.gz
Upload date: Jun 18, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lift_pdf-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`11531c859bbcb0870fb0a92eb322d86d362990e6f58ee5ab52cf52d473e02887`
MD5	`4a98c93a06aabdf5c74963fc394e2a7f`
BLAKE2b-256	`7cef6c7cc2dd8ba377e465d39af8e3cfa61c43fccd5d7dea9240f8e7e9786aeb`

File details

Details for the file lift_pdf-0.1.0-py3-none-any.whl.

File metadata

Download URL: lift_pdf-0.1.0-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 33.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lift_pdf-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2faa0c93d82663501ebfdb5eccec36b9d279adce316affaea694ae7f8f3b58c6`
MD5	`970408b0cc8bcbd98396e06dde186cd8`
BLAKE2b-256	`46bc1619d4f98152b628a302d49b05b5fa3e993ff35b077495e4ece607eb9932`

lift-pdf 0.1.0
