pdf-structify

Extract structured data from PDFs using LLMs with a scikit-learn-like API.

pdf-structify makes it easy to extract structured, tabular data from PDF documents using Large Language Models. It handles PDF splitting, schema detection, and data extraction with progress tracking and checkpoint/resume support.

Features

Scikit-learn-like API: Familiar fit(), transform(), fit_transform() interface
Automatic Schema Detection: Let the LLM analyze your documents and detect extractable fields
Natural Language Schema Definition: Describe what you want to extract in plain English
Progress Bars: Per-stage and per-file progress tracking in the terminal, rendered with rich
Checkpoint/Resume: Interrupted runs resume automatically from the last checkpoint
Two-Layer Prompt System: Strict JSON enforcement for reliable extraction
PDF Splitting: Automatically split large PDFs into manageable chunks
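
For readers unfamiliar with the scikit-learn convention, the following sketch shows the general shape of a fit() / transform() / fit_transform() interface. The class FieldProjector is hypothetical and is not part of pdf-structify; it only illustrates the calling pattern named in the feature list.

```python
class FieldProjector:
    """Hypothetical transformer illustrating the fit/transform convention."""

    def fit(self, records):
        # Learn state from the data: here, the sorted union of all keys seen.
        self.fields_ = sorted({key for record in records for key in record})
        return self  # returning self allows call chaining

    def transform(self, records):
        # Apply the learned state: project every record onto the known fields.
        return [{f: record.get(f) for f in self.fields_} for record in records]

    def fit_transform(self, records):
        # Convenience method: fit, then transform the same input.
        return self.fit(records).transform(records)


rows = [{"company": "Acme", "year": 2023}, {"company": "Globex", "revenue": 1.5}]
projected = FieldProjector().fit_transform(rows)
```

After the call, every row in projected carries the same three keys, with None where a value was absent.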

Installation

pip install pdf-structify

Quick Start

3-Line Extraction

from structify import Pipeline

pipeline = Pipeline.quick_start()
results = pipeline.fit_transform("my_pdfs/")
results.to_csv("output.csv")

From Natural Language Description

from structify import Pipeline

pipeline = Pipeline.from_description("""
    Extract research findings from academic papers:
    - Author names and publication year
    - The country being studied
    - Main numerical finding (coefficient or percentage)
    - Statistical significance (p-value)
    - Methodology used (regression, RCT, etc.)
""")

results = pipeline.fit_transform("research_papers/")

With Custom Schema

from structify import Pipeline, SchemaBuilder

schema = SchemaBuilder.create(
    name="financial_metrics",
    fields=[
        {"name": "company", "type": "string", "required": True},
        {"name": "year", "type": "integer", "required": True},
        {"name": "revenue", "type": "float"},
        {"name": "profit_margin", "type": "float"},
        {"name": "sector", "type": "categorical",
         "options": ["Tech", "Finance", "Healthcare", "Energy"]}
    ],
    focus_on=["financial statements", "annual reports"],
    skip=["legal disclaimers", "boilerplate text"]
)

pipeline = Pipeline.from_schema(schema)
results = pipeline.fit_transform("annual_reports/")
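
To make the meaning of the field definitions concrete, here is a minimal sketch of how a record could be checked against such a field list. The function validate_record and the TYPE_MAP table are assumptions made for illustration; the README does not describe how pdf-structify validates records internally.

```python
# Hypothetical mapping from the type names used above to Python types.
TYPE_MAP = {"string": str, "integer": int, "float": (int, float), "categorical": str}


def validate_record(record, fields):
    """Return a list of problems found in `record`; an empty list means valid."""
    problems = []
    for field in fields:
        name = field["name"]
        value = record.get(name)
        if value is None:
            if field.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[field["type"]]):
            problems.append(f"wrong type for {name}: expected {field['type']}")
        elif field["type"] == "categorical" and value not in field.get("options", []):
            problems.append(f"invalid option for {name}: {value!r}")
    return problems


fields = [
    {"name": "company", "type": "string", "required": True},
    {"name": "year", "type": "integer", "required": True},
    {"name": "revenue", "type": "float"},
    {"name": "sector", "type": "categorical", "options": ["Tech", "Finance"]},
]

ok = validate_record({"company": "Acme", "year": 2023, "sector": "Tech"}, fields)
bad = validate_record({"year": "2023", "sector": "Retail"}, fields)
```

Here ok is an empty list, while bad reports the missing company, the string-typed year, and the unknown sector.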

Resume After Interruption

from structify import Pipeline

# If a run was interrupted, run this again; it resumes from the saved checkpoint.
pipeline = Pipeline.resume("my_pdfs/")
results = pipeline.transform("my_pdfs/")
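
The general idea behind checkpoint/resume can be sketched with the standard library alone. The function process_all and the JSON checkpoint layout below are hypothetical; the README does not document how pdf-structify stores its checkpoints.

```python
import json
import tempfile
from pathlib import Path


def process_all(names, checkpoint_path, handler):
    """Run `handler` on each name, skipping names already recorded as done."""
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for name in names:
        if name in done:
            continue  # finished in an earlier run
        handler(name)
        done.add(name)
        # Persist after every item so an interruption loses at most one item.
        path.write_text(json.dumps(sorted(done)))
    return done


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    first_run, second_run = [], []
    process_all(["a.pdf", "b.pdf"], checkpoint, first_run.append)
    # A later run over a larger list only handles the new file.
    process_all(["a.pdf", "b.pdf", "c.pdf"], checkpoint, second_run.append)
```

Writing the checkpoint after each item trades a little I/O for the guarantee that completed work is never repeated.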

Configuration

Environment Variables

export GEMINI_API_KEY="your-api-key"
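
A script can fail early with a clear message when the variable is missing. The helper require_api_key below is a hypothetical convenience, not part of pdf-structify; it uses only os.environ.

```python
import os


def require_api_key(name="GEMINI_API_KEY"):
    """Return the key from the environment or raise a clear error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline")
    return key
```

Calling require_api_key() at the top of a script surfaces a missing key before any PDF processing starts.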

Or in Code

from structify import Config

Config.set(
    gemini_api_key="your-api-key",
    pages_per_chunk=10,
    temperature=0.1,
    max_retries=5
)
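
The README does not spell out what max_retries governs. One plausible reading, given the strict JSON enforcement listed under Features, is that a model call is repeated when its output cannot be parsed. The sketch below illustrates that pattern under this assumption; call_with_retries and the stand-in generate callable are hypothetical.

```python
import json


def call_with_retries(generate, max_retries=5):
    """Call `generate()` until it returns valid JSON, up to max_retries attempts."""
    last_error = None
    for _ in range(max_retries):
        try:
            return json.loads(generate())
        except json.JSONDecodeError as error:
            last_error = error  # malformed output: try again
    raise RuntimeError(f"no valid JSON after {max_retries} attempts") from last_error


# Stand-in for a model call: fails twice, then returns valid JSON.
replies = iter(["not json", "{broken", '[{"company": "Acme", "year": 2023}]'])
parsed = call_with_retries(lambda: next(replies))
```

The third reply parses cleanly, so parsed holds a one-element list of records.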

Or from .env File

from structify import Config
Config.from_env()  # Loads from .env file

Components

PDFSplitter

Split large PDFs into smaller chunks:

from structify import PDFSplitter

splitter = PDFSplitter(pages_per_chunk=10)
splitter.transform("large_documents/", output_path="chunks/")
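
The arithmetic behind pages_per_chunk can be sketched as follows. The helper chunk_ranges is hypothetical and only shows how a page count maps to chunk boundaries; it does not read or write PDF files.

```python
def chunk_ranges(total_pages, pages_per_chunk=10):
    """Return half-open (start, stop) page index ranges covering the document."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


ranges = chunk_ranges(25, pages_per_chunk=10)
# ranges == [(0, 10), (10, 20), (20, 25)]
```

A 25-page document therefore yields three chunks, the last one shorter than the others.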

SchemaDetector

Automatically detect extractable fields:

from structify import SchemaDetector

detector = SchemaDetector(sample_ratio=0.1, max_samples=30)
schema = detector.fit_transform("documents/")
print(schema.fields)
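
How sample_ratio and max_samples interact is not documented here. A reasonable guess is that the ratio sets the sample size and max_samples caps it; the sketch below implements that guess. The function pick_samples and its seed parameter are assumptions for illustration only.

```python
import math
import random


def pick_samples(paths, sample_ratio=0.1, max_samples=30, seed=0):
    """Pick a reproducible random subset of `paths` for schema detection."""
    if not paths:
        return []
    # At least one document, at most max_samples, otherwise ratio * total.
    count = max(1, min(max_samples, math.ceil(sample_ratio * len(paths))))
    return random.Random(seed).sample(paths, count)


docs = [f"doc_{i:03d}.pdf" for i in range(500)]
sample = pick_samples(docs)
```

With 500 documents the ratio alone would select 50, so the cap of 30 applies.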

LLMExtractor

Extract data using a schema:

from structify import LLMExtractor, Schema

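# my_schema is a schema created beforehand, e.g. with SchemaBuilder.create(...)
# or returned by SchemaDetector.fit_transform(...)
extractor = LLMExtractor(schema=my_schema, deduplicate=True)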
results = extractor.fit_transform("documents/")
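
The README does not say how deduplicate=True decides that two records are the same. The simplest interpretation, exact-match deduplication, can be sketched like this; the deduplicate function below is hypothetical.

```python
import json


def deduplicate(records):
    """Drop records whose field values exactly match an earlier record."""
    seen = set()
    unique = []
    for record in records:
        # A canonical JSON string gives an order-independent, hashable key.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"company": "Acme", "year": 2023},
    {"year": 2023, "company": "Acme"},  # same values, different key order
    {"company": "Acme", "year": 2024},
]
unique = deduplicate(records)
```

Overlapping chunks of the same PDF are a likely source of such repeated records, which is why the first occurrence is kept and later copies are dropped.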

Progress Tracking

pdf-structify displays progress for each pipeline stage and file in the terminal:

╭─────────────────── Structify Pipeline ───────────────────╮
│ Stage 2/3: Data Extraction                               │
╰──────────────────────────────────────────────────────────╯
Processing papers ━━━━━━━━━━━━━━━━━ 45% 12/25 papers
  Current: "Economic_Study.pdf" part 3/8
  → Found 24 records

Output Formats

# CSV
results.to_csv("output.csv")

# JSON
results.to_json("output.json")

# Parquet
results.to_parquet("output.parquet")

# Excel
results.to_excel("output.xlsx")
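
An exported CSV can be read back with the standard library for downstream processing. In the sketch below the file contents and column names are illustrative stand-ins, not actual pdf-structify output.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output.csv"
    # Stand-in for a file written by results.to_csv(); columns are illustrative.
    path.write_text("company,year,revenue\nAcme,2023,1.5\n", encoding="utf-8")

    with path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

# csv returns every value as a string, so convert numeric columns explicitly.
typed = [{**row, "year": int(row["year"]), "revenue": float(row["revenue"])} for row in rows]
```

Formats such as JSON or Parquet keep numeric types, which avoids this conversion step.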

Requirements

Python 3.10+
Google Gemini API key

Dependencies

google-generativeai
pypdf
rich
pydantic
pandas
python-dotenv
pyyaml

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
