pdf-structify

Extract structured data from PDFs using LLMs with a scikit-learn-like API.

pdf-structify extracts structured, tabular data from PDF documents using large language models (LLMs). It handles PDF splitting, schema detection, and data extraction, with progress tracking and checkpoint/resume support.

Features

  • Scikit-learn-like API: Familiar fit(), transform(), fit_transform() interface
  • Automatic Schema Detection: Let the LLM analyze your documents and detect extractable fields
  • Natural Language Schema Definition: Describe what you want to extract in plain English
  • Progress Bars: Informative progress tracking rendered with rich
  • Checkpoint/Resume: Interrupted runs resume automatically from the last checkpoint
  • Two-Layer Prompt System: Strict JSON enforcement for reliable extraction
  • PDF Splitting: Automatically split large PDFs into manageable chunks
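
The details of the strict JSON enforcement are not described here. As a rough illustration of why such a layer is useful, the sketch below parses a model reply that may or may not wrap its JSON in a Markdown code fence, and rejects anything that is not a list of objects. The helper name `parse_json_records` is hypothetical and is not part of pdf-structify.

```python
import json
import re

FENCE = "`" * 3  # three backticks, as used by Markdown code fences
FENCED_REPLY = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE), re.DOTALL
)


def parse_json_records(reply: str) -> list[dict]:
    """Parse a list of JSON objects from a raw LLM reply (hypothetical helper)."""
    text = reply.strip()
    # Unwrap the payload if the whole reply is a single fenced block.
    fenced = FENCED_REPLY.fullmatch(text)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not all(isinstance(r, dict) for r in data):
        raise ValueError("expected a JSON array of objects")
    return data
```

A caller could catch the `ValueError` and re-prompt the model, which is one plausible way to combine this check with a retry limit.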

Installation

pip install pdf-structify

Quick Start

3-Line Extraction

from structify import Pipeline

pipeline = Pipeline.quick_start()
results = pipeline.fit_transform("my_pdfs/")
results.to_csv("output.csv")

From Natural Language Description

from structify import Pipeline

pipeline = Pipeline.from_description("""
    Extract research findings from academic papers:
    - Author names and publication year
    - The country being studied
    - Main numerical finding (coefficient or percentage)
    - Statistical significance (p-value)
    - Methodology used (regression, RCT, etc.)
""")

results = pipeline.fit_transform("research_papers/")

With Custom Schema

from structify import Pipeline, SchemaBuilder

schema = SchemaBuilder.create(
    name="financial_metrics",
    fields=[
        {"name": "company", "type": "string", "required": True},
        {"name": "year", "type": "integer", "required": True},
        {"name": "revenue", "type": "float"},
        {"name": "profit_margin", "type": "float"},
        {"name": "sector", "type": "categorical",
         "options": ["Tech", "Finance", "Healthcare", "Energy"]}
    ],
    focus_on=["financial statements", "annual reports"],
    skip=["legal disclaimers", "boilerplate text"]
)

pipeline = Pipeline.from_schema(schema)
results = pipeline.fit_transform("annual_reports/")
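
The field definitions above carry enough information to check an extracted record by hand. The following is a minimal sketch of such a check, not the library's own validation: `validate_record` is a hypothetical helper, and the mapping from the schema's type names to Python types is an assumption.

```python
# Assumed mapping from the schema's type names to Python types.
PY_TYPES = {"string": str, "integer": int, "float": (int, float), "categorical": str}


def validate_record(record: dict, fields: list[dict]) -> list[str]:
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field in fields:
        name = field["name"]
        value = record.get(name)
        if value is None:
            if field.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(value, PY_TYPES[field["type"]]):
            problems.append(f"{name}: expected {field['type']}")
        elif "options" in field and value not in field["options"]:
            problems.append(f"{name}: {value!r} not in allowed options")
    return problems


fields = [
    {"name": "company", "type": "string", "required": True},
    {"name": "year", "type": "integer", "required": True},
    {"name": "sector", "type": "categorical", "options": ["Tech", "Finance"]},
]
print(validate_record({"company": "Acme", "year": 2023, "sector": "Retail"}, fields))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a record in a single pass.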

Resume After Interruption

from structify import Pipeline

# If interrupted, just run again - automatically resumes!
pipeline = Pipeline.resume("my_pdfs/")
results = pipeline.transform("my_pdfs/")
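
How pdf-structify stores its checkpoints is not documented here. The general idea behind checkpoint/resume can be sketched with the standard library alone: record each finished item on disk and skip it on the next run. The function name, the JSON file layout, and the requirement that items are strings with JSON-serializable results are all assumptions made for this illustration.

```python
import json
from pathlib import Path


def process_with_checkpoint(items, work, checkpoint_path):
    """Run work(item) for each item, skipping items a previous run finished."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for item in items:
        if item in done:
            continue  # already processed in an earlier run
        done[item] = work(item)
        # Persist after every item so an interruption loses at most one item.
        path.write_text(json.dumps(done))
    return done
```

Calling the function a second time with the same checkpoint file only runs `work` on items that are not yet recorded.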

Configuration

Environment Variables

export GEMINI_API_KEY="your-api-key"
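
Before starting a long run, it can help to confirm that the variable is actually visible to Python. This is a small standard-library sketch; `get_api_key` is a hypothetical helper, not a pdf-structify function.

```python
import os


def get_api_key(env=None):
    """Return GEMINI_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; export it before running")
    return key
```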

Or in Code

from structify import Config

Config.set(
    gemini_api_key="your-api-key",
    pages_per_chunk=10,
    temperature=0.1,
    max_retries=5
)
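
The exact behavior behind `max_retries` is not specified here. A common pattern for retrying flaky API calls is exponential backoff, sketched below under that assumption. `call_with_retries`, `base_delay`, and the injectable `sleep` parameter are hypothetical names used only for this example.

```python
import time


def call_with_retries(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, then twice that, then four times that, and so on.
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.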

Or from .env File

from structify import Config
Config.from_env()  # Loads from .env file

Components

PDFSplitter

Split large PDFs into smaller chunks:

from structify import PDFSplitter

splitter = PDFSplitter(pages_per_chunk=10)
splitter.transform("large_documents/", output_path="chunks/")
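
To see what `pages_per_chunk` implies for a given document, the page arithmetic can be sketched as follows. This only illustrates the chunk boundaries; it does not read or write PDFs, and `chunk_page_ranges` is a hypothetical helper rather than part of PDFSplitter.

```python
def chunk_page_ranges(total_pages: int, pages_per_chunk: int = 10) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) pairs covering the document."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


print(chunk_page_ranges(25, 10))  # [(1, 10), (11, 20), (21, 25)]
```

The last chunk is simply shorter when the page count is not a multiple of the chunk size.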

SchemaDetector

Automatically detect extractable fields:

from structify import SchemaDetector

detector = SchemaDetector(sample_ratio=0.1, max_samples=30)
schema = detector.fit_transform("documents/")
print(schema.fields)
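
How `sample_ratio` and `max_samples` interact is not spelled out here. One plausible reading, used as an assumption in the sketch below, is that the detector samples a fraction of the documents and caps the result at `max_samples`. `pick_samples` and its `seed` parameter are hypothetical.

```python
import math
import random


def pick_samples(paths, sample_ratio=0.1, max_samples=30, seed=0):
    """Pick a reproducible random subset of paths (assumed sampling rule)."""
    paths = list(paths)
    if not paths:
        return []
    # At least one document, at most max_samples, never more than available.
    wanted = max(1, math.ceil(len(paths) * sample_ratio))
    count = min(len(paths), max_samples, wanted)
    return random.Random(seed).sample(paths, count)
```

Under this reading, 25 documents yield 3 samples and 500 documents yield 30, because the cap takes over.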

LLMExtractor

Extract data using a schema:

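from structify import LLMExtractor, SchemaDetector

# my_schema must exist first; here it is detected as in the SchemaDetector section
my_schema = SchemaDetector(sample_ratio=0.1, max_samples=30).fit_transform("documents/")

extractor = LLMExtractor(schema=my_schema, deduplicate=True)
results = extractor.fit_transform("documents/")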
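
What `deduplicate=True` treats as a duplicate is not documented here. The simplest interpretation, exact-match deduplication, can be sketched as below. This is an assumption for illustration; `deduplicate` is a hypothetical standalone function, and it requires record values to be hashable.

```python
def deduplicate(records: list[dict]) -> list[dict]:
    """Drop exact duplicate records, keeping the first occurrence and input order."""
    seen = set()
    unique = []
    for record in records:
        # Sorting the items makes the fingerprint independent of key order.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```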

Progress Tracking

pdf-structify reports progress for each pipeline stage, including the current file, chunk, and number of records found:

╭─────────────────── Structify Pipeline ───────────────────╮
│ Stage 2/3: Data Extraction                               │
╰──────────────────────────────────────────────────────────╯
Processing papers ━━━━━━━━━━━━━━━━━ 45% 12/25 papers
  Current: "Economic_Study.pdf" part 3/8
  → Found 24 records

Output Formats

# CSV
results.to_csv("output.csv")

# JSON
results.to_json("output.json")

# Parquet
results.to_parquet("output.parquet")

# Excel
results.to_excel("output.xlsx")

Requirements

  • Python 3.10+
  • Google Gemini API key

Dependencies

  • google-generativeai
  • pypdf
  • rich
  • pydantic
  • pandas
  • python-dotenv
  • pyyaml

License

MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
