# pdf-structify

Extract structured data from PDFs using LLMs with a scikit-learn-like API.

pdf-structify makes it easy to extract structured, tabular data from PDF documents using Large Language Models. It handles PDF splitting, schema detection, and data extraction, with progress tracking and checkpoint/resume support.
## Features

- **Scikit-learn-like API**: familiar `fit()`, `transform()`, `fit_transform()` interface
- **Automatic Schema Detection**: let the LLM analyze your documents and detect extractable fields
- **Natural Language Schema Definition**: describe what you want to extract in plain English
- **Progress Bars**: beautiful, informative progress tracking with `rich`
- **Checkpoint/Resume**: never lose progress; runs automatically resume after interruptions
- **Two-Layer Prompt System**: strict JSON enforcement for reliable extraction
- **PDF Splitting**: automatically split large PDFs into manageable chunks
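"Strict JSON enforcement" can be pictured as a parse-and-retry loop around the model's reply. The sketch below is illustrative only, not the library's internals; the helper name `parse_llm_json` is hypothetical:

```python
import json

def parse_llm_json(reply: str) -> dict:
    """Extract a JSON object from an LLM reply, tolerating markdown fences.

    Hypothetical helper for illustration; not part of pdf-structify's API.
    """
    text = reply.strip()
    # Models often wrap JSON in ```json ... ``` fences; strip them first.
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(text)

reply = '```json\n{"company": "Acme", "year": 2023}\n```'
record = parse_llm_json(reply)  # → {"company": "Acme", "year": 2023}
```

A second prompt layer typically re-asks the model with the parse error appended whenever `json.loads` fails, which is what makes the extraction reliable in practice.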
## Installation

```bash
pip install pdf-structify
```
## Quick Start

### 3-Line Extraction

```python
from structify import Pipeline

pipeline = Pipeline.quick_start()
results = pipeline.fit_transform("my_pdfs/")
results.to_csv("output.csv")
```
### From a Natural Language Description

```python
from structify import Pipeline

pipeline = Pipeline.from_description("""
Extract research findings from academic papers:
- Author names and publication year
- The country being studied
- Main numerical finding (coefficient or percentage)
- Statistical significance (p-value)
- Methodology used (regression, RCT, etc.)
""")
results = pipeline.fit_transform("research_papers/")
```
### With a Custom Schema

```python
from structify import Pipeline, SchemaBuilder

schema = SchemaBuilder.create(
    name="financial_metrics",
    fields=[
        {"name": "company", "type": "string", "required": True},
        {"name": "year", "type": "integer", "required": True},
        {"name": "revenue", "type": "float"},
        {"name": "profit_margin", "type": "float"},
        {"name": "sector", "type": "categorical",
         "options": ["Tech", "Finance", "Healthcare", "Energy"]},
    ],
    focus_on=["financial statements", "annual reports"],
    skip=["legal disclaimers", "boilerplate text"],
)

pipeline = Pipeline.from_schema(schema)
results = pipeline.fit_transform("annual_reports/")
```
### Resume After Interruption

```python
from structify import Pipeline

# If interrupted, just run again - it resumes automatically.
pipeline = Pipeline.resume("my_pdfs/")
results = pipeline.transform("my_pdfs/")
```
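Checkpoint/resume boils down to recording which files have already been processed and skipping them on the next run. A minimal sketch of the idea, assuming a simple JSON checkpoint file (not the library's actual on-disk format):

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def load_done() -> set:
    # Files already processed in a previous (possibly interrupted) run.
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def mark_done(done: set, name: str) -> None:
    # Persist after every file, so an interruption loses at most one file.
    done.add(name)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

done = load_done()
for pdf in ["a.pdf", "b.pdf", "c.pdf"]:
    if pdf in done:
        continue  # resume: skip work that is already finished
    # ... extract data from pdf ...
    mark_done(done, pdf)
```

Persisting after each file rather than at the end is what makes resumption safe against hard interruptions.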
## Configuration

### Environment Variables

```bash
export GEMINI_API_KEY="your-api-key"
```

### Or in Code

```python
from structify import Config

Config.set(
    gemini_api_key="your-api-key",
    pages_per_chunk=10,
    temperature=0.1,
    max_retries=5,
)
```

### Or from a .env File

```python
from structify import Config

Config.from_env()  # Loads from a .env file
```
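`Config.from_env()` relies on python-dotenv. For reference, loading a `.env` file amounts to parsing `KEY=value` lines into the process environment; a simplified sketch of that behavior:

```python
import os

def load_dotenv_lines(lines) -> None:
    """Simplified .env loading: KEY=value pairs; blank lines and '#' comments ignored."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Existing environment variables take precedence over the .env file.
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_dotenv_lines([
    "# API credentials",
    'GEMINI_API_KEY="your-api-key"',
])
```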
## Components

### PDFSplitter

Split large PDFs into smaller chunks:

```python
from structify import PDFSplitter

splitter = PDFSplitter(pages_per_chunk=10)
splitter.transform("large_documents/", output_path="chunks/")
```
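Splitting is a matter of grouping pages into fixed-size ranges; the actual page copying is done with pypdf. The range arithmetic, shown standalone for illustration:

```python
def chunk_ranges(num_pages: int, pages_per_chunk: int):
    """Return (start, end) page ranges, end-exclusive, covering all pages."""
    return [
        (start, min(start + pages_per_chunk, num_pages))
        for start in range(0, num_pages, pages_per_chunk)
    ]

# A 25-page PDF with pages_per_chunk=10 yields two full chunks and one partial:
chunk_ranges(25, 10)  # → [(0, 10), (10, 20), (20, 25)]
```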
### SchemaDetector

Automatically detect extractable fields:

```python
from structify import SchemaDetector

detector = SchemaDetector(sample_ratio=0.1, max_samples=30)
schema = detector.fit_transform("documents/")
print(schema.fields)
```
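With `sample_ratio=0.1` and `max_samples=30`, the detector inspects roughly 10% of the documents, capped at 30. One plausible way to compute the sample size (an assumption for illustration, not the library's documented behavior):

```python
import math

def sample_size(n_docs: int, sample_ratio: float = 0.1, max_samples: int = 30) -> int:
    # At least one document, at most max_samples, otherwise the given ratio.
    return min(max_samples, max(1, math.ceil(n_docs * sample_ratio)))

sample_size(50)    # 10% of 50 → 5
sample_size(1000)  # capped at max_samples → 30
```

Sampling keeps schema detection cheap: only the sampled documents are sent to the LLM for field analysis.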
### LLMExtractor

Extract data using a schema:

```python
from structify import LLMExtractor, Schema

extractor = LLMExtractor(schema=my_schema, deduplicate=True)
results = extractor.fit_transform("documents/")
```
## Progress Tracking

pdf-structify provides beautiful progress bars:

```text
╭─────────────────── Structify Pipeline ───────────────────╮
│                Stage 2/3: Data Extraction                │
╰──────────────────────────────────────────────────────────╯
Processing papers ━━━━━━━━━━━━━━━━━ 45% 12/25 papers
Current: "Economic_Study.pdf" part 3/8
  → Found 24 records
```
## Output Formats

```python
# CSV
results.to_csv("output.csv")

# JSON
results.to_json("output.json")

# Parquet
results.to_parquet("output.parquet")

# Excel
results.to_excel("output.xlsx")
```
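These methods mirror the pandas DataFrame API (pandas is a dependency, so the results behave like a DataFrame of extracted records). If you need to post-process records yourself, the standard-library equivalent for the first two formats looks like this (the sample records are made up for illustration):

```python
import csv
import json

records = [
    {"company": "Acme", "year": 2023, "revenue": 12.5},
    {"company": "Globex", "year": 2023, "revenue": 9.1},
]

# CSV: one row per extracted record, header from the field names.
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# JSON: the full list of records.
with open("output.json", "w") as f:
    json.dump(records, f, indent=2)
```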
## Requirements

- Python 3.10+
- A Google Gemini API key

## Dependencies

- `google-generativeai`
- `pypdf`
- `rich`
- `pydantic`
- `pandas`
- `python-dotenv`
- `pyyaml`
## License

MIT License - see the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.