Structured data extraction with LLM majority vote
Extrai
📖 Description
extrai extracts data from text documents using LLMs, formats the output as instances of a given SQLModel, and saves them to a database.
The library uses a Consensus Mechanism to improve accuracy. It makes the same request multiple times, using the same or different providers, and then keeps only the values whose level of agreement across responses meets a configured threshold.
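The idea behind a threshold-based vote can be sketched in a few lines. This is a minimal illustration, not extrai's implementation: the function name `field_consensus`, the flat per-field vote, and the rule that a missing field counts as a dissent are all assumptions made for the example.

```python
import json
from collections import Counter
from typing import Any


def field_consensus(revisions: list[dict[str, Any]], threshold: float = 0.5) -> dict[str, Any]:
    """Keep, per field, the most common value if enough revisions agree on it.

    Hypothetical helper for illustration; not part of the extrai API.
    """
    consensus: dict[str, Any] = {}
    if not revisions:
        return consensus
    field_names = sorted({name for revision in revisions for name in revision})
    for name in field_names:
        # Serialize values so unhashable ones (lists, dicts) can be counted too.
        votes = Counter(
            json.dumps(revision[name], sort_keys=True)
            for revision in revisions
            if name in revision
        )
        winner, count = votes.most_common(1)[0]
        # Agreement is measured against all revisions, so a missing field counts as a dissent.
        if count / len(revisions) >= threshold:
            consensus[name] = json.loads(winner)
    return consensus


revisions = [
    {"name": "SuperWidget", "price": 99.99},
    {"name": "SuperWidget", "price": 99.99},
    {"name": "Super Widget", "price": 99.0},
]
result = field_consensus(revisions, threshold=0.6)
print(result)  # {'name': 'SuperWidget', 'price': 99.99}
```

Two of the three revisions agree on each field (about 0.67), which meets the 0.6 threshold. Raising the threshold to 0.9 would drop both fields, trading coverage for confidence.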
extrai also has other features, like generating SQLModels from a prompt and documents, and generating few-shot examples. For complex, nested data, the library offers Hierarchical Extraction, breaking down the extraction into manageable, hierarchical steps. It also includes built-in analytics to monitor performance and output quality.
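One way to picture hierarchical extraction is as a recursive walk over a nested schema, with one smaller extraction step per level. The sketch below is an assumption-laden illustration rather than the library's code: `SCHEMA`, `extract_hierarchy`, and `fake_extract` are hypothetical names, and `fake_extract` returns canned records in place of a real LLM call.

```python
from typing import Any, Callable

# Hypothetical nested schema: each entity names its own fields and its child entities.
SCHEMA = {
    "Invoice": {"fields": ["number", "total"], "children": ["LineItem"]},
    "LineItem": {"fields": ["description", "amount"], "children": []},
}

# Stand-in signature for one extraction call: (entity, fields, text) -> list of records.
Extractor = Callable[[str, list[str], str], list[dict[str, Any]]]


def extract_hierarchy(entity: str, text: str, extract: Extractor) -> list[dict[str, Any]]:
    """Extract one entity level, then recurse into each of its child entities."""
    spec = SCHEMA[entity]
    records = extract(entity, spec["fields"], text)
    for record in records:
        for child in spec["children"]:
            # Each level is a separate, smaller extraction step.
            record[child] = extract_hierarchy(child, text, extract)
    return records


def fake_extract(entity: str, fields: list[str], text: str) -> list[dict[str, Any]]:
    """Return canned records instead of calling an LLM (illustration only)."""
    canned = {
        "Invoice": [{"number": "INV-1", "total": 30.0}],
        "LineItem": [
            {"description": "Widget", "amount": 10.0},
            {"description": "Gadget", "amount": 20.0},
        ],
    }
    return [{name: row[name] for name in fields} for row in canned[entity]]


invoice_text = "Invoice INV-1: Widget 10.00, Gadget 20.00. Total 30.00."
tree = extract_hierarchy("Invoice", invoice_text, fake_extract)
print(tree)
```

The sketch extracts children from the whole document for every parent, which is only sensible when the text describes a single root object. A real pipeline would need to scope each child step to its parent.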
✨ Key Features
- Consensus Mechanism: Consolidates multiple LLM outputs to improve extraction accuracy.
- Dynamic SQLModel Generation: Generates SQLModel schemas from natural language descriptions.
- Hierarchical Extraction: Handles complex, nested data by breaking down the extraction into manageable, hierarchical steps.
- Extensible LLM Support: Integrates with various LLM providers through a client interface.
- Built-in Analytics: Collects metrics on LLM performance and output quality to refine prompts and monitor errors.
- Workflow Orchestration: A central orchestrator to manage the extraction pipeline.
- Example JSON Generation: Automatically generates few-shot examples to improve extraction quality.
- Customizable Prompts: Lets you customize prompts at runtime to tailor the extraction process to specific needs.
- Rotating LLM Providers: Generates the JSON revisions using multiple LLM providers.
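As a rough illustration of provider rotation, revisions can be assigned to providers in round-robin order. This is only a sketch: `assign_providers` and the provider names are hypothetical, and round-robin is an assumed strategy, not a description of how extrai schedules its clients.

```python
from itertools import cycle, islice


def assign_providers(provider_names: list[str], num_revisions: int) -> list[str]:
    """Pick a provider for each revision by cycling through the list in order.

    Hypothetical helper for illustration; not part of the extrai API.
    """
    if not provider_names:
        raise ValueError("at least one provider is required")
    return list(islice(cycle(provider_names), num_revisions))


assignments = assign_providers(["provider_a", "provider_b"], 5)
print(assignments)
# ['provider_a', 'provider_b', 'provider_a', 'provider_b', 'provider_a']
```

Spreading revisions across providers means the consensus step compares outputs from different models, so an error specific to one model is less likely to win the vote.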
📚 Documentation
For a complete guide, please see the full documentation. Here are the key sections:
- Getting Started
- How-to Guides
- Core Concepts
- Reference
- API Reference
- Community
⚙️ Workflow Overview
The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see Architecture Overview):
graph TD
%% Define styles for different stages for better colors
classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87
subgraph "Inputs (Static Mode)"
A["📄<br/>Documents"]
B["🏛️<br/>SQLAlchemy Models"]
L1["🤖<br/>LLM"]
end
subgraph "Inputs (Dynamic Mode)"
C["📋<br/>Task Description<br/>(User Prompt)"]
D["📚<br/>Example Documents"]
L2["🤖<br/>LLM"]
end
subgraph "Model Generation<br/>(Optional)"
MG("🔧<br/>Generate SQLModels<br/>via LLM")
end
subgraph "Data Extraction"
EG("📝<br/>Example Generation<br/>(Optional)")
P("✍️<br/>Prompt Generation")
subgraph "LLM Extraction Revisions"
direction LR
E1("🤖<br/>Revision 1")
H1("💧<br/>SQLAlchemy Hydration 1")
E2("🤖<br/>Revision 2")
H2("💧<br/>SQLAlchemy Hydration 2")
E3("🤖<br/>...")
H3("💧<br/>...")
end
F("🤝<br/>JSON Consensus")
H("💧<br/>SQLAlchemy Hydration")
end
subgraph Outputs
SM["🏛️<br/>Generated SQLModels<br/>(Optional)"]
O["✅<br/>Hydrated Objects"]
DB("💾<br/>Database Persistence<br/>(Optional)")
end
%% Connections for Static Mode
L1 --> P
A --> P
B --> EG
EG --> P
P --> E1
P --> E2
P --> E3
E1 --> H1
E2 --> H2
E3 --> H3
H1 --> F
H2 --> F
H3 --> F
F --> H
H --> O
H --> DB
%% Connections for Dynamic Mode
L2 --> MG
C --> MG
D --> MG
MG --> EG
EG --> P
MG --> SM
%% Apply styles
class A,B,C,D,L1,L2 inputStyle;
class P,E1,E2,E3,H,EG processStyle;
class F consensusStyle;
class O,DB,SM outputStyle;
class MG modelGenStyle;
▶️ Getting Started
📦 Installation
Install the library from PyPI:
pip install extrai-workflow
✨ Usage Example
For a more detailed guide, please see the Getting Started Tutorial.
Here is a minimal example:
import asyncio
from typing import Optional

from sqlmodel import Field, SQLModel, create_engine, Session

from extrai.core import WorkflowOrchestrator
from extrai.llm_providers.huggingface_client import HuggingFaceClient

# 1. Define your data model
class Product(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    price: float

# 2. Set up the orchestrator
llm_client = HuggingFaceClient(api_key="YOUR_HF_API_KEY")
engine = create_engine("sqlite:///:memory:")
orchestrator = WorkflowOrchestrator(
    llm_client=llm_client,
    db_engine=engine,
    root_model=Product,
)

# 3. Run the extraction and verify
text = "The new SuperWidget costs $99.99."
with Session(engine) as session:
    asyncio.run(orchestrator.synthesize_and_save([text], db_session=session))
    product = session.query(Product).first()
    print(product)
    # Expected: name='SuperWidget' price=99.99 id=1
🚀 More Examples
For more in-depth examples, see the /examples directory in the repository.
🙌 Contributing
We welcome contributions! Please see the Contributing Guide for details on how to set up your development environment, run tests, and submit a pull request.
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.