Structured data extraction with LLM majority vote

Extrai

Badges: Python CI/CD · codecov · Python 3.12 · MIT License

📖 Description

With extrai, you can use LLMs to extract data from text documents. The extracted data is formatted into a given SQLModel and saved to your database.

The core of the library is its Consensus Mechanism. We make the same request multiple times, using the same or different providers, and then select the values that meet a certain threshold.
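The idea behind such a consensus step can be sketched with the standard library alone. This is a minimal illustration, not extrai's actual implementation: the function name `field_consensus`, the flat dictionary shape, and the rule "keep a value when its share of votes is at least the threshold" are all assumptions made for the example.

```python
from collections import Counter
from typing import Any


def field_consensus(revisions: list[dict[str, Any]], threshold: float = 0.5) -> dict[str, Any]:
    """Hypothetical helper: per field, keep the most common value if enough revisions agree."""
    if not revisions:
        return {}
    total = len(revisions)
    consensus: dict[str, Any] = {}
    all_fields = {key for revision in revisions for key in revision}
    for field in sorted(all_fields):
        # Vote on repr() so that unhashable values (lists, dicts) can be counted too.
        votes = Counter(repr(rev[field]) for rev in revisions if field in rev)
        winner_repr, count = votes.most_common(1)[0]
        # A revision that omits the field counts as a vote against every value.
        if count / total >= threshold:
            consensus[field] = next(
                rev[field] for rev in revisions
                if field in rev and repr(rev[field]) == winner_repr
            )
    return consensus


# Three simulated LLM outputs for the same document; one disagrees on the name.
revisions = [
    {"name": "SuperWidget", "price": 99.99},
    {"name": "SuperWidget", "price": 99.99},
    {"name": "Super Widget", "price": 99.99},
]
print(field_consensus(revisions, threshold=0.6))
# {'name': 'SuperWidget', 'price': 99.99}
```

With a stricter threshold such as `0.7`, the `name` field (2 of 3 votes) would be dropped and only `price` would survive, which shows how the threshold trades coverage for agreement.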

extrai also has other features, like generating SQLModels from a prompt and documents, and generating few-shot examples. For complex, nested data, the library offers Hierarchical Extraction, breaking down the extraction into manageable, hierarchical steps. It also includes built-in analytics to monitor performance and output quality.

✨ Key Features

📚 Documentation

For a complete guide, please see the full documentation.

⚙️ Workflow Overview

The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see Architecture Overview):

graph TD
    %% Define styles for different stages for better colors
    classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
    classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
    classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
    classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
    classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87

    subgraph "Inputs (Static Mode)"
        A["📄<br/>Documents"]
        B["🏛️<br/>SQLAlchemy Models"]
        L1["🤖<br/>LLM"]
    end

    subgraph "Inputs (Dynamic Mode)"
        C["📋<br/>Task Description<br/>(User Prompt)"]
        D["📚<br/>Example Documents"]
        L2["🤖<br/>LLM"]
    end

    subgraph "Model Generation<br/>(Optional)"
        MG("🔧<br/>Generate SQLModels<br/>via LLM")
    end

    subgraph "Data Extraction"
        EG("📝<br/>Example Generation<br/>(Optional)")
        P("✍️<br/>Prompt Generation")
        
        subgraph "LLM Extraction Revisions"
            direction LR
            E1("🤖<br/>Revision 1")
            H1("💧<br/>SQLAlchemy Hydration 1")
            E2("🤖<br/>Revision 2")
            H2("💧<br/>SQLAlchemy Hydration 2")
            E3("🤖<br/>...")
            H3("💧<br/>...")
        end
        
        F("🤝<br/>JSON Consensus")
        H("💧<br/>SQLAlchemy Hydration")
    end

    subgraph Outputs
        SM["🏛️<br/>Generated SQLModels<br/>(Optional)"]
        O["✅<br/>Hydrated Objects"]
        DB("💾<br/>Database Persistence<br/>(Optional)")
    end

    %% Connections for Static Mode
    L1 --> P
    A --> P
    B --> EG
    EG --> P
    P --> E1
    P --> E2
    P --> E3
    E1 --> H1
    E2 --> H2
    E3 --> H3
    H1 --> F
    H2 --> F
    H3 --> F
    F --> H
    H --> O
    H --> DB

    %% Connections for Dynamic Mode
    L2 --> MG
    C --> MG
    D --> MG
    MG --> EG
    EG --> P

    MG --> SM

    %% Apply styles
    class A,B,C,D,L1,L2 inputStyle;
    class P,E1,E2,E3,H,EG processStyle;
    class F consensusStyle;
    class O,DB,SM outputStyle;
    class MG modelGenStyle;
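
The static-mode path in the diagram (prompt generation, several LLM revisions, consensus, hydration) can be sketched as follows. This is a simplified, self-contained illustration rather than the library's code: `fake_llm_revision` is a hypothetical stand-in for a real provider call, a plain dataclass replaces the SQLModel, and consensus is reduced to a whole-object majority vote.

```python
import asyncio
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class Product:
    # Stand-in for a SQLModel class, to keep the sketch dependency-free.
    name: str
    price: float


async def fake_llm_revision(prompt: str, revision_id: int) -> str:
    """Hypothetical stand-in for one LLM call. Revision 2 disagrees on purpose."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    price = 89.99 if revision_id == 2 else 99.99
    return json.dumps({"name": "SuperWidget", "price": price})


async def extract(document: str, num_revisions: int = 3) -> Product:
    # Prompt generation
    prompt = f"Extract a product as JSON with keys 'name' and 'price':\n{document}"
    # LLM extraction revisions, run concurrently
    raw = await asyncio.gather(
        *(fake_llm_revision(prompt, i) for i in range(num_revisions))
    )
    # Consensus (simplified): majority vote over canonicalized JSON objects
    canonical = [json.dumps(json.loads(text), sort_keys=True) for text in raw]
    winner, _ = Counter(canonical).most_common(1)[0]
    # Hydration: turn the winning JSON into a typed object
    return Product(**json.loads(winner))


product = asyncio.run(extract("The new SuperWidget costs $99.99."))
print(product)
# Product(name='SuperWidget', price=99.99)
```

Unlike the diagram, this sketch hydrates only once, after the vote. The per-revision hydration step shown above is omitted for brevity.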

▶️ Getting Started

📦 Installation

Install the library from PyPI:

pip install extrai

✨ Usage Example

For a more detailed guide, please see the Getting Started Tutorial.

Here is a minimal example:

import asyncio
from typing import Optional
from sqlmodel import Field, SQLModel, create_engine, Session
from extrai.core import WorkflowOrchestrator
from extrai.llm_providers.huggingface_client import HuggingFaceClient

# 1. Define your data model
class Product(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    price: float

# 2. Set up the orchestrator
llm_client = HuggingFaceClient(api_key="YOUR_HF_API_KEY")
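engine = create_engine("sqlite:///:memory:")
# Make sure the tables for the models defined above exist (does nothing if they already do)
SQLModel.metadata.create_all(engine)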
orchestrator = WorkflowOrchestrator(
    llm_client=llm_client,
    db_engine=engine,
    root_model=Product,
)

# 3. Run the extraction and verify
text = "The new SuperWidget costs $99.99."
with Session(engine) as session:
    asyncio.run(orchestrator.synthesize_and_save([text], db_session=session))
    product = session.query(Product).first()
    print(product)
    # Expected: name='SuperWidget' price=99.99 id=1

🚀 More Examples

For more in-depth examples, see the /examples directory in the repository.

🙌 Contributing

We welcome contributions! Please see the Contributing Guide for details on how to set up your development environment, run tests, and submit a pull request.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
