Pullcite

Evidence-backed structured extraction from documents.

Pullcite uses LLMs to extract structured data from documents and records where each value came from in the source: the supporting quote, the page number, and the bounding box.

You define Django-style schemas in which each field carries its own search query. Pullcite runs that search per field, passes the matching excerpts to the LLM as context, and then verifies each extracted value against the source.

Installation

# Core package
pip install pullcite

# With optional extras (quoted so shells such as zsh do not expand the brackets)
pip install "pullcite[anthropic]"    # Anthropic Claude
pip install "pullcite[openai]"       # OpenAI GPT
pip install "pullcite[tantivy]"      # High-performance BM25 search
pip install "pullcite[voyage]"       # Voyage AI embeddings (for semantic search)
pip install "pullcite[docling]"      # PDF/DOCX parsing with coordinates

# Everything
pip install "pullcite[all]"

Quick Start

from pullcite import (
    Document,
    ExtractionSchema,
    Extractor,
    DecimalField,
    StringField,
    SearchType,
    BM25Searcher,
)
from pullcite.llms.anthropic import AnthropicLLM

# 1. Define your schema with field-level search queries
class Invoice(ExtractionSchema):
    vendor = StringField(
        query="vendor company name supplier",
        search_type=SearchType.BM25,
        description="Company that issued the invoice",
    )
    total = DecimalField(
        query="total amount due grand total",
        search_type=SearchType.BM25,
        description="Total amount due",
    )

# 2. Create extractor
extractor = Extractor(
    schema=Invoice,
    llm=AnthropicLLM(),
    searcher=BM25Searcher(),
)

# 3. Extract with evidence
doc = Document.from_file("invoice.pdf")
result = extractor.extract(doc)

# 4. Access data and evidence
print(result.data.total)         # Decimal("1500.00")
print(result.data.vendor)        # "Acme Corp"
print(result.status)             # ExtractionStatus.VERIFIED

evidence = result.evidence_map["total"]
print(evidence.quote)            # "Grand Total: $1,500.00"
print(evidence.page)             # 1
print(evidence.bbox)             # (72.0, 540.2, 200.5, 555.8)
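
The bounding box is reported in PDF points (1 point = 1/72 inch), so drawing a highlight on a rendered page image needs a unit conversion. The sketch below is not part of Pullcite: `bbox_to_pixels` is a hypothetical helper, and whether the y-axis must be flipped depends on the coordinate origin, which this page does not state. Pass `page_height_pt` only if the box uses a bottom-left origin.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def bbox_to_pixels(
    bbox: BBox,
    dpi: float = 144.0,
    page_height_pt: Optional[float] = None,
) -> BBox:
    """Convert a box in PDF points to pixel coordinates at the given DPI.

    Hypothetical helper, not a Pullcite API. If page_height_pt is given,
    the box is assumed to use a bottom-left origin and is flipped to the
    top-left origin that image libraries usually expect.
    """
    scale = dpi / 72.0  # 72 PDF points per inch
    x0, y0, x1, y1 = bbox
    if page_height_pt is not None:
        # Flip the y-axis; the old top edge becomes the new smaller y.
        y0, y1 = page_height_pt - y1, page_height_pt - y0
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale)


if __name__ == "__main__":
    # Same box as in the Quick Start output, rendered at 144 DPI.
    print(bbox_to_pixels((72.0, 540.2, 200.5, 555.8), dpi=144.0))
```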

Health Insurance Example

from pullcite import (
    Document,
    ExtractionSchema,
    Extractor,
    StringField,
    DecimalField,
    PercentField,
    BooleanField,
    SearchType,
    BM25Searcher,
)
from pullcite.llms.anthropic import AnthropicLLM


class HealthPlan(ExtractionSchema):
    """Health insurance plan extraction schema."""

    plan_name = StringField(
        query="plan name health plan title",
        search_type=SearchType.BM25,
    )
    plan_type = StringField(
        query="plan type HMO PPO EPO",
        search_type=SearchType.BM25,
    )

    # Deductibles
    individual_deductible = DecimalField(
        query="individual deductible annual",
        search_type=SearchType.BM25,
    )
    family_deductible = DecimalField(
        query="family deductible annual",
        search_type=SearchType.BM25,
    )

    # Out-of-pocket
    individual_oop_max = DecimalField(
        query="individual out-of-pocket maximum",
        search_type=SearchType.BM25,
    )
    family_oop_max = DecimalField(
        query="family out-of-pocket maximum",
        search_type=SearchType.BM25,
    )

    # Copays
    pcp_copay = DecimalField(
        query="primary care physician copay PCP",
        search_type=SearchType.BM25,
    )
    specialist_copay = DecimalField(
        query="specialist copay",
        search_type=SearchType.BM25,
    )
    er_copay = DecimalField(
        query="emergency room ER copay",
        search_type=SearchType.BM25,
    )

    # Coinsurance
    coinsurance = PercentField(
        query="coinsurance percentage member pays",
        search_type=SearchType.BM25,
    )

    # Prescriptions
    generic_rx = DecimalField(
        query="generic prescription drug copay tier 1",
        search_type=SearchType.BM25,
        required=False,
    )

    # Coverage
    preventive_covered = BooleanField(
        query="preventive care covered no cost",
        search_type=SearchType.BM25,
        required=False,
    )


# Create extractor with custom instructions
extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(model="claude-sonnet-4-20250514"),
    searcher=BM25Searcher(),
    extra_instructions="""
    - Extract IN-NETWORK values when both in/out-of-network are shown
    - Coinsurance is what the MEMBER pays, not the plan
    - "No charge" or "Covered in full" means $0
    """,
)

# Extract
doc = Document.from_file("summary_of_benefits.pdf")
result = extractor.extract(doc)

# Print results
print(f"Plan: {result.data.plan_name} ({result.data.plan_type})")
print(f"Individual Deductible: ${result.data.individual_deductible}")
print(f"PCP Copay: ${result.data.pcp_copay}")
print(f"Coinsurance: {result.data.coinsurance}%")
print(f"Status: {result.status}")

Field Types

Field Type	Python Type	Parses	Example
`StringField`	`str`	Text	`"Acme Corp"`
`IntegerField`	`int`	Integers	`100`, `"100 days"`
`FloatField`	`float`	Floating-point numbers	`3.14`
`DecimalField`	`Decimal`	Currency	`"$1,500.00"` → `Decimal("1500.00")`
`CurrencyField`	`Decimal`	Currency	Same as `DecimalField`, with symbol handling
`PercentField`	`float`	Percentages	`"30%"`, `0.30` → `30.0`
`BooleanField`	`bool`	Yes/No	`"yes"`, `"true"`, `1` → `True`
`DateField`	`str`	Dates	`"2024-01-15"`
`ListField`	`list`	Arrays	`["a", "b", "c"]`
`EnumField`	`str`	Choices	Must be one of defined choices
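
The "Parses" and "Example" columns imply that raw strings are normalized into Python values. As a rough illustration of that kind of normalization, here is a minimal sketch using only the standard library. The helper names (`parse_decimal`, `parse_int`, `parse_percent`, `parse_bool`) are hypothetical and this is not Pullcite's implementation; in particular, treating numeric inputs between 0 and 1 as fractions is an assumption based on the `0.30` → `30.0` example.

```python
import re
from decimal import Decimal
from typing import Union

# First run of digits, allowing thousands separators and a decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _first_number(raw: str) -> str:
    """Return the first number in the text with thousands separators removed."""
    match = _NUMBER.search(raw)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return match.group(0).replace(",", "")


def parse_decimal(raw: str) -> Decimal:
    """'$1,500.00' becomes an exact Decimal; the currency symbol is ignored."""
    return Decimal(_first_number(raw))


def parse_int(raw: str) -> int:
    """'100 days' becomes 100; trailing units are ignored."""
    return int(Decimal(_first_number(raw)))


def parse_percent(raw: Union[str, float]) -> float:
    """'30%' becomes 30.0; a numeric fraction such as 0.30 is scaled by 100."""
    if isinstance(raw, str):
        return float(_first_number(raw))
    # Assumption: numeric values in [0, 1] are fractions, not percentages.
    return raw * 100.0 if 0.0 <= raw <= 1.0 else float(raw)


def parse_bool(raw: Union[str, int, bool]) -> bool:
    """Accept common yes/no spellings as well as 0/1."""
    if isinstance(raw, str):
        return raw.strip().lower() in {"yes", "true", "y", "1"}
    return bool(raw)
```

Using `Decimal` rather than `float` for money avoids binary rounding error, which is presumably why the table maps currency to `Decimal`.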

Search Types

Each field specifies how to search for evidence:

class MySchema(ExtractionSchema):
    # BM25: Keyword search (fast, no embeddings needed)
    invoice_number = StringField(
        query="invoice number invoice #",
        search_type=SearchType.BM25,
    )

    # Semantic: Vector similarity (requires embeddings)
    description = StringField(
        query="product service description",
        search_type=SearchType.SEMANTIC,
    )

    # Hybrid: Combined BM25 + semantic with rank fusion
    vendor = StringField(
        query="vendor company supplier",
        search_type=SearchType.HYBRID,
    )

For semantic/hybrid search, provide a retriever:

from pullcite import BM25Searcher, Extractor
from pullcite.embeddings.openai import OpenAIEmbedder
from pullcite.retrieval.memory import MemoryRetriever

extractor = Extractor(
    schema=MySchema,
    llm=my_llm,  # Any LLM instance, e.g. AnthropicLLM()
    searcher=BM25Searcher(),
    retriever=MemoryRetriever(OpenAIEmbedder()),  # For semantic and hybrid fields
)
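
The hybrid search type is described above as combining BM25 and semantic results "with rank fusion", without naming the method. One common choice is reciprocal rank fusion (RRF), sketched below as an illustration only; whether Pullcite uses RRF, and with which constant, is an assumption. `rrf_fuse` and the chunk IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks
    that rank well in more than one list rise to the top. k=60 is the
    conventional default for RRF, not a Pullcite setting.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    bm25_ranking = ["c3", "c1", "c7"]      # keyword search order
    semantic_ranking = ["c1", "c9", "c3"]  # vector search order
    print(rrf_fuse([bm25_ranking, semantic_ranking]))
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.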

Custom Prompts

Extra Instructions (append to default prompt)

extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    extra_instructions="""
    - All amounts are in USD
    - Dates should be YYYY-MM-DD format
    - Use the value from "Grand Total", not subtotals
    """,
)

Full System Prompt (replace default)

extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    system_prompt="""You are an expert invoice parser.
    Extract all fields precisely. Be careful with:
    - Currency formatting
    - Tax calculations
    - Line item totals vs grand total
    """,
)

Custom Prompt Builder (full control)

def my_prompt_builder(schema, field_contexts):
    """Build custom prompt with access to schema and retrieved contexts."""
    lines = ["Extract these fields:"]
    for name, field in schema.get_fields().items():
        contexts = field_contexts.get(name, [])
        lines.append(f"\n## {name}")
        if contexts:
            lines.append(f"Found in: {contexts[0].text[:200]}")
    return "\n".join(lines)

extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    prompt_builder=my_prompt_builder,
)

Handling Large Documents

For documents or schemas too large to fit in a single LLM call, use batching:

extractor = Extractor(
    schema=LargeSchema,  # 50+ fields
    llm=my_llm,
    searcher=BM25Searcher(),

    # Batching options
    max_fields_per_batch=10,       # Max fields per LLM call
    max_context_chars=50000,       # Max context chars per batch

    # Skip full document text (use only retrieved excerpts)
    include_document_text=False,
    top_k=10,                      # More chunks per field
)

This splits extraction into multiple LLM calls, each handling a subset of fields with their relevant context.
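
To make the two limits concrete, here is a minimal sketch of how fields could be grouped so that no batch exceeds either one. It uses simple greedy packing, which is an assumption; Pullcite's actual batching strategy may differ. `plan_batches` is a hypothetical helper that takes the number of context characters retrieved for each field.

```python
from typing import Dict, List


def plan_batches(
    context_chars: Dict[str, int],
    max_fields_per_batch: int = 10,
    max_context_chars: int = 50_000,
) -> List[List[str]]:
    """Greedily group field names into batches that respect both limits.

    A field whose context alone exceeds max_context_chars still gets a
    batch of its own rather than being dropped.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_chars = 0
    for name, chars in context_chars.items():
        too_many = len(current) >= max_fields_per_batch
        too_long = bool(current) and current_chars + chars > max_context_chars
        if too_many or too_long:
            # Close the current batch and start a new one.
            batches.append(current)
            current, current_chars = [], 0
        current.append(name)
        current_chars += chars
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sizes = {"a": 30_000, "b": 30_000, "c": 10_000, "d": 5_000}
    print(plan_batches(sizes, max_fields_per_batch=2))
```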

Evidence

Every extracted value can be traced to the source:

result = extractor.extract(document)

for field_name, evidence in result.evidence_map.items():
    print(f"{field_name}:")
    print(f"  Value: {evidence.value}")
    print(f"  Quote: {evidence.quote}")
    print(f"  Page: {evidence.page}")
    print(f"  Bounding Box: {evidence.bbox}")  # (x0, y0, x1, y1) in PDF points
    print(f"  Confidence: {evidence.confidence:.0%}")
    print(f"  Verified: {evidence.verified}")
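
If you want an independent check on top of the `verified` flag, one simple approach is to confirm that the quote really occurs in the page text after normalizing whitespace and case. The sketch below shows that idea with the standard library; `quote_in_source` is a hypothetical helper, and Pullcite's own verification may be stricter or fuzzier than this.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and ignore case differences."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def quote_in_source(quote: str, page_text: str) -> bool:
    """Return True if the quote appears in the page text after normalization.

    An empty quote never counts as evidence.
    """
    needle = normalize(quote)
    return bool(needle) and needle in normalize(page_text)


if __name__ == "__main__":
    # PDF text extraction often breaks lines and pads spaces mid-phrase.
    page = "Subtotal: $1,400.00\nGrand   Total:\n$1,500.00"
    print(quote_in_source("Grand Total: $1,500.00", page))
```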

Verification Status

from pullcite import ExtractionStatus

result = extractor.extract(document)

if result.status == ExtractionStatus.VERIFIED:
    print("All fields verified against source")
elif result.status == ExtractionStatus.PARTIAL:
    print("Some fields could not be verified")
elif result.status == ExtractionStatus.FAILED:
    print("Extraction failed")

# Check individual field results
for vr in result.verification_results:
    print(f"{vr.path}: {vr.status.value}")
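
When the status is `PARTIAL`, a typical next step is to route the uncertain fields to a human reviewer. The sketch below shows one way to pick them, using the `verified` and `confidence` attributes shown in the Evidence section. `FieldEvidence` is a stand-in dataclass so the example runs on its own, and both `fields_needing_review` and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldEvidence:
    """Stand-in for an evidence record; mirrors only two of its attributes."""
    verified: bool
    confidence: float


def fields_needing_review(
    evidence_map: Dict[str, FieldEvidence],
    min_confidence: float = 0.8,
) -> List[str]:
    """List fields that are unverified or fall below the confidence threshold."""
    return sorted(
        name
        for name, ev in evidence_map.items()
        if not ev.verified or ev.confidence < min_confidence
    )


if __name__ == "__main__":
    evidence_map = {
        "total": FieldEvidence(verified=True, confidence=0.95),
        "vendor": FieldEvidence(verified=False, confidence=0.90),
        "due_date": FieldEvidence(verified=True, confidence=0.50),
    }
    print(fields_needing_review(evidence_map))
```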

LLM Providers

Anthropic Claude

from pullcite.llms.anthropic import AnthropicLLM

llm = AnthropicLLM(
    api_key="...",  # Or ANTHROPIC_API_KEY env var
    model="claude-sonnet-4-20250514",
)

OpenAI GPT

from pullcite.llms.openai import OpenAILLM

llm = OpenAILLM(
    api_key="...",  # Or OPENAI_API_KEY env var
    model="gpt-4o",
)

Project Structure

pullcite/
├── __init__.py          # Main exports
├── core/
│   ├── document.py      # Document loading + chunking
│   ├── chunk.py         # Chunk dataclass
│   ├── evidence.py      # Evidence, VerificationResult
│   └── result.py        # ExtractionResult, stats
├── schema/
│   ├── base.py          # ExtractionSchema, Field, SearchType
│   ├── fields.py        # StringField, DecimalField, etc.
│   └── extractor.py     # SchemaExtractor (Extractor)
├── search/
│   ├── base.py          # Searcher ABC, SearchResult
│   ├── bm25.py          # BM25Searcher (tantivy)
│   └── hybrid.py        # HybridSearcher
├── embeddings/
│   ├── openai.py        # OpenAI embeddings
│   ├── voyage.py        # Voyage AI embeddings
│   └── local.py         # Sentence Transformers
├── retrieval/
│   ├── memory.py        # In-memory vector store
│   ├── chroma.py        # ChromaDB
│   └── pgvector.py      # PostgreSQL pgvector
└── llms/
    ├── anthropic.py     # Anthropic Claude
    └── openai.py        # OpenAI GPT

Environment Variables

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export VOYAGE_API_KEY="..."
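
The provider examples accept either an explicit `api_key` or the matching environment variable. A minimal sketch of that lookup is shown below. `resolve_api_key` is a hypothetical helper, and the precedence (explicit argument first, then environment) is an assumption rather than documented Pullcite behavior.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str],
    env_var: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit key if given, otherwise read it from the environment.

    Raises RuntimeError with a hint when neither source provides a key,
    so a missing credential fails early instead of at the first API call.
    """
    key = explicit or environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"No API key: pass api_key=... or set {env_var}")
    return key


if __name__ == "__main__":
    # A plain dict stands in for os.environ so the example has no side effects.
    fake_env = {"ANTHROPIC_API_KEY": "sk-ant-example"}
    print(resolve_api_key(None, "ANTHROPIC_API_KEY", fake_env))
```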

Development

git clone https://github.com/usercando/pullcite
cd pullcite
pip install -e ".[dev]"

pytest tests/ -v

License

MIT

Release history

Version	Date
0.2.7	Jan 23, 2026
0.2.6	Jan 23, 2026
0.2.5	Jan 23, 2026
0.2.4	Jan 6, 2026
0.2.3	Jan 6, 2026
0.2.2	Jan 6, 2026
0.2.1	Jan 5, 2026
0.2.0	Jan 5, 2026
0.1.0 (this version)	Jan 4, 2026
0.0.6	Jan 3, 2026
0.0.5	Jan 3, 2026
0.0.4	Jan 3, 2026
0.0.3	Jan 3, 2026
0.0.2	Jan 2, 2026
0.0.1	Jan 2, 2026

Download files

Source Distribution

pullcite-0.1.0.tar.gz (419.9 kB), uploaded Jan 4, 2026, Source

Built Distribution

pullcite-0.1.0-py3-none-any.whl (101.1 kB), uploaded Jan 4, 2026, Python 3

File details

Details for the file pullcite-0.1.0.tar.gz.

File metadata

Download URL: pullcite-0.1.0.tar.gz
Upload date: Jan 4, 2026
Size: 419.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pullcite-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9185fd8967ebf54ea6775c338d556256ab68df48a1b4f0ab1c859082ef8ccd6d`
MD5	`d678e29b3fdb0548000be1cbeb01d194`
BLAKE2b-256	`ec0597fadd318b062e2855a92f2fa23a8ba940e88b94e32ee6edf2f50039a798`

File details

Details for the file pullcite-0.1.0-py3-none-any.whl.

File metadata

Download URL: pullcite-0.1.0-py3-none-any.whl
Upload date: Jan 4, 2026
Size: 101.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pullcite-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`744fd1c84477f55d1ead0cd2a894774e7d6acf31cecbeb69bafeffbb45232864`
MD5	`f66290ba53c4be96b41ae5eb4692750a`
BLAKE2b-256	`d8ee95de9d29a8a6585f09a6a973523ac742615251997513707eee54e7186ed5`
