Skip to main content

Evidence-backed structured extraction from documents

Project description

Pullcite

Alpha — API may change. Expect breaking changes until v1.0.

Evidence-backed structured extraction from documents.

Extract structured data from documents using LLMs while providing proof of where each value came from (quote + page + bounding box).

from pullcite import Document, Extractor, ExtractionSchema, StringField, DecimalField, BM25Searcher
from pullcite.llms.anthropic import AnthropicLLM

class Invoice(ExtractionSchema):
    vendor = StringField(query="vendor company name", description="Company that issued the invoice")
    total = DecimalField(query="total amount due", description="Final amount due")

extractor = Extractor(schema=Invoice, llm=AnthropicLLM(), searcher=BM25Searcher())
result = extractor.extract(Document.from_file("invoice.pdf"))

print(result.data.vendor)                # "Acme Corp"
print(result.data.total)                 # Decimal("1500.00")
print(result.evidence_map["total"].quote) # "Grand Total: $1,500.00"
print(result.evidence_map["total"].page)  # 1

Installation

pip install pullcite                 # Core
pip install pullcite[anthropic]      # + Claude
pip install pullcite[openai]         # + GPT
pip install pullcite[docling]        # + PDF/DOCX parsing
pip install pullcite[all]            # Everything

Defining Schemas

Each field specifies a search query and description:

from pullcite import ExtractionSchema, StringField, DecimalField, PercentField, BooleanField

class HealthPlan(ExtractionSchema):
    plan_name = StringField(
        query="plan name health plan title",
        description="Official name of the health insurance plan",
    )
    individual_deductible = DecimalField(
        query="individual deductible annual",
        description="Annual deductible for individual coverage (in-network)",
    )
    coinsurance = PercentField(
        query="coinsurance percentage member pays",
        description="Percentage the member pays after deductible (not plan's share)",
    )
    preventive_covered = BooleanField(
        query="preventive care covered",
        description="Whether preventive care is covered at no cost",
        required=False,
    )

Field Types

Type Python Parses
StringField str Text
IntegerField int 100, "100 days"
DecimalField Decimal "$1,500.00"1500.00
PercentField float "30%", 0.3030.0
BooleanField bool "yes", "true", 1True
DateField str "2024-01-15"
ListField list ["a", "b"]
EnumField str Must match choices

Descriptions Matter

Descriptions tell the LLM what each field means. Without them, extraction is less accurate:

# Good - LLM understands context
total = DecimalField(query="total", description="Final invoice total including tax")

# Works but less accurate - LLM only sees field name "total"
total = DecimalField(query="total")

Chunking

Chunking controls how documents are split for search. Critical for extraction quality.

from pullcite import Document, SlidingWindowChunker, SentenceChunker

# Explicit chunking (recommended)
doc = Document.from_file("report.pdf", chunker=SlidingWindowChunker(size=500, stride=250))

# Sentence-aware chunking (better for prose)
doc = Document.from_file("contract.pdf", chunker=SentenceChunker(target_size=1000, overlap=200))

# Default: SentenceChunker(target_size=1200, overlap=200)
doc = Document.from_file("invoice.pdf")
Chunker Use Case
SlidingWindowChunker(size, stride) Predictable, good default
SentenceChunker(target_size, overlap) Prose documents
ParagraphChunker(target_size, overlap_paragraphs) Well-structured docs

Extraction

from pullcite import Extractor, BM25Searcher
from pullcite.llms.anthropic import AnthropicLLM

extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(),
    searcher=BM25Searcher(),
    top_k=5,         # Chunks per field (tune for your docs)
    verify=True,     # Verify against source (default)
)

result = extractor.extract(doc)

top_k Parameter

Controls chunks retrieved per field. Critical for quality.

  • top_k=5 (default) - Good starting point
  • Higher = more context, better accuracy, more tokens
  • Lower = faster, cheaper, may miss context

Async Support

import asyncio

results = await asyncio.gather(*[
    extractor.extract_async(doc) for doc in documents
])

Verification

Pullcite verifies extracted values against the source document:

  1. Search - Retrieve chunks using field's query
  2. Parse - Find values in chunk text via field.parse_from_text()
  3. Compare - Check match via field.compare() (type-aware: tolerances for decimals, case-insensitive for strings)
print(result.status)  # VERIFIED, PARTIAL, or FAILED

for vr in result.verification_results:
    print(f"{vr.path}: {vr.status.value}")
    # MATCH - Value verified in source
    # MISMATCH - Found different value
    # NOT_FOUND - Required field missing
    # SKIPPED - No context to verify

Evidence

Every verified field has traceable evidence:

evidence = result.evidence_map["total"]
print(evidence.quote)       # "Grand Total: $1,500.00"
print(evidence.page)        # 1
print(evidence.bbox)        # (72.0, 540.2, 200.5, 555.8)
print(evidence.confidence)  # 0.95
print(evidence.verified)    # True

Complete Example

from pullcite import (
    Document, Extractor, ExtractionSchema, BM25Searcher,
    StringField, DecimalField, PercentField, BooleanField,
    SentenceChunker,
)
from pullcite.llms.anthropic import AnthropicLLM


class HealthPlan(ExtractionSchema):
    plan_name = StringField(
        query="plan name health plan title",
        description="Official name of the health insurance plan",
    )
    plan_type = StringField(
        query="plan type HMO PPO EPO",
        description="Type of plan: HMO, PPO, EPO, or POS",
    )
    individual_deductible = DecimalField(
        query="individual deductible annual",
        description="Annual deductible for individual coverage (in-network)",
    )
    family_deductible = DecimalField(
        query="family deductible annual",
        description="Annual deductible for family coverage (in-network)",
    )
    coinsurance = PercentField(
        query="coinsurance percentage member pays",
        description="Percentage the member pays after deductible",
    )
    pcp_copay = DecimalField(
        query="primary care physician copay PCP",
        description="Copay for primary care visits",
    )
    preventive_covered = BooleanField(
        query="preventive care covered no cost",
        description="Whether preventive care is covered at 100%",
        required=False,
    )


# Load with explicit chunking
doc = Document.from_file(
    "summary_of_benefits.pdf",
    chunker=SentenceChunker(target_size=1000, overlap=200),
)

# Extract with custom instructions
extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(model="claude-sonnet-4-20250514"),
    searcher=BM25Searcher(),
    top_k=5,
    extra_instructions="""
    - Extract IN-NETWORK values when both in/out-of-network are shown
    - Coinsurance is what the MEMBER pays, not the plan
    - "No charge" or "Covered in full" means $0
    """,
)

result = extractor.extract(doc)

# Results
print(f"Plan: {result.data.plan_name} ({result.data.plan_type})")
print(f"Deductible: ${result.data.individual_deductible}")
print(f"Coinsurance: {result.data.coinsurance}%")
print(f"Status: {result.status}")

# Evidence
for field, evidence in result.evidence_map.items():
    print(f"{field}: \"{evidence.quote[:50]}...\" (page {evidence.page})")

Search Types

Fields can use different search strategies:

class MySchema(ExtractionSchema):
    # BM25: Keyword search (fast, no embeddings)
    invoice_number = StringField(
        query="invoice number invoice #",
        search_type=SearchType.BM25,
    )

    # Semantic: Vector similarity (requires embeddings)
    description = StringField(
        query="product service description",
        search_type=SearchType.SEMANTIC,
    )

    # Hybrid: BM25 + semantic with rank fusion
    vendor = StringField(
        query="vendor company supplier",
        search_type=SearchType.HYBRID,
    )

For semantic/hybrid, provide a retriever:

from pullcite.embeddings.openai import OpenAIEmbedder
from pullcite.retrieval.memory import MemoryRetriever

extractor = Extractor(
    schema=MySchema,
    llm=my_llm,
    searcher=BM25Searcher(),
    retriever=MemoryRetriever(OpenAIEmbedder()),
)

Custom Prompts

# Append instructions to default prompt
extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    extra_instructions="All amounts are in USD. Use Grand Total, not subtotals.",
)

# Replace entire system prompt
extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    system_prompt="You are an expert invoice parser. Extract precisely.",
)

# Full control with custom builder
def my_prompt_builder(schema, field_contexts):
    lines = ["Extract these fields:"]
    for name, field in schema.get_fields().items():
        contexts = field_contexts.get(name, [])
        lines.append(f"\n## {name}: {field.description or ''}")
        if contexts:
            lines.append(f"Found in: {contexts[0].text[:200]}")
    return "\n".join(lines)

extractor = Extractor(..., prompt_builder=my_prompt_builder)

Large Documents

For schemas with many fields or large documents:

extractor = Extractor(
    schema=LargeSchema,
    llm=my_llm,
    searcher=BM25Searcher(),
    max_fields_per_batch=10,      # Split into multiple LLM calls
    max_context_chars=50000,      # Limit context per batch
    include_document_text=False,  # Use only retrieved chunks
    top_k=10,                     # More chunks per field
)

LLM Providers

from pullcite.llms.anthropic import AnthropicLLM
from pullcite.llms.openai import OpenAILLM

llm = AnthropicLLM(model="claude-sonnet-4-20250514")  # Uses ANTHROPIC_API_KEY
llm = OpenAILLM(model="gpt-4o")                       # Uses OPENAI_API_KEY

Environment Variables

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export VOYAGE_API_KEY="..."

Project Structure

pullcite/
├── core/
│   ├── document.py      # Document loading
│   ├── chunk.py         # Chunk dataclass
│   ├── chunker.py       # Chunking strategies
│   ├── evidence.py      # Evidence types
│   └── result.py        # ExtractionResult
├── schema/
│   ├── base.py          # ExtractionSchema, Field
│   ├── fields.py        # Field types
│   └── extractor.py     # Extractor
├── search/
│   ├── bm25.py          # BM25Searcher
│   └── hybrid.py        # HybridSearcher
├── embeddings/          # OpenAI, Voyage, local
├── retrieval/           # Memory, Chroma, pgvector
└── llms/                # Anthropic, OpenAI

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pullcite-0.2.3.tar.gz (387.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pullcite-0.2.3-py3-none-any.whl (81.8 kB view details)

Uploaded Python 3

File details

Details for the file pullcite-0.2.3.tar.gz.

File metadata

  • Download URL: pullcite-0.2.3.tar.gz
  • Upload date:
  • Size: 387.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pullcite-0.2.3.tar.gz
Algorithm Hash digest
SHA256 1efd200cf896058a466233607f7e4ddf45d59d4e288e14e7bff97cd74bdc832e
MD5 e4bc0d44a4b43f87a24c443ae732d905
BLAKE2b-256 8fc7d5fcb4eea7c2f0cb7bf0628787b1c5a453f4edc501f9cb6eeb65f9a7b2cf

See more details on using hashes here.

File details

Details for the file pullcite-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: pullcite-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 81.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pullcite-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 8ca088279c03fd100024d2ea301cb8f8a832232ce8fe53fd2f8809ad5521451d
MD5 b360297e7bcdecf90a416ca3a5e5dc97
BLAKE2b-256 e1d1bf787bff0ca266cafd85b5037caa01a268b19555480faecad94befea411c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page