Evidence-backed structured extraction from documents

Pullcite

Alpha — API may change. Expect breaking changes until v1.0.

Evidence-backed structured extraction from documents.

Extract structured data from documents using LLMs while providing proof of where each value came from (quote + page + bounding box).

from pullcite import Document, Extractor, ExtractionSchema, StringField, DecimalField, BM25Searcher
from pullcite.llms.anthropic import AnthropicLLM

class Invoice(ExtractionSchema):
    vendor = StringField(query="vendor company name", description="Company that issued the invoice")
    total = DecimalField(query="total amount due", description="Final amount due")

extractor = Extractor(schema=Invoice, llm=AnthropicLLM(), searcher=BM25Searcher())
result = extractor.extract(Document.from_file("invoice.pdf"))

print(result.data.vendor)                # "Acme Corp"
print(result.data.total)                 # Decimal("1500.00")
print(result.evidence_map["total"].quote) # "Grand Total: $1,500.00"
print(result.evidence_map["total"].page)  # 1

Installation

pip install pullcite                 # Core
pip install pullcite[anthropic]      # + Claude
pip install pullcite[openai]         # + GPT
pip install pullcite[docling]        # + PDF/DOCX parsing
pip install pullcite[all]            # Everything

Defining Schemas

Each field specifies a search query and description:

from pullcite import ExtractionSchema, StringField, DecimalField, PercentField, BooleanField

class HealthPlan(ExtractionSchema):
    plan_name = StringField(
        query="plan name health plan title",
        description="Official name of the health insurance plan",
    )
    individual_deductible = DecimalField(
        query="individual deductible annual",
        description="Annual deductible for individual coverage (in-network)",
    )
    coinsurance = PercentField(
        query="coinsurance percentage member pays",
        description="Percentage the member pays after deductible (not plan's share)",
    )
    preventive_covered = BooleanField(
        query="preventive care covered",
        description="Whether preventive care is covered at no cost",
        required=False,
    )

Field Types

Type	Python	Parses
`StringField`	`str`	Text
`IntegerField`	`int`	`100`, `"100 days"`
`DecimalField`	`Decimal`	`"$1,500.00"` → `1500.00`
`PercentField`	`float`	`"30%"`, `0.30` → `30.0`
`BooleanField`	`bool`	`"yes"`, `"true"`, `1` → `True`
`DateField`	`str`	`"2024-01-15"`
`ListField`	`list`	`["a", "b"]`
`EnumField`	`str`	Must match choices
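
The "Parses" column implies that raw text is normalized before conversion. A minimal sketch of that idea in plain Python follows. It assumes nothing about Pullcite's internals: `parse_money` and `parse_percent` are hypothetical helpers, and the rule that bare numbers at or below 1 are fractions is an assumption made for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional, Union


def parse_money(text: str) -> Optional[Decimal]:
    """Hypothetical helper: pull the first currency-like number out of text."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    try:
        # Drop thousands separators before converting
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None


def parse_percent(value: Union[str, float]) -> float:
    """Hypothetical helper: normalize "30%" or 0.30 to a percentage value."""
    if isinstance(value, str):
        return float(value.strip().rstrip("%"))
    # Assumption: bare numbers at or below 1 are fractions, not percentages
    return value * 100 if value <= 1 else float(value)
```

Using `Decimal` rather than `float` for money avoids binary rounding surprises when values are later compared against the source text.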

Descriptions Matter

Descriptions tell the LLM what each field means. Without them, extraction is less accurate:

# Good - LLM understands context
total = DecimalField(query="total", description="Final invoice total including tax")

# Works but less accurate - LLM only sees field name "total"
total = DecimalField(query="total")

Chunking

Chunking controls how documents are split for search, and it has a large effect on extraction quality.

from pullcite import Document, SlidingWindowChunker, SentenceChunker

# Explicit chunking (recommended)
doc = Document.from_file("report.pdf", chunker=SlidingWindowChunker(size=500, stride=250))

# Sentence-aware chunking (better for prose)
doc = Document.from_file("contract.pdf", chunker=SentenceChunker(target_size=1000, overlap=200))

# Default: SentenceChunker(target_size=1200, overlap=200)
doc = Document.from_file("invoice.pdf")

Chunker	Use Case
`SlidingWindowChunker(size, stride)`	Predictable chunk sizes, good general-purpose choice
`SentenceChunker(target_size, overlap)`	Prose documents
`ParagraphChunker(target_size, overlap_paragraphs)`	Well-structured docs
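
To make the `size` and `stride` parameters concrete, here is a rough sketch of sliding-window chunking in plain Python. It is not Pullcite's `SlidingWindowChunker`: the function name is hypothetical, and measuring windows in characters is an assumption (the library may count differently).

```python
def sliding_window(text: str, size: int, stride: int) -> list[str]:
    """Split text into windows of `size` characters, advancing by `stride`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With `size=500` and `stride=250`, consecutive windows overlap by `size - stride = 250` characters, so a value that straddles one chunk boundary still appears whole in a neighboring chunk.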

Extraction

from pullcite import Extractor, BM25Searcher
from pullcite.llms.anthropic import AnthropicLLM

extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(),
    searcher=BM25Searcher(),
    top_k=5,         # Chunks per field (tune for your docs)
    verify=True,     # Verify against source (default)
)

result = extractor.extract(doc)

top_k Parameter

Controls how many chunks are retrieved per field, which strongly affects extraction quality.

top_k=5 (default) - Good starting point
Higher = more context, better accuracy, more tokens
Lower = faster, cheaper, may miss context
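
The trade-off above is easier to see with a toy retriever. The sketch below scores chunks with a textbook Okapi BM25 formula and keeps the best `top_k`. It is an illustration only, not `BM25Searcher`: the function name, the whitespace tokenization, and the `k1`/`b` values are all assumptions.

```python
import math
from collections import Counter


def bm25_top_k(query: str, chunks: list[str], top_k: int = 5,
               k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the top_k chunks, best first."""
    docs = [chunk.lower().split() for chunk in chunks]
    avg_len = sum(len(d) for d in docs) / max(len(docs), 1)
    # Document frequency: number of chunks containing each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant: rarer terms contribute more
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

Raising `top_k` only lengthens the returned list, so the extra chunks are by construction the lower-scoring ones. That is why higher values add context and tokens at a diminishing rate.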

Async Support

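import asyncio

async def extract_all(documents):
    # Run the extractions concurrently; results keep the order of `documents`
    return await asyncio.gather(*[
        extractor.extract_async(doc) for doc in documents
    ])

results = asyncio.run(extract_all(documents))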

Verification

Pullcite verifies extracted values against the source document:

1. Search - Retrieve chunks using the field's query
2. Parse - Find values in chunk text via field.parse_from_text()
3. Compare - Check the match via field.compare() (type-aware: tolerances for decimals, case-insensitive for strings)

print(result.status)  # VERIFIED, PARTIAL, or FAILED

for vr in result.verification_results:
    print(f"{vr.path}: {vr.status.value}")
    # MATCH - Value verified in source
    # MISMATCH - Found different value
    # NOT_FOUND - Required field missing
    # SKIPPED - No context to verify
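
The type-aware comparison described in the Compare step can be sketched as follows. This is a hypothetical stand-in, not `field.compare()`: the function name, the tolerance of 0.01, and the use of `None` to signal that nothing was found are assumptions. Only the status names are taken from the list above.

```python
from decimal import Decimal


def compare_values(extracted, found, tolerance=Decimal("0.01")) -> str:
    """Hypothetical comparison returning MATCH, MISMATCH, or NOT_FOUND."""
    if found is None:
        return "NOT_FOUND"
    if isinstance(extracted, Decimal) and isinstance(found, Decimal):
        # Decimals match when they differ by no more than the tolerance
        return "MATCH" if abs(extracted - found) <= tolerance else "MISMATCH"
    if isinstance(extracted, str) and isinstance(found, str):
        # Strings match case-insensitively, ignoring surrounding whitespace
        same = extracted.strip().casefold() == found.strip().casefold()
        return "MATCH" if same else "MISMATCH"
    return "MATCH" if extracted == found else "MISMATCH"
```

Keeping the comparison per type means a formatting difference such as "ACME CORP" versus "Acme Corp" does not count as a mismatch, while a genuinely different amount still does.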

Evidence

Every verified field has traceable evidence:

evidence = result.evidence_map["total"]
print(evidence.quote)       # "Grand Total: $1,500.00"
print(evidence.page)        # 1
print(evidence.bbox)        # (72.0, 540.2, 200.5, 555.8)
print(evidence.confidence)  # 0.95
print(evidence.verified)    # True
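
One way to use these attributes is to flag fields for human review. The sketch below uses a stand-in dataclass with the attributes shown above, so it runs without Pullcite installed. `EvidenceStub`, `needs_review`, the 0.9 threshold, and the sample "vendor" values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvidenceStub:
    """Stand-in with the attributes shown above; not Pullcite's own class."""
    quote: str
    page: int
    bbox: tuple[float, float, float, float]
    confidence: float
    verified: bool


def needs_review(evidence_map: dict[str, EvidenceStub],
                 min_confidence: float = 0.9) -> list[str]:
    """Return field names that are unverified or below the threshold."""
    return sorted(
        name for name, ev in evidence_map.items()
        if not ev.verified or ev.confidence < min_confidence
    )


# Sample data: the "vendor" entry is made up to show a low-confidence case
evidence_map = {
    "total": EvidenceStub("Grand Total: $1,500.00", 1, (72.0, 540.2, 200.5, 555.8), 0.95, True),
    "vendor": EvidenceStub("Acme Corp", 1, (72.0, 90.0, 140.0, 104.0), 0.62, True),
}
print(needs_review(evidence_map))  # ['vendor']
```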

Complete Example

from pullcite import (
    Document, Extractor, ExtractionSchema, BM25Searcher,
    StringField, DecimalField, PercentField, BooleanField,
    SentenceChunker,
)
from pullcite.llms.anthropic import AnthropicLLM


class HealthPlan(ExtractionSchema):
    plan_name = StringField(
        query="plan name health plan title",
        description="Official name of the health insurance plan",
    )
    plan_type = StringField(
        query="plan type HMO PPO EPO",
        description="Type of plan: HMO, PPO, EPO, or POS",
    )
    individual_deductible = DecimalField(
        query="individual deductible annual",
        description="Annual deductible for individual coverage (in-network)",
    )
    family_deductible = DecimalField(
        query="family deductible annual",
        description="Annual deductible for family coverage (in-network)",
    )
    coinsurance = PercentField(
        query="coinsurance percentage member pays",
        description="Percentage the member pays after deductible",
    )
    pcp_copay = DecimalField(
        query="primary care physician copay PCP",
        description="Copay for primary care visits",
    )
    preventive_covered = BooleanField(
        query="preventive care covered no cost",
        description="Whether preventive care is covered at 100%",
        required=False,
    )


# Load with explicit chunking
doc = Document.from_file(
    "summary_of_benefits.pdf",
    chunker=SentenceChunker(target_size=1000, overlap=200),
)

# Extract with custom instructions
extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(model="claude-sonnet-4-20250514"),
    searcher=BM25Searcher(),
    top_k=5,
    extra_instructions="""
    - Extract IN-NETWORK values when both in/out-of-network are shown
    - Coinsurance is what the MEMBER pays, not the plan
    - "No charge" or "Covered in full" means $0
    """,
)

result = extractor.extract(doc)

# Results
print(f"Plan: {result.data.plan_name} ({result.data.plan_type})")
print(f"Deductible: ${result.data.individual_deductible}")
print(f"Coinsurance: {result.data.coinsurance}%")
print(f"Status: {result.status}")

# Evidence
for field, evidence in result.evidence_map.items():
    print(f"{field}: \"{evidence.quote[:50]}...\" (page {evidence.page})")

Search Types

Fields can use different search strategies:

# Note: SearchType must also be imported; its import path is not shown in this README
class MySchema(ExtractionSchema):
    # BM25: Keyword search (fast, no embeddings)
    invoice_number = StringField(
        query="invoice number invoice #",
        search_type=SearchType.BM25,
    )

    # Semantic: Vector similarity (requires embeddings)
    description = StringField(
        query="product service description",
        search_type=SearchType.SEMANTIC,
    )

    # Hybrid: BM25 + semantic with rank fusion
    vendor = StringField(
        query="vendor company supplier",
        search_type=SearchType.HYBRID,
    )

For semantic/hybrid, provide a retriever:

from pullcite.embeddings.openai import OpenAIEmbedder
from pullcite.retrieval.memory import MemoryRetriever

extractor = Extractor(
    schema=MySchema,
    llm=my_llm,
    searcher=BM25Searcher(),
    retriever=MemoryRetriever(OpenAIEmbedder()),
)
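
The hybrid strategy is described above only as "BM25 + semantic with rank fusion". One widely used fusion method is reciprocal rank fusion (RRF), sketched below. Whether Pullcite uses RRF is not stated here, so treat this as an illustration of the general idea. The function name, the chunk IDs, and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; each input list is ordered best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it contains
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk ID for determinism
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_ranking = ["c3", "c1", "c7"]
semantic_ranking = ["c1", "c4", "c3"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)  # ['c1', 'c3', 'c4', 'c7']
```

RRF works on ranks rather than raw scores, so BM25 scores and cosine similarities never need to be put on a common scale.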

Custom Prompts

# Append instructions to default prompt
extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    extra_instructions="All amounts are in USD. Use Grand Total, not subtotals.",
)

# Replace entire system prompt
extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    system_prompt="You are an expert invoice parser. Extract precisely.",
)

# Full control with custom builder
def my_prompt_builder(schema, field_contexts):
    lines = ["Extract these fields:"]
    for name, field in schema.get_fields().items():
        contexts = field_contexts.get(name, [])
        lines.append(f"\n## {name}: {field.description or ''}")
        if contexts:
            lines.append(f"Found in: {contexts[0].text[:200]}")
    return "\n".join(lines)

extractor = Extractor(..., prompt_builder=my_prompt_builder)

Large Documents

For schemas with many fields or large documents:

extractor = Extractor(
    schema=LargeSchema,
    llm=my_llm,
    searcher=BM25Searcher(),
    max_fields_per_batch=10,      # Split into multiple LLM calls
    max_context_chars=50000,      # Limit context per batch
    include_document_text=False,  # Use only retrieved chunks
    top_k=10,                     # More chunks per field
)
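
The comments on `max_fields_per_batch` and `max_context_chars` suggest two simple mechanisms: splitting fields into groups and capping the context sent with each group. A rough sketch of both follows. These helpers are hypothetical and only mirror the parameter names; how Pullcite actually batches and truncates is not documented here.

```python
def batch_fields(field_names: list[str], max_fields_per_batch: int) -> list[list[str]]:
    """Split field names into consecutive batches of at most the given size."""
    if max_fields_per_batch <= 0:
        raise ValueError("max_fields_per_batch must be positive")
    return [
        field_names[i:i + max_fields_per_batch]
        for i in range(0, len(field_names), max_fields_per_batch)
    ]


def fit_context(chunks: list[str], max_context_chars: int) -> list[str]:
    """Keep chunks in ranked order until the character budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept
```

For a schema with 25 fields and `max_fields_per_batch=10`, this yields three batches of 10, 10, and 5 fields, which would correspond to three LLM calls.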

LLM Providers

from pullcite.llms.anthropic import AnthropicLLM
from pullcite.llms.openai import OpenAILLM

llm = AnthropicLLM(model="claude-sonnet-4-20250514")  # Uses ANTHROPIC_API_KEY
llm = OpenAILLM(model="gpt-4o")                       # Uses OPENAI_API_KEY

Environment Variables

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export VOYAGE_API_KEY="..."
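
A small pre-flight check can catch a missing key before any LLM call is made. The helper below is a hypothetical convenience, not part of Pullcite. It only reads the environment, and which variables are required depends on the providers you use.

```python
import os


def missing_keys(required: list[str], env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_keys(["ANTHROPIC_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```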

Project Structure

pullcite/
├── core/
│   ├── document.py      # Document loading
│   ├── chunk.py         # Chunk dataclass
│   ├── chunker.py       # Chunking strategies
│   ├── evidence.py      # Evidence types
│   └── result.py        # ExtractionResult
├── schema/
│   ├── base.py          # ExtractionSchema, Field
│   ├── fields.py        # Field types
│   └── extractor.py     # Extractor
├── search/
│   ├── bm25.py          # BM25Searcher
│   └── hybrid.py        # HybridSearcher
├── embeddings/          # OpenAI, Voyage, local
├── retrieval/           # Memory, Chroma, pgvector
└── llms/                # Anthropic, OpenAI

License

MIT

Release history

Version	Date
0.2.7	Jan 23, 2026
0.2.6	Jan 23, 2026
0.2.5	Jan 23, 2026
0.2.4	Jan 6, 2026
0.2.3	Jan 6, 2026
0.2.2 (this version)	Jan 6, 2026
0.2.1	Jan 5, 2026
0.2.0	Jan 5, 2026
0.1.0	Jan 4, 2026
0.0.6	Jan 3, 2026
0.0.5	Jan 3, 2026
0.0.4	Jan 3, 2026
0.0.3	Jan 3, 2026
0.0.2	Jan 2, 2026
0.0.1	Jan 2, 2026

Download files

Source Distribution

pullcite-0.2.2.tar.gz (387.0 kB), uploaded Jan 6, 2026, Source

Built Distribution

pullcite-0.2.2-py3-none-any.whl (81.8 kB), uploaded Jan 6, 2026, Python 3

File details

Details for the file pullcite-0.2.2.tar.gz.

File metadata

Download URL: pullcite-0.2.2.tar.gz
Upload date: Jan 6, 2026
Size: 387.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pullcite-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`d3c408137bcc6c60382b776a3e6245b07b24b907165c36a9c603348f8fa171c3`
MD5	`74719d79b09b3a8c635b39e8e7d5cb56`
BLAKE2b-256	`caea5202c2b7ad32e007d28171f05ac4fb2d6b1540d0b406cf1b7dd20c450e96`


File details

Details for the file pullcite-0.2.2-py3-none-any.whl.

File metadata

Download URL: pullcite-0.2.2-py3-none-any.whl
Upload date: Jan 6, 2026
Size: 81.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pullcite-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5fb45b5127d76d78194eaa66350ebf96f7e4d149f43c8c99a5e60cbe5c4eb25e`
MD5	`15d0e2b455d29697a86ab89ac13524e3`
BLAKE2b-256	`4457d8cb17ad570085bd96c2070847fafbe0a1aa108b77aabad45461dd87f557`

