Evidence-backed structured extraction from documents
Pullcite
Alpha — API may change. Expect breaking changes until v1.0.
Extract structured data from documents using LLMs while providing proof of where each value came from (quote + page + bounding box).
from pullcite import Document, Extractor, ExtractionSchema, StringField, DecimalField, BM25Searcher
from pullcite.llms.anthropic import AnthropicLLM
class Invoice(ExtractionSchema):
    vendor = StringField(query="vendor company name", description="Company that issued the invoice")
    total = DecimalField(query="total amount due", description="Final amount due")
extractor = Extractor(schema=Invoice, llm=AnthropicLLM(), searcher=BM25Searcher())
result = extractor.extract(Document.from_file("invoice.pdf"))
print(result.data.vendor) # "Acme Corp"
print(result.data.total) # Decimal("1500.00")
print(result.evidence_map["total"].quote) # "Grand Total: $1,500.00"
print(result.evidence_map["total"].page) # 1
Installation
pip install pullcite # Core
pip install pullcite[anthropic] # + Claude
pip install pullcite[openai] # + GPT
pip install pullcite[docling] # + PDF/DOCX parsing
pip install pullcite[all] # Everything
Defining Schemas
Each field specifies a search query and description:
from pullcite import ExtractionSchema, StringField, DecimalField, PercentField, BooleanField
class HealthPlan(ExtractionSchema):
    plan_name = StringField(
        query="plan name health plan title",
        description="Official name of the health insurance plan",
    )
    individual_deductible = DecimalField(
        query="individual deductible annual",
        description="Annual deductible for individual coverage (in-network)",
    )
    coinsurance = PercentField(
        query="coinsurance percentage member pays",
        description="Percentage the member pays after deductible (not plan's share)",
    )
    preventive_covered = BooleanField(
        query="preventive care covered",
        description="Whether preventive care is covered at no cost",
        required=False,
    )
Field Types
| Type | Python | Parses |
|---|---|---|
| StringField | str | Text |
| IntegerField | int | 100, "100 days" |
| DecimalField | Decimal | "$1,500.00" → 1500.00 |
| PercentField | float | "30%", 0.30 → 30.0 |
| BooleanField | bool | "yes", "true", 1 → True |
| DateField | str | "2024-01-15" |
| ListField | list | ["a", "b"] |
| EnumField | str | Must match choices |
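The parsing rules in the Field Types table can be illustrated with a small standalone sketch. This is not Pullcite's implementation: parse_integer, parse_decimal, parse_percent, and parse_boolean are hypothetical helpers, and the rule that a bare number no greater than 1 is scaled by 100 is an assumption inferred from the 0.30 → 30.0 row.

```python
import re
from decimal import Decimal


def parse_integer(raw):
    """Pull the first whole number out of a value such as "100 days"."""
    match = re.search(r"-?\d+", str(raw))
    if match is None:
        raise ValueError(f"no integer found in {raw!r}")
    return int(match.group())


def parse_decimal(raw):
    """Drop currency symbols and thousands separators, then build a Decimal."""
    cleaned = re.sub(r"[^\d.\-]", "", str(raw))
    return Decimal(cleaned)


def parse_percent(raw):
    """Return a percentage on a 0-100 scale."""
    if isinstance(raw, str):
        return float(raw.strip().rstrip("%"))
    value = float(raw)
    # Assumption: a bare number no greater than 1 is a fraction (0.30 means 30%).
    return value * 100 if abs(value) <= 1 else value


def parse_boolean(raw):
    """Treat "yes", "true", and 1 as True; everything else as False."""
    return str(raw).strip().lower() in {"yes", "true", "1"}
```

Using Decimal rather than float for money avoids binary rounding error, which is presumably why the table lists Decimal as the Python type for DecimalField.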
Descriptions Matter
Descriptions tell the LLM what each field means. Without them, extraction is less accurate:
# Good - LLM understands context
total = DecimalField(query="total", description="Final invoice total including tax")
# Works but less accurate - LLM only sees field name "total"
total = DecimalField(query="total")
Chunking
Chunking controls how documents are split for search. Critical for extraction quality.
from pullcite import Document, SlidingWindowChunker, SentenceChunker
# Explicit chunking (recommended)
doc = Document.from_file("report.pdf", chunker=SlidingWindowChunker(size=500, stride=250))
# Sentence-aware chunking (better for prose)
doc = Document.from_file("contract.pdf", chunker=SentenceChunker(target_size=1000, overlap=200))
# Default: SentenceChunker(target_size=1200, overlap=200)
doc = Document.from_file("invoice.pdf")
| Chunker | Use Case |
|---|---|
| SlidingWindowChunker(size, stride) | Predictable, good default |
| SentenceChunker(target_size, overlap) | Prose documents |
| ParagraphChunker(target_size, overlap_paragraphs) | Well-structured docs |
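To make the size and stride parameters concrete, here is a minimal sketch of sliding-window chunking. It is an illustration, not the library's SlidingWindowChunker: the function name sliding_window_chunks is hypothetical, and measuring size and stride in characters is an assumption (the real chunker may count tokens or words).

```python
def sliding_window_chunks(text, size=500, stride=250):
    """Split text into windows of `size` characters that start every `stride` characters."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With stride smaller than size, consecutive windows overlap by size - stride characters, so a value that straddles one chunk boundary still appears whole in the neighboring chunk.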
Extraction
from pullcite import Extractor, BM25Searcher
from pullcite.llms.anthropic import AnthropicLLM
extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(),
    searcher=BM25Searcher(),
    top_k=5,      # Chunks per field (tune for your docs)
    verify=True,  # Verify against source (default)
)
result = extractor.extract(doc)
Structured Outputs (Claude)
Use to_json_schema() to compile your schema and enable Claude structured outputs with output_format.
schema_json = HealthPlan.to_json_schema()
extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(structured_output=True),
    searcher=BM25Searcher(),
)
Structured outputs cannot be combined with Claude citations.
top_k Parameter
Controls chunks retrieved per field. Critical for quality.
- top_k=5 (default) - Good starting point
- Higher = more context, better accuracy, more tokens
- Lower = faster, cheaper, may miss context
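The trade-off is easier to see with a toy retriever. The sketch below ranks chunks with the standard Okapi BM25 formula and keeps the best top_k. It is a simplified stand-in, not BM25Searcher itself: the names tokenize and bm25_top_k, the regex tokenizer, and the constants k1=1.5 and b=0.75 are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_top_k(query, chunks, top_k=5, k1=1.5, b=0.75):
    """Return the top_k chunks ranked by Okapi BM25 score for the query."""
    if not chunks:
        return []
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of chunks that contain each term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scored = []
    for index, doc in enumerate(docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scored.append((score, index))

    # Highest score first; ties keep document order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[index] for _, index in scored[:top_k]]
```

Raising top_k only lengthens the returned list, so the extra chunks are always lower-ranked ones: they add context and tokens, and help mainly when the right passage does not share many keywords with the query.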
Async Support
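import asyncio

async def main():
    # Run every extraction concurrently; results come back in input order
    return await asyncio.gather(*[
        extractor.extract_async(doc) for doc in documents
    ])

results = asyncio.run(main())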
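When processing many documents, launching every request at once can run into provider rate limits. One common pattern, sketched below, is to cap the number of in-flight calls with a semaphore. The helper extract_all and the stand-in coroutine fake_extract are hypothetical; in real use you would pass something like extractor.extract_async instead of the stand-in.

```python
import asyncio


async def fake_extract(doc):
    """Stand-in for an async extraction call; sleeps instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return f"result for {doc}"


async def extract_all(documents, extract, max_concurrency=4):
    """Run `extract` over all documents with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(doc):
        async with semaphore:
            return await extract(doc)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(doc) for doc in documents))


results = asyncio.run(extract_all(["a.pdf", "b.pdf", "c.pdf"], fake_extract))
```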
Verification
Pullcite verifies extracted values against the source document:
- Search - Retrieve chunks using the field's query
- Parse - Find values in chunk text via field.parse_from_text()
- Compare - Check match via field.compare() (type-aware: tolerances for decimals, case-insensitive for strings)
print(result.status) # VERIFIED, PARTIAL, or FAILED
for vr in result.verification_results:
    print(f"{vr.path}: {vr.status.value}")
    # MATCH - Value verified in source
    # MISMATCH - Found different value
    # NOT_FOUND - Required field missing
    # SKIPPED - No context to verify
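The type-aware comparison in the Compare step can be pictured with a short sketch. This is an illustration of the idea only, not the body of field.compare(): the function name values_match and the default tolerance of 0.01 are assumptions.

```python
from decimal import Decimal


def values_match(extracted, found, tolerance=Decimal("0.01")):
    """Compare an extracted value with one parsed from the source text."""
    if isinstance(extracted, Decimal) and isinstance(found, Decimal):
        # Decimals match when they differ by no more than the tolerance
        return abs(extracted - found) <= tolerance
    if isinstance(extracted, str) and isinstance(found, str):
        # Strings match regardless of case and surrounding whitespace
        return extracted.strip().casefold() == found.strip().casefold()
    # Everything else falls back to plain equality
    return extracted == found
```

A comparison like this is what separates MATCH from MISMATCH: a decimal that differs only by rounding, or a name that differs only in capitalization, still counts as verified.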
Evidence
Every verified field has traceable evidence:
evidence = result.evidence_map["total"]
print(evidence.quote) # "Grand Total: $1,500.00"
print(evidence.page) # 1
print(evidence.bbox) # (72.0, 540.2, 200.5, 555.8)
print(evidence.confidence) # 0.95
print(evidence.verified) # True
Complete Example
from pullcite import (
    Document, Extractor, ExtractionSchema, BM25Searcher,
    StringField, DecimalField, PercentField, BooleanField,
    SentenceChunker,
)
from pullcite.llms.anthropic import AnthropicLLM
class HealthPlan(ExtractionSchema):
    plan_name = StringField(
        query="plan name health plan title",
        description="Official name of the health insurance plan",
    )
    plan_type = StringField(
        query="plan type HMO PPO EPO",
        description="Type of plan: HMO, PPO, EPO, or POS",
    )
    individual_deductible = DecimalField(
        query="individual deductible annual",
        description="Annual deductible for individual coverage (in-network)",
    )
    family_deductible = DecimalField(
        query="family deductible annual",
        description="Annual deductible for family coverage (in-network)",
    )
    coinsurance = PercentField(
        query="coinsurance percentage member pays",
        description="Percentage the member pays after deductible",
    )
    pcp_copay = DecimalField(
        query="primary care physician copay PCP",
        description="Copay for primary care visits",
    )
    preventive_covered = BooleanField(
        query="preventive care covered no cost",
        description="Whether preventive care is covered at 100%",
        required=False,
    )
# Load with explicit chunking
doc = Document.from_file(
    "summary_of_benefits.pdf",
    chunker=SentenceChunker(target_size=1000, overlap=200),
)
# Extract with custom instructions
extractor = Extractor(
    schema=HealthPlan,
    llm=AnthropicLLM(model="claude-sonnet-4-20250514"),
    searcher=BM25Searcher(),
    top_k=5,
    extra_instructions="""
    - Extract IN-NETWORK values when both in/out-of-network are shown
    - Coinsurance is what the MEMBER pays, not the plan
    - "No charge" or "Covered in full" means $0
    """,
)
result = extractor.extract(doc)
# Results
print(f"Plan: {result.data.plan_name} ({result.data.plan_type})")
print(f"Deductible: ${result.data.individual_deductible}")
print(f"Coinsurance: {result.data.coinsurance}%")
print(f"Status: {result.status}")
# Evidence
for field, evidence in result.evidence_map.items():
    print(f"{field}: \"{evidence.quote[:50]}...\" (page {evidence.page})")
Search Types
Fields can use different search strategies:
class MySchema(ExtractionSchema):
    # BM25: Keyword search (fast, no embeddings)
    invoice_number = StringField(
        query="invoice number invoice #",
        search_type=SearchType.BM25,
    )
    # Semantic: Vector similarity (requires embeddings)
    description = StringField(
        query="product service description",
        search_type=SearchType.SEMANTIC,
    )
    # Hybrid: BM25 + semantic with rank fusion
    vendor = StringField(
        query="vendor company supplier",
        search_type=SearchType.HYBRID,
    )
For semantic/hybrid, provide a retriever:
from pullcite.embeddings.openai import OpenAIEmbedder
from pullcite.retrieval.memory import MemoryRetriever
extractor = Extractor(
    schema=MySchema,
    llm=my_llm,
    searcher=BM25Searcher(),
    retriever=MemoryRetriever(OpenAIEmbedder()),
)
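The hybrid strategy combines a BM25 ranking with a semantic ranking through rank fusion. The README does not say which fusion method Pullcite uses; reciprocal rank fusion (RRF) is one widely used choice, and the sketch below shows it under that assumption. The function name reciprocal_rank_fusion, the constant k=60, and the chunk ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one list, best first.

    Each list contributes 1 / (k + rank) to every id it contains, so ids
    ranked highly by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a stable order
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


bm25_ranking = ["c1", "c2", "c3"]
semantic_ranking = ["c2", "c4", "c1"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and embedding similarities live on different scales.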
Custom Prompts
# Append instructions to default prompt
extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    extra_instructions="All amounts are in USD. Use Grand Total, not subtotals.",
)
# Replace entire system prompt
extractor = Extractor(
    schema=Invoice,
    llm=my_llm,
    searcher=BM25Searcher(),
    system_prompt="You are an expert invoice parser. Extract precisely.",
)
# Full control with custom builder
def my_prompt_builder(schema, field_contexts):
    lines = ["Extract these fields:"]
    for name, field in schema.get_fields().items():
        contexts = field_contexts.get(name, [])
        lines.append(f"\n## {name}: {field.description or ''}")
        if contexts:
            lines.append(f"Found in: {contexts[0].text[:200]}")
    return "\n".join(lines)
extractor = Extractor(..., prompt_builder=my_prompt_builder)
Large Documents
For schemas with many fields or large documents:
extractor = Extractor(
    schema=LargeSchema,
    llm=my_llm,
    searcher=BM25Searcher(),
    max_fields_per_batch=10,      # Split into multiple LLM calls
    max_context_chars=50000,      # Limit context per batch
    include_document_text=False,  # Use only retrieved chunks
    top_k=10,                     # More chunks per field
)
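To see what the batching and context limits imply, the sketch below splits a list of field names into batches and trims retrieved chunks to a character budget. Both helpers are hypothetical illustrations of the two parameters; how Pullcite actually orders fields or truncates context is not described here, so treat the whole-chunk cutoff as an assumption.

```python
def batch_fields(field_names, max_fields_per_batch=10):
    """Split field names into consecutive batches, one batch per LLM call."""
    if max_fields_per_batch < 1:
        raise ValueError("max_fields_per_batch must be at least 1")
    return [
        field_names[start:start + max_fields_per_batch]
        for start in range(0, len(field_names), max_fields_per_batch)
    ]


def trim_context(chunks, max_context_chars=50000):
    """Keep chunks in order until adding the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # assumption: drop whole chunks rather than cut one mid-text
        kept.append(chunk)
        used += len(chunk)
    return kept


names = [f"field_{i}" for i in range(25)]
batches = batch_fields(names, max_fields_per_batch=10)
```

A schema with 25 fields and max_fields_per_batch=10 would therefore need three calls under this scheme, with each call seeing only the context retrieved for its own fields.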
LLM Providers
from pullcite.llms.anthropic import AnthropicLLM
from pullcite.llms.openai import OpenAILLM
llm = AnthropicLLM(model="claude-sonnet-4-20250514") # Uses ANTHROPIC_API_KEY
llm = OpenAILLM(model="gpt-4o") # Uses OPENAI_API_KEY
Environment Variables
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export VOYAGE_API_KEY="..."
Project Structure
pullcite/
├── core/
│ ├── document.py # Document loading
│ ├── chunk.py # Chunk dataclass
│ ├── chunker.py # Chunking strategies
│ ├── evidence.py # Evidence types
│ └── result.py # ExtractionResult
├── schema/
│ ├── base.py # ExtractionSchema, Field
│ ├── fields.py # Field types
│ └── extractor.py # Extractor
├── search/
│ ├── bm25.py # BM25Searcher
│ └── hybrid.py # HybridSearcher
├── embeddings/ # OpenAI, Voyage, local
├── retrieval/ # Memory, Chroma, pgvector
└── llms/ # Anthropic, OpenAI
License
MIT
File details
Details for the file pullcite-0.2.7.tar.gz.
File metadata
- Download URL: pullcite-0.2.7.tar.gz
- Upload date:
- Size: 413.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 336a87e4257c405acb4dde33912fbfee296ed995b0ff18a570d26db41bae881c |
| MD5 | 123f783892cadccea732741592782b52 |
| BLAKE2b-256 | 4e3544f0862d38ef5cbede3c55963d30bead516cde059e06098d219c81009694 |
File details
Details for the file pullcite-0.2.7-py3-none-any.whl.
File metadata
- Download URL: pullcite-0.2.7-py3-none-any.whl
- Upload date:
- Size: 83.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a87fb095d690e49d985ae2cc3d5002febf96af7602595fe886dda1ed33ec585 |
| MD5 | df8fc9dec4fb2a714e2285ac566ff114 |
| BLAKE2b-256 | dc1c494917351da3d1fc63c372d84f91c165e66eec4f5aef37f5dffc524fe095 |