Composable document data extraction: load, preprocess, OCR, LLM parse, store with vector search.
billfox
billfox is a Python library that lets you build document processing pipelines from independent, swappable stages. Each stage implements a simple protocol, so you can mix built-in modules with your own.
Architecture
billfox pipeline
┌─────────┐   ┌──────────────┐   ┌───────────┐   ┌────────┐   ┌───────┐
│ Source  │ → │ Preprocessor │ → │ Extractor │ → │ Parser │ → │ Store │
│         │   │  (optional)  │   │   (OCR)   │   │ (LLM)  │   │       │
└─────────┘   └──────────────┘   └───────────┘   └────────┘   └───────┘
 LocalFile     Resize, YOLO,      MistralOCR      LLMParser    SQLite +
               Chain                              (any LLM)    hybrid
                                                               search
Protocols at every boundary -- implement DocumentSource, Preprocessor, Extractor, Parser[T], Embedder, or DocumentStore[T] to plug in your own components.
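The protocol names above come from billfox, but their method signatures are not shown in this README. As a minimal sketch of the idea, assume a preprocessor exposes a single async `process` method. That method name, the `Document` dataclass, and the `StripTrailingNulls` stage are all hypothetical stand-ins defined locally, so the example runs without billfox installed:

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a loaded document; not billfox's own type."""
    content: bytes
    mime_type: str


@runtime_checkable
class Preprocessor(Protocol):
    """Assumed shape of a preprocessing stage: document in, document out."""

    async def process(self, doc: Document) -> Document: ...


class StripTrailingNulls:
    """Custom stage; it matches Preprocessor by shape, with no inheritance."""

    async def process(self, doc: Document) -> Document:
        return replace(doc, content=doc.content.rstrip(b"\x00"))


async def run_chain(doc: Document, stages: list[Preprocessor]) -> Document:
    # Apply each stage in order, feeding one stage's output into the next.
    for stage in stages:
        doc = await stage.process(doc)
    return doc


stage = StripTrailingNulls()
cleaned = asyncio.run(
    run_chain(Document(b"%PDF-1.7\x00\x00", "application/pdf"), [stage])
)
print(isinstance(stage, Preprocessor), cleaned.content)
```

Because protocols are structural, the custom class never imports or subclasses anything from the library; check the actual protocol definitions in billfox for the real method names before adapting this.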
Installation
# Core only (just types and protocols)
pip install billfox
# With Mistral OCR
pip install 'billfox[mistral]'
# With LLM parsing (pydantic-ai)
pip install 'billfox[llm]'
# With SQLite storage and search
pip install 'billfox[store]'
# With CLI
pip install 'billfox[cli]'
# Everything
pip install 'billfox[all]'
Quick Start
1. OCR Only -- Extract Markdown from a Document
import asyncio

from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor


async def main():
    source = LocalFileSource()
    extractor = MistralExtractor()  # uses MISTRAL_API_KEY env var
    doc = await source.load("invoice.pdf")
    result = await extractor.extract(doc)
    print(result.markdown)


asyncio.run(main())
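Since `load` and `extract` are both awaited, several files can in principle be processed concurrently. The sketch below shows one way to do that with a bounded number of in-flight requests. `FakeSource` and `FakeExtractor` are hypothetical stand-ins that only mimic the two calls used above, so the example runs without an API key; the `max_concurrent` limit is likewise an assumption, not a billfox setting.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeDoc:
    path: str


@dataclass
class FakeResult:
    markdown: str


class FakeSource:
    """Stand-in for LocalFileSource; only mimics the load() call."""

    async def load(self, path: str) -> FakeDoc:
        return FakeDoc(path)


class FakeExtractor:
    """Stand-in for MistralExtractor; only mimics the extract() call."""

    async def extract(self, doc: FakeDoc) -> FakeResult:
        await asyncio.sleep(0)  # pretend to wait on network I/O
        return FakeResult(markdown=f"# OCR of {doc.path}")


async def extract_many(paths, source, extractor, max_concurrent=4):
    # Bound concurrency so a large batch does not flood the OCR backend.
    limit = asyncio.Semaphore(max_concurrent)

    async def one(path):
        async with limit:
            doc = await source.load(path)
            return await extractor.extract(doc)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(one(p) for p in paths))


results = asyncio.run(
    extract_many(["a.pdf", "b.pdf", "c.pdf"], FakeSource(), FakeExtractor(), max_concurrent=2)
)
for result in results:
    print(result.markdown)
```

Swapping in the real source and extractor should only require replacing the two stand-in objects, provided their methods behave as in the example above.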
2. Full Pipeline -- OCR + LLM Parse + Store
import asyncio

from pydantic import BaseModel

from billfox import Pipeline
from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor
from billfox.parse import LLMParser
from billfox.preprocess import ResizePreprocessor
from billfox.store import SQLiteDocumentStore


class Invoice(BaseModel):
    vendor_name: str
    total: float
    date: str


async def main():
    pipeline = Pipeline(
        source=LocalFileSource(),
        extractor=MistralExtractor(),
        parser=LLMParser(
            model="openai:gpt-4.1",
            output_type=Invoice,
            system_prompt="Extract invoice fields from this document.",
        ),
        preprocessors=[ResizePreprocessor(max_side=1024)],
        store=SQLiteDocumentStore(db_path="invoices.db", schema=Invoice),
    )
    invoice = await pipeline.run("scan.jpg", document_id="inv-001")
    print(f"{invoice.vendor_name}: ${invoice.total}")


asyncio.run(main())
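Conceptually, a run like the one above walks the stages in the order of the architecture diagram: load, preprocess, extract, parse, store. The toy orchestrator below illustrates that data flow only. `MiniPipeline` and its callables are hypothetical, and treating storage as optional is an assumption of this sketch; it is not how billfox's `Pipeline` is implemented.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional, Sequence


@dataclass
class MiniPipeline:
    """Toy orchestrator; every field is a hypothetical async callable."""
    load: Callable[[str], Awaitable[bytes]]
    extract: Callable[[bytes], Awaitable[str]]
    parse: Callable[[str], Awaitable[Any]]
    preprocessors: Sequence[Callable[[bytes], Awaitable[bytes]]] = ()
    save: Optional[Callable[[str, Any], Awaitable[None]]] = None

    async def run(self, path: str, document_id: str) -> Any:
        data = await self.load(path)
        for step in self.preprocessors:      # optional stages, in order
            data = await step(data)
        markdown = await self.extract(data)  # OCR stage
        parsed = await self.parse(markdown)  # LLM stage
        if self.save is not None:            # storage assumed optional here
            await self.save(document_id, parsed)
        return parsed


saved = {}  # in-memory stand-in for a document store


async def load(path):
    return b"  ACME Corp;42.50  "


async def strip(data):
    return data.strip()


async def extract(data):
    return data.decode()


async def parse(markdown):
    vendor, total = markdown.split(";")
    return {"vendor_name": vendor, "total": float(total)}


async def save(doc_id, parsed):
    saved[doc_id] = parsed


pipeline = MiniPipeline(load, extract, parse, preprocessors=[strip], save=save)
invoice = asyncio.run(pipeline.run("scan.jpg", document_id="inv-001"))
print(invoice, saved)
```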
3. CLI -- Process from the Terminal
# Extract markdown via OCR
billfox extract receipt.jpg
# Parse into structured JSON
billfox parse receipt.jpg --schema ./models.py:Receipt --model openai:gpt-4.1
# Search stored documents
billfox search "coffee" --db invoices.db
# Configure API keys
billfox config set api_keys.mistral sk-...
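The `--schema ./models.py:Receipt` argument pairs a file path with a class name. How billfox resolves it is not documented here; the helper below is one plausible way to load a class from such a spec using only the standard library. `load_schema` is a hypothetical name, and the demo writes a throwaway `models.py` with a plain class so it runs on its own.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_schema(spec: str) -> type:
    """Resolve 'path/to/file.py:ClassName' to the class object (hypothetical helper)."""
    # Split on the last colon so drive letters such as C:\ stay in the path.
    file_part, sep, class_name = spec.rpartition(":")
    if not sep or not file_part or not class_name:
        raise ValueError(f"expected 'file.py:ClassName', got {spec!r}")
    path = Path(file_part).resolve()
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, class_name)


# Demo with a temporary models.py so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "models.py"
    models.write_text("class Receipt:\n    fields = ('merchant', 'total')\n")
    Receipt = load_schema(f"{models}:Receipt")

print(Receipt.__name__, Receipt.fields)
```

Note that loading a schema this way executes the target file, so it should only be pointed at code you trust.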
Optional Extras
| Extra | Packages | Use case |
|---|---|---|
| mistral | mistralai | Mistral OCR extraction |
| yolo | onnxruntime, numpy, Pillow | YOLO document cropping |
| llm | pydantic-ai | LLM structured parsing |
| openai | openai | OpenAI text embeddings |
| store | sqlalchemy, aiosqlite, sqlite-vec | SQLite storage + search |
| cli | typer, rich, tomli-w | Command-line interface |
| all | All of the above | Everything |
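When a stage fails at import time, it can help to check which extras are actually present in the current environment. The sketch below does that with `importlib.util.find_spec`. The mapping from extras to import names is an assumption based on each package's usual module name (for example, Pillow imports as `PIL`), not something billfox provides.

```python
from importlib.util import find_spec

# Import names for the packages listed in the table above.
# The mapping is an assumption based on each package's usual import name.
EXTRA_MODULES = {
    "mistral": ["mistralai"],
    "yolo": ["onnxruntime", "numpy", "PIL"],  # Pillow imports as PIL
    "llm": ["pydantic_ai"],
    "openai": ["openai"],
    "store": ["sqlalchemy", "aiosqlite", "sqlite_vec"],
    "cli": ["typer", "rich", "tomli_w"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules for `extra` that cannot be found in this environment."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


def extras_status() -> dict[str, bool]:
    """Map each extra to True when all of its modules are importable."""
    return {extra: not missing_modules(extra) for extra in EXTRA_MODULES}


status = extras_status()
for extra, ok in status.items():
    detail = "ok" if ok else "missing: " + ", ".join(missing_modules(extra))
    print(f"{extra:8} {detail}")
```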
Documentation
Full documentation is available in docs/.
Contributing
See CONTRIBUTING.md for development setup, running tests, and submitting pull requests.
License
MIT -- see LICENSE for details.