Composable document data extraction: load, preprocess, OCR, LLM parse, store with vector search.
billfox
AI-agent-friendly receipt management -- extract, parse, store, and search receipts and invoices with composable pipelines built for LLM workflows.
billfox is a Python library for building document processing pipelines from independent, swappable stages. Load a document, preprocess it, OCR the text, parse it with any LLM into structured data, and store it with hybrid vector search -- each stage implements a simple protocol, so you can mix built-in modules with your own.
How It Works
graph LR
A["Source<br/><small>LocalFileSource</small>"] --> B["Preprocessor<br/><small>Resize, YOLO, Chain</small>"]
B --> C["Extractor<br/><small>Mistral OCR, Docling</small>"]
C --> D["Parser<br/><small>LLMParser (any LLM)</small>"]
D --> E["Store<br/><small>SQLite + hybrid search</small>"]
style A fill:#4a90d9,stroke:#357abd,color:#fff
style B fill:#6c757d,stroke:#565e64,color:#fff
style C fill:#e67e22,stroke:#cf6d17,color:#fff
style D fill:#27ae60,stroke:#1e8449,color:#fff
style E fill:#8e44ad,stroke:#763895,color:#fff
Every boundary is a protocol -- implement DocumentSource, Preprocessor, Extractor, Parser[T], Embedder, or DocumentStore[T] to plug in your own components.
| Stage | Protocol | Built-in |
|---|---|---|
| Source | DocumentSource | LocalFileSource |
| Preprocessor | Preprocessor | ResizePreprocessor, YOLOPreprocessor, PreprocessorChain |
| Extractor | Extractor | MistralExtractor, DoclingExtractor |
| Parser | Parser[T] | LLMParser[T] |
| Embedder | Embedder | OpenAIEmbedder |
| Store | DocumentStore[T] | SQLiteDocumentStore[T] |
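To illustrate what "implement the protocol" means in practice, here is a minimal, self-contained sketch of structural typing with typing.Protocol. The names TextExtractor, UppercaseExtractor, and run_stage are hypothetical and not part of billfox; the real protocols have their own method signatures, shown under Extending billfox.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class TextExtractor(Protocol):
    """Hypothetical stage protocol: anything with an async extract() method."""

    async def extract(self, content: bytes) -> str: ...


class UppercaseExtractor:
    """Satisfies TextExtractor structurally; no inheritance is needed."""

    async def extract(self, content: bytes) -> str:
        return content.decode("utf-8").upper()


async def run_stage(extractor: TextExtractor, content: bytes) -> str:
    # isinstance() works because the protocol is runtime_checkable. It only
    # checks that the method exists, not that its signature matches.
    if not isinstance(extractor, TextExtractor):
        raise TypeError("extractor does not implement extract()")
    return await extractor.extract(content)


result = asyncio.run(run_stage(UppercaseExtractor(), b"total: 12.50"))
print(result)
```

Because the check is structural, a component written in another package can be dropped into a stage without importing or subclassing anything from the library that defines the protocol.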
Installation
pip install billfox # Core only (types and protocols)
pip install 'billfox[mistral]' # + Mistral OCR
pip install 'billfox[llm]' # + LLM parsing (pydantic-ai)
pip install 'billfox[store]' # + SQLite storage and search
pip install 'billfox[all]' # Everything
All available extras
| Extra | Packages | Use case |
|---|---|---|
| mistral | mistralai | Mistral OCR extraction |
| yolo | onnxruntime, numpy, Pillow | YOLO document cropping |
| llm | pydantic-ai | LLM structured parsing |
| openai | openai | OpenAI text embeddings |
| anthropic | anthropic | Anthropic LLM support |
| store | sqlalchemy, aiosqlite, sqlite-vec | SQLite storage + search |
| google-drive | google-api-python-client, google-auth | Google Drive backup |
| cli | typer, rich, tomli-w | Command-line interface |
| all | All of the above | Everything |
Quick Start
Extract Markdown from a Document (OCR only)
import asyncio

from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor


async def main():
    source = LocalFileSource()
    extractor = MistralExtractor()  # uses MISTRAL_API_KEY env var
    doc = await source.load("invoice.pdf")
    result = await extractor.extract(doc)
    print(result.markdown)


asyncio.run(main())
Full Pipeline -- OCR + LLM Parse + Store
import asyncio

from pydantic import BaseModel

from billfox import Pipeline
from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor
from billfox.parse import LLMParser
from billfox.preprocess import ResizePreprocessor
from billfox.store import SQLiteDocumentStore


class Invoice(BaseModel):
    vendor_name: str
    total: float
    date: str


async def main():
    pipeline = Pipeline(
        source=LocalFileSource(),
        extractor=MistralExtractor(),
        parser=LLMParser(
            model="openai:gpt-4.1",
            output_type=Invoice,
            system_prompt="Extract invoice fields from this document.",
        ),
        preprocessors=[ResizePreprocessor(max_side=1024)],
        store=SQLiteDocumentStore(db_path="invoices.db", schema=Invoice),
    )
    invoice = await pipeline.run("scan.jpg", document_id="inv-001")
    print(f"{invoice.vendor_name}: ${invoice.total}")


asyncio.run(main())
CLI
# Configure API keys
billfox config set api_keys.mistral sk-...
# Extract markdown via OCR
billfox extract receipt.jpg
# Parse into structured JSON
billfox parse receipt.jpg --schema ./models.py:Receipt --model openai:gpt-4.1
# Search stored documents
billfox search "coffee" --db invoices.db
Extending billfox
Every stage is a Python protocol. Implement the method, pass it to Pipeline, done.
Custom Extractor
from billfox._types import Document, ExtractionResult, Page
from billfox.extract import Extractor


class MyExtractor:
    async def extract(self, document: Document) -> ExtractionResult:
        # call_my_ocr_service is a placeholder for your own OCR client
        text = await call_my_ocr_service(document.content)
        return ExtractionResult(
            markdown=text,
            pages=[Page(index=0, markdown=text)],
            metadata={},
        )
Custom Preprocessor
from billfox._types import Document
from billfox.preprocess import Preprocessor


class GrayscalePreprocessor:
    async def process(self, document: Document) -> Document:
        if not document.mime_type.startswith("image/"):
            return document  # pass through non-images
        # convert_to_grayscale is a placeholder for your own image routine
        gray_bytes = convert_to_grayscale(document.content)
        # Document is frozen, so return a new instance instead of mutating
        return Document(
            content=gray_bytes,
            mime_type=document.mime_type,
            source_uri=document.source_uri,
            metadata={**document.metadata, "preprocessor": "grayscale"},
        )
Custom Store
from typing import Generic, TypeVar

from billfox import SearchResult
from billfox.store import DocumentStore

T = TypeVar("T")  # the parsed data type, e.g. your Invoice model


class MyStore(Generic[T]):
    async def save(self, document_id: str, data: T) -> None: ...
    async def get(self, document_id: str) -> T | None: ...
    async def search(self, query: str, *, limit: int = 20) -> list[SearchResult]: ...
    async def delete(self, document_id: str) -> None: ...
See the full documentation for more examples.
Core Types
All core types are frozen dataclasses (immutable after creation):
Document(content=b"...", mime_type="image/jpeg", source_uri="receipt.jpg", metadata={})
ExtractionResult(markdown="...", pages=[Page(index=0, markdown="...")], metadata={})
SearchResult(document_id="inv-001", data={...}, score=0.95, signals={...})
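To show what "frozen" implies for code that needs a modified copy, here is a small sketch using only the standard library. Doc is a hypothetical stand-in, not the billfox Document class, and its fields are chosen for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a frozen core type."""

    content: bytes
    mime_type: str
    metadata: dict[str, str] = field(default_factory=dict)


original = Doc(content=b"...", mime_type="image/jpeg")

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    original.mime_type = "image/png"  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False

# dataclasses.replace() builds a new instance with selected fields changed.
updated = replace(original, mime_type="image/png")
print(mutated, original.mime_type, updated.mime_type)
```

Note that frozen only blocks attribute reassignment. A mutable field such as a metadata dict can still be changed in place, so building a new dict (as the Custom Preprocessor example does) is the safer habit.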
Development
Prerequisites
- Python 3.11+
- Git
Setup
git clone https://github.com/billfox-ai/billfox.git
cd billfox
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
The dev extra installs all optional dependencies plus pytest, mypy, ruff, and coverage.
Commands
make test # Run tests
make lint # Lint with ruff
make format # Auto-format with ruff
make typecheck # Type check with mypy (strict)
Project Structure
src/billfox/
    __init__.py    # Re-exports: Pipeline, Document, ExtractionResult, SearchResult
    _types.py      # Core frozen dataclasses
    _version.py    # Version string
    pipeline.py    # Pipeline compositor
    source/        # Document loading (LocalFileSource)
    preprocess/    # Image preprocessing (resize, YOLO, chain)
    extract/       # OCR / text extraction (Mistral, Docling)
    parse/         # LLM structured parsing
    embed/         # Text embeddings (OpenAI)
    store/         # SQLite storage + hybrid search (BM25 + vector + RRF)
    backup/        # Document backup (local, Google Drive)
    models/        # Pre-built Pydantic models (Receipt)
    cli/           # Typer CLI application
tests/             # pytest suite (26 test files)
docs/              # mkdocs-material documentation
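The store/ entry above mentions hybrid search that combines BM25 and vector results with RRF (reciprocal rank fusion). The sketch below shows the general RRF technique only; it is not taken from billfox. The function name and the constant k=60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each list is ordered best-first."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["inv-001", "inv-003", "inv-002"]    # keyword matches
vector_ranking = ["inv-003", "inv-001", "inv-004"]  # embedding matches
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

RRF uses only rank positions, so BM25 scores and vector distances never need to be put on a common scale before they are combined.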
Code Style
- Formatter/linter: ruff (120 char line length)
- Type checker: mypy in strict mode
- Type annotations on all public functions
- Google-style docstrings on public classes/functions
- from __future__ import annotations in all source files (except CLI modules -- typer requires runtime annotations)
- Protocols live in _base.py files with @runtime_checkable
- Lazy imports for optional dependencies with clear ImportError messages
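The lazy-import convention in the list above can be sketched as follows. This is one possible shape, not the helper billfox actually uses: the require function and the wording of its error message are assumptions for illustration.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, or explain how to get it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable message that names the extra to install.
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install 'billfox[{extra}]'"
        ) from exc


# A module that is present imports normally (json is used as a stand-in).
json_module = require("json", "store")
print(json_module.dumps({"ok": True}))

# A missing module produces the install hint instead of a bare ImportError.
try:
    require("definitely_not_installed_pkg_xyz", "mistral")
    message = ""
except ImportError as exc:
    message = str(exc)
print(message)
```

Calling such a helper inside the method that needs the dependency, rather than at module top level, keeps a core-only install importable.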
Adding a New Module
- Create a _base.py protocol if introducing a new stage
- Implement the protocol in a new file
- Re-export in the subpackage __init__.py
- Add optional dependencies to pyproject.toml under a new extra
- Write tests with mocked external dependencies
- Add a documentation page under docs/
Contributing
See CONTRIBUTING.md for the full guide. The short version:
- Fork and create a feature branch from main
- Implement with tests
- Run make lint && make typecheck && make test
- Submit a PR
License
MIT -- see LICENSE for details.