Composable document data extraction: load, preprocess, OCR, LLM parse, store with vector search.

billfox

Requires Python 3.11+. License: MIT.

billfox is a Python library that lets you build document processing pipelines from independent, swappable stages. Each stage implements a simple protocol, so you can mix built-in modules with your own.

Architecture

                          billfox pipeline
 ┌──────────┐  ┌──────────────┐  ┌───────────┐  ┌────────┐  ┌───────┐
 │  Source  │→ │ Preprocessor │→ │ Extractor │→ │ Parser │→ │ Store │
 │          │  │  (optional)  │  │   (OCR)   │  │ (LLM)  │  │       │
 └──────────┘  └──────────────┘  └───────────┘  └────────┘  └───────┘
  LocalFile    Resize, YOLO,     MistralOCR     LLMParser   SQLite +
               Chain                            (any LLM)   hybrid
                                                            search

Protocols at every boundary -- implement DocumentSource, Preprocessor, Extractor, Parser[T], Embedder, or DocumentStore[T] to plug in your own components.
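
As an illustration of what "implement a protocol" can look like, here is a minimal sketch using typing.Protocol. The Document dataclass, the process method name, and its signature are assumptions made for this example; billfox's actual protocol definitions may differ, so check the package source before copying the shape.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for the document type passed between stages."""
    content: bytes
    mime_type: str


@runtime_checkable
class Preprocessor(Protocol):
    """Hypothetical protocol shape; the real method name may differ."""

    async def process(self, document: Document) -> Document: ...


class StripTrailingNulls:
    """Custom stage: no base class is needed, only a matching method."""

    async def process(self, document: Document) -> Document:
        return replace(document, content=document.content.rstrip(b"\x00"))


async def demo() -> Document:
    stage = StripTrailingNulls()
    # runtime_checkable only checks that the method exists, not its signature.
    assert isinstance(stage, Preprocessor)
    return await stage.process(Document(b"%PDF-1.7\x00\x00", "application/pdf"))


cleaned = asyncio.run(demo())
print(cleaned.content)
```

Because the check is structural, a class from another library can serve as a stage as long as it exposes the expected method.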

Installation

# Core only (just types and protocols)
pip install billfox

# With Mistral OCR
pip install 'billfox[mistral]'

# With LLM parsing (pydantic-ai)
pip install 'billfox[llm]'

# With SQLite storage and search
pip install 'billfox[store]'

# With CLI
pip install 'billfox[cli]'

# Everything
pip install 'billfox[all]'
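
To see which extras are usable in the current environment, one option is to probe for a package that each extra is expected to pull in. This is a sketch, not a billfox feature: the import names below are assumed from the package names listed under Optional Extras, and the helper available_extras is hypothetical.

```python
from importlib.util import find_spec

# Extra name -> one import name the extra is assumed to provide.
EXTRA_MODULES = {
    "mistral": "mistralai",
    "llm": "pydantic_ai",
    "store": "sqlalchemy",
    "cli": "typer",
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether its probe module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


for extra, ok in available_extras().items():
    print(f"billfox[{extra}]: {'installed' if ok else 'missing'}")
```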

Quick Start

1. OCR Only -- Extract Markdown from a Document

import asyncio
from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor

async def main():
    source = LocalFileSource()
    extractor = MistralExtractor()  # uses MISTRAL_API_KEY env var

    doc = await source.load("invoice.pdf")
    result = await extractor.extract(doc)
    print(result.markdown)

asyncio.run(main())
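
Since the extractor above reads MISTRAL_API_KEY from the environment, it can help to fail fast with a clear message before any file is loaded. A small sketch, assuming the environment variable is the only place your setup stores the key; require_env is a hypothetical helper, not part of billfox.

```python
import os


def require_env(name: str) -> str:
    """Return the variable's value, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running OCR")
    return value


try:
    require_env("MISTRAL_API_KEY")
except RuntimeError as exc:
    print(exc)
```

Calling require_env("MISTRAL_API_KEY") at the top of main() would surface a missing key before the OCR request is attempted.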

2. Full Pipeline -- OCR + LLM Parse + Store

import asyncio
from pydantic import BaseModel
from billfox import Pipeline
from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor
from billfox.parse import LLMParser
from billfox.preprocess import ResizePreprocessor
from billfox.store import SQLiteDocumentStore

class Invoice(BaseModel):
    vendor_name: str
    total: float
    date: str

async def main():
    pipeline = Pipeline(
        source=LocalFileSource(),
        extractor=MistralExtractor(),
        parser=LLMParser(
            model="openai:gpt-4.1",
            output_type=Invoice,
            system_prompt="Extract invoice fields from this document.",
        ),
        preprocessors=[ResizePreprocessor(max_side=1024)],
        store=SQLiteDocumentStore(db_path="invoices.db", schema=Invoice),
    )

    invoice = await pipeline.run("scan.jpg", document_id="inv-001")
    print(f"{invoice.vendor_name}: ${invoice.total}")

asyncio.run(main())
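
Conceptually, the run call above moves one document through the stages in the order shown in the architecture diagram. The toy runner below sketches that flow with plain async functions; it is not billfox's implementation, and run_stages and every stub name in it are illustrative.

```python
import asyncio

trace: list[str] = []   # records the order in which stages run
saved: dict[str, dict] = {}


async def load(path: str) -> str:
    trace.append("source")
    return f"bytes of {path}"


async def shrink(doc: str) -> str:
    trace.append("preprocess")
    return doc


async def ocr(doc: str) -> str:
    trace.append("extract")
    return f"# markdown from {doc}"


async def parse(markdown: str) -> dict:
    trace.append("parse")
    return {"chars": len(markdown)}


async def save(document_id: str, parsed: dict) -> None:
    trace.append("store")
    saved[document_id] = parsed


async def run_stages(path, source, preprocessors, extractor, parser,
                     store=None, document_id=None):
    """Load, preprocess, extract, parse, then optionally store."""
    doc = await source(path)
    for pre in preprocessors:
        doc = await pre(doc)
    parsed = await parser(await extractor(doc))
    if store is not None:
        await store(document_id, parsed)
    return parsed


result = asyncio.run(
    run_stages("scan.jpg", load, [shrink], ocr, parse, save, "inv-001")
)
print(" -> ".join(trace))
```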

3. CLI -- Process from the Terminal

# Extract markdown via OCR
billfox extract receipt.jpg

# Parse into structured JSON
billfox parse receipt.jpg --schema ./models.py:Receipt --model openai:gpt-4.1

# Search stored documents
billfox search "coffee" --db invoices.db

# Configure API keys
billfox config set api_keys.mistral sk-...
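
The --schema argument above uses a file.py:ClassName form. One way such a reference can be resolved is with importlib, sketched below; this is an assumption about the general technique, not billfox's actual loader. load_object and the throwaway Receipt class are hypothetical, and a real schema would presumably be a pydantic model.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_object(spec: str):
    """Resolve 'path/to/file.py:Name' to the object called Name in that file."""
    file_part, _, attr = spec.rpartition(":")
    if not file_part or not attr:
        raise ValueError(f"expected 'file.py:Name', got {spec!r}")
    path = Path(file_part).resolve()
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, attr)


# Demo against a temporary models.py so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "models.py"
    models.write_text("class Receipt:\n    fields = ('merchant', 'total')\n")
    Receipt = load_object(f"{models}:Receipt")
    print(Receipt.__name__, Receipt.fields)
```

Note that loading a schema this way executes the target file, so it should only be pointed at code you trust.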

Optional Extras

Extra    Packages                           Use case
-------  ---------------------------------  -----------------------
mistral  mistralai                          Mistral OCR extraction
yolo     onnxruntime, numpy, Pillow         YOLO document cropping
llm      pydantic-ai                        LLM structured parsing
openai   openai                             OpenAI text embeddings
store    sqlalchemy, aiosqlite, sqlite-vec  SQLite storage + search
cli      typer, rich, tomli-w               Command-line interface
all      All of the above                   Everything

Documentation

Full documentation is available in the docs/ directory.

Contributing

See CONTRIBUTING.md for development setup, running tests, and submitting pull requests.

License

MIT -- see LICENSE for details.
