AI-powered tool for separating multi-statement PDF files using LangChain and LangGraph

Bank Statement Separator

An AI-powered tool that automatically processes PDF files containing multiple bank statements and separates them into individual files. Built with LangChain and LangGraph for robust stateful AI processing.

🚀 Features

AI-Powered Analysis: Uses advanced language models to detect statement boundaries
Multiple LLM Support: Compatible with OpenAI GPT models and Ollama local models
PDF Processing: Efficient document manipulation using PyMuPDF
Metadata Extraction: Automatically extracts account numbers, dates, and bank information
File Organization: Generates meaningful filenames following configurable patterns
Error Handling: Comprehensive logging and audit trails
Error Detection & Tagging: Automatic identification and tagging of processing issues (v0.3.0+)
Security Controls: Built-in safeguards for production use
Paperless Integration: Optional integration with Paperless-ngx for document management
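
As a rough illustration of the File Organization feature, the sketch below fills a naming pattern from extracted metadata. The `StatementMetadata` class, the `build_filename` function, and the default pattern are hypothetical names chosen for this example; they are not taken from the project's code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StatementMetadata:
    """Hypothetical container for the fields the feature list says are extracted."""
    bank: str
    account_number: str
    statement_date: str  # ISO date string, e.g. "2025-08-31"


def build_filename(meta: StatementMetadata,
                   pattern: str = "{bank}-{account_last4}-{date}.pdf") -> str:
    """Fill a naming pattern from metadata and replace unsafe characters."""
    fields = {
        "bank": meta.bank.strip().lower().replace(" ", "_"),
        # Keep only the last four digits so full account numbers stay out of filenames.
        "account_last4": re.sub(r"\D", "", meta.account_number)[-4:],
        "date": meta.statement_date,
    }
    name = pattern.format(**fields)
    # Replace anything outside a conservative set of filename characters.
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)
```

Under these assumptions, a statement from "Example Bank" for account 1234-5678-9012 dated 2025-08-31 would be named example_bank-9012-2025-08-31.pdf.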

📋 Requirements

Python 3.11+
OpenAI API key (for LLM functionality)
UV package manager
Optional: Paperless-ngx instance (for document management and error tagging)

🛠 Installation

1. Clone the Repository

git clone https://github.com/madeinoz67/bank-statement-separator.git
cd bank-statement-separator

2. Install Dependencies

# Install with UV
uv sync

# Install with dev dependencies
uv sync --group dev

3. Configure Environment

Copy the example environment file and configure your settings:

cp .env.example .env

Edit .env to set your OpenAI API key:

OPENAI_API_KEY=your_api_key_here

📖 Usage

Basic Usage

# Process a single PDF file
uv run python -m src.bank_statement_separator.main input.pdf

# Specify output directory
uv run python -m src.bank_statement_separator.main input.pdf -o ./output

# Use verbose logging
uv run python -m src.bank_statement_separator.main input.pdf --verbose

# Dry run mode (no files written)
uv run python -m src.bank_statement_separator.main input.pdf --dry-run

Advanced Options

# Specify LLM model
uv run python -m src.bank_statement_separator.main input.pdf --model gpt-4o

# Set custom processing limits
uv run python -m src.bank_statement_separator.main input.pdf --max-pages 50

# Enable debug mode
uv run python -m src.bank_statement_separator.main input.pdf --debug

Configuration

The application uses environment variables for configuration. Key settings include:

OPENAI_API_KEY: Your OpenAI API key
OLLAMA_BASE_URL: Ollama server URL (for local models)
LOG_LEVEL: Logging verbosity (DEBUG, INFO, WARNING, ERROR)
MAX_PAGES_PER_STATEMENT: Maximum number of pages allowed per statement (processing limit)
OUTPUT_DIR: Default output directory
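
The settings above could be read into a typed object along the following lines. This is a minimal sketch using only the standard library: the `Settings` class, the `load_settings` function, and every default value shown are assumptions made for illustration, not the project's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object mirroring the variables listed above."""
    openai_api_key: Optional[str]
    ollama_base_url: Optional[str]
    log_level: str
    max_pages_per_statement: int
    output_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables; the defaults are assumed."""
    level = env.get("LOG_LEVEL", "INFO").upper()
    if level not in {"DEBUG", "INFO", "WARNING", "ERROR"}:
        raise ValueError(f"Unsupported LOG_LEVEL: {level}")
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        ollama_base_url=env.get("OLLAMA_BASE_URL") or None,
        log_level=level,
        # Assumed default; the README does not state the real one.
        max_pages_per_statement=int(env.get("MAX_PAGES_PER_STATEMENT", "50")),
        output_dir=env.get("OUTPUT_DIR", "./output"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.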

Error Detection Configuration (v0.3.0+)

# Enable error detection and tagging
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS="error:processing,needs:review"
PAPERLESS_ERROR_SEVERITY_LEVELS="medium,high,critical"

See the Configuration Guide for complete details.

🏗 Architecture

The system consists of several key components:

Workflow Engine: LangGraph-based state machine for processing steps
LLM Analyzer: AI-powered boundary detection and metadata extraction
PDF Processor: Document manipulation and text extraction
Error Handler: Comprehensive error management and recovery
Error Detection System: Automatic identification and tagging of processing issues (v0.3.0+)
Rate Limiter: API usage controls and backoff mechanisms
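
To make the Rate Limiter's "backoff mechanisms" concrete, here is a generic sketch of retrying a call with exponential backoff and jitter. The function name `call_with_backoff` and its parameters are hypothetical; the project's own rate limiter may work differently.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], *, max_attempts: int = 5,
                      base_delay: float = 1.0, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn with exponential backoff and jitter; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # a real client would catch only transient API errors
            if attempt == max_attempts - 1:
                raise
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so tests can record the delays instead of actually waiting.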

Processing Pipeline

PDF Ingestion: Load and validate input documents
Document Analysis: Extract text and structural information
Statement Detection: AI boundary detection using LLM analysis
Metadata Extraction: Account and period information extraction
PDF Generation: Create individual statement files
File Organization: Apply naming conventions and organization
Paperless Upload: Optional document management integration
Error Detection: Automatic identification and tagging of processing issues (v0.3.0+)
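
The pipeline above passes shared state from one step to the next. The sketch below shows that idea in plain Python rather than with LangGraph, so it runs without extra dependencies. `PipelineState`, `run_pipeline`, and the two placeholder steps are hypothetical and only cover the first two stages.

```python
from typing import Callable, List, Tuple, TypedDict


class PipelineState(TypedDict, total=False):
    """Hypothetical shared state; the field names are illustrative only."""
    input_path: str
    page_texts: List[str]
    completed: List[str]
    errors: List[str]


Step = Callable[[PipelineState], PipelineState]


def run_pipeline(state: PipelineState, steps: List[Tuple[str, Step]]) -> PipelineState:
    """Run named steps in order, recording progress and stopping at the first failure."""
    for name, step in steps:
        try:
            state = step(state)
        except Exception as exc:  # a real workflow would catch narrower error types
            state.setdefault("errors", []).append(f"{name}: {exc}")
            break
        state.setdefault("completed", []).append(name)
    return state


def ingest(state: PipelineState) -> PipelineState:
    """PDF Ingestion: validate the input (here, only the file extension)."""
    if not state["input_path"].lower().endswith(".pdf"):
        raise ValueError("not a PDF")
    return state


def analyze(state: PipelineState) -> PipelineState:
    """Document Analysis: placeholder that returns canned page text."""
    return {**state, "page_texts": ["Page 1 of 2", "Page 2 of 2"]}
```

Calling `run_pipeline({"input_path": "in.pdf"}, [("ingest", ingest), ("analyze", analyze)])` returns a state whose `completed` list names both steps; a non-PDF path stops at `ingest` and records the failure under `errors`.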

🧪 Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src

# Run specific test categories
uv run pytest tests/unit/
uv run pytest tests/integration/

# Test error detection and tagging (v0.3.0+)
uv run pytest tests/unit/test_error_tagging*.py -v
uv run python tests/manual/test_error_tagging_e2e.py

🤝 Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

Fork the repository
Create a feature branch: git checkout -b feature/your-feature
Install development dependencies: uv sync --group dev
Make your changes
Run tests: uv run pytest
Format code: uv run ruff format .
Check linting: uv run ruff check .
Commit your changes with a descriptive message
Push to your fork and create a pull request

Code Quality

Follow PEP 8 style guidelines
Use type hints for all function parameters and return values
Write comprehensive docstrings for public APIs
Add tests for new features and bug fixes
Keep functions focused and small
Use descriptive variable and function names

Pull Request Process

Ensure all tests pass
Update documentation if needed
Add appropriate commit trailers (see below)
Request review from maintainers

Commit Guidelines

For commits fixing bugs or adding features based on user reports:

git commit --trailer "Reported-by:<name>"

For commits related to a GitHub issue:

git commit --trailer "Github-Issue:#<number>"

📚 Documentation

📖 Read the full documentation online

Complete documentation is available in the docs/ directory.

Build and serve the documentation locally:

uv run mkdocs serve

🔍 Error Detection & Tagging (v0.3.0+)

The system includes comprehensive error detection that automatically identifies processing issues and applies configurable tags for manual review:

Error Categories

LLM Analysis Failures: API errors, model failures, fallback usage
Boundary Detection Issues: Low confidence boundaries, suspicious patterns
PDF Processing Errors: File corruption, access issues, format problems
Metadata Extraction Failures: Missing account data, date parsing issues
Validation Failures: Content validation, integrity checks
File Output Issues: Write failures, permissions, disk space

Error Detection Setup

# Basic error detection setup
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS="error:processing,needs:review"
PAPERLESS_ERROR_SEVERITY_LEVELS="medium,high,critical"

# Advanced configuration
PAPERLESS_ERROR_TAG_THRESHOLD=0.7
PAPERLESS_ERROR_BATCH_TAGGING=true
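
One plausible reading of these settings is that an error is tagged only when detection is enabled and the error's severity appears in PAPERLESS_ERROR_SEVERITY_LEVELS. The sketch below implements that reading. The functions `parse_csv` and `tags_for_error` are hypothetical, the semantics are an assumption, and PAPERLESS_ERROR_TAG_THRESHOLD and PAPERLESS_ERROR_BATCH_TAGGING are deliberately not modeled because their meaning is not described here.

```python
from typing import List, Mapping


def parse_csv(value: str) -> List[str]:
    """Split a comma-separated setting, tolerating surrounding quotes and spaces."""
    return [item.strip() for item in value.strip('"').split(",") if item.strip()]


def tags_for_error(severity: str, env: Mapping[str, str]) -> List[str]:
    """Return the tags to apply for an error of the given severity (assumed semantics)."""
    if env.get("PAPERLESS_ERROR_DETECTION_ENABLED", "false").lower() != "true":
        return []
    levels = set(parse_csv(env.get("PAPERLESS_ERROR_SEVERITY_LEVELS", "")))
    if severity.lower() not in levels:
        return []
    return parse_csv(env.get("PAPERLESS_ERROR_TAGS", ""))
```

With the basic setup shown above, a "high" severity error would receive both configured tags, while a "low" severity error would receive none.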

See the Paperless Integration Guide for complete configuration details.

📦 Dependencies

Core Dependencies

langchain: LLM integration framework
langgraph: Stateful workflow orchestration
pymupdf: PDF processing
pydantic: Data validation
rich: Terminal formatting
python-dotenv: Environment management

Development Dependencies

pytest: Testing framework
ruff: Code formatting and linting
pyright: Type checking
mkdocs: Documentation generation

See pyproject.toml for the complete dependency list.

🔒 Security

API keys are managed through environment variables
Input validation on all user-provided data
Rate limiting for external API calls
Comprehensive logging for audit trails
No sensitive data stored in application logs
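
As one illustration of keeping sensitive data out of logs, a standard-library logging filter can mask long digit runs before a record is emitted. This is a sketch under the assumption that account numbers appear as runs of six or more digits; the class name `RedactAccountNumbers` is hypothetical and does not describe how the project itself sanitizes logs.

```python
import logging
import re

# Assumption: account numbers show up as runs of six or more digits.
_DIGIT_RUN = re.compile(r"\b\d{6,}\b")


def _mask(match: re.Match) -> str:
    """Replace all but the last four digits with asterisks."""
    digits = match.group()
    return "*" * (len(digits) - 4) + digits[-4:]


class RedactAccountNumbers(logging.Filter):
    """Mask long digit runs in the fully formatted log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so digits passed as arguments are masked too.
        record.msg = _DIGIT_RUN.sub(_mask, record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it
```

The filter can be attached with `handler.addFilter(RedactAccountNumbers())` on any logging handler.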

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with LangChain and LangGraph
PDF processing powered by PyMuPDF
Inspired by the need for automated document processing in financial workflows

🐛 Issues & Support

Report bugs via GitHub Issues
Check the Troubleshooting Guide for common issues
Review the Known Issues list for current limitations

Note: AI functionality requires an OpenAI API key. If the LLM is unavailable, the tool falls back to pattern matching.
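
To give a sense of what a pattern-matching fallback might look like, the sketch below treats any page containing "Page 1 of N" as the start of a new statement. The marker pattern and the functions `detect_statement_starts` and `split_ranges` are assumptions for illustration; the project's actual fallback rules are not described here.

```python
import re
from typing import List, Tuple

# Assumed marker: many statements print "Page 1 of N" on their first page.
_FIRST_PAGE = re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE)


def detect_statement_starts(page_texts: List[str]) -> List[int]:
    """Return zero-based indexes of pages that look like the first page of a statement."""
    starts = [i for i, text in enumerate(page_texts) if _FIRST_PAGE.search(text)]
    # Always start a statement at page 0 so that no leading pages are dropped.
    if page_texts and (not starts or starts[0] != 0):
        starts.insert(0, 0)
    return starts


def split_ranges(starts: List[int], page_count: int) -> List[Tuple[int, int]]:
    """Turn start indexes into (start, end_exclusive) page ranges."""
    ends = starts[1:] + [page_count]
    return list(zip(starts, ends))
```

Each resulting range can then be written out as its own PDF by whatever PDF library is in use.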

Release history

0.4.0: Sep 11, 2025
0.3.1: Sep 11, 2025 (this version)
0.3.0: Sep 10, 2025
0.2.0: Sep 8, 2025
0.1.5: Sep 8, 2025

Download files

Source Distribution

bank_statement_separator-0.3.1.tar.gz (83.2 kB)

Uploaded Sep 11, 2025 Source

Built Distribution

bank_statement_separator-0.3.1-py3-none-any.whl (88.4 kB)

Uploaded Sep 11, 2025 Python 3

File details

Details for the file bank_statement_separator-0.3.1.tar.gz.

File metadata

Download URL: bank_statement_separator-0.3.1.tar.gz
Upload date: Sep 11, 2025
Size: 83.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for bank_statement_separator-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`92a5e403f6f2ddbb5319b18b9a337351ce415b71d4724f181903135dac717ce4`
MD5	`914f3a3a5d4547f645ece3e5795abd41`
BLAKE2b-256	`08a261390807ff62e6be80d84eb0a0972d71c657be93af37c43ff7fd5fa96985`

File details

Details for the file bank_statement_separator-0.3.1-py3-none-any.whl.

File metadata

Download URL: bank_statement_separator-0.3.1-py3-none-any.whl
Upload date: Sep 11, 2025
Size: 88.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for bank_statement_separator-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ffd7aa71bac5b64bbbe12d2f958402b5d7ed234b708a55827bbd9487bdb60fa`
MD5	`b2cff8ae8fd8ef6b33d86bd1bc21c59d`
BLAKE2b-256	`2a174433380b6da1210fbb649249051ff372c866edfd1c2757971f6e11500c94`
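
The SHA256 digests listed above can be checked against a downloaded file with Python's standard hashlib module. The helper names `sha256_of` and `verify_download` are hypothetical; this is a minimal sketch, not an official verification tool.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Union[str, Path], expected_hex: str) -> bool:
    """Return True if the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify_download("bank_statement_separator-0.3.1.tar.gz", "92a5e403f6f2ddbb5319b18b9a337351ce415b71d4724f181903135dac717ce4")` should return True for an intact copy of the source distribution.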
