AI-powered tool for separating multi-statement PDF files using LangChain and LangGraph

Bank Statement Separator

An AI-powered tool that automatically processes PDF files containing multiple bank statements and separates them into individual files. Built with LangChain and LangGraph for robust stateful AI processing.

🚀 Features

  • AI-Powered Analysis: Uses advanced language models to detect statement boundaries
  • Multiple LLM Support: Compatible with OpenAI GPT models and Ollama local models
  • PDF Processing: Efficient document manipulation using PyMuPDF
  • Metadata Extraction: Automatically extracts account numbers, dates, and bank information
  • File Organization: Generates meaningful filenames following configurable patterns
  • Error Handling: Comprehensive logging and audit trails
  • Error Detection & Tagging: Automatic identification and tagging of processing issues (v0.3.0+)
  • Security Controls: Built-in safeguards for production use
  • Paperless Integration: Optional integration with Paperless-ngx for document management

📋 Requirements

  • Python 3.11+
  • OpenAI API key (for LLM functionality)
  • UV package manager
  • Optional: Paperless-ngx instance (for document management and error tagging)

🛠 Installation

1. Clone the Repository

git clone https://github.com/madeinoz67/bank-statement-separator.git
cd bank-statement-separator

2. Install Dependencies

# Install with UV
uv sync

# Install with dev dependencies
uv sync --group dev

3. Configure Environment

Copy the example environment file and configure your settings:

cp .env.example .env

Edit .env to set your OpenAI API key:

OPENAI_API_KEY=your_api_key_here

📖 Usage

Basic Usage

# Process a single PDF file
uv run python -m src.bank_statement_separator.main input.pdf

# Specify output directory
uv run python -m src.bank_statement_separator.main input.pdf -o ./output

# Use verbose logging
uv run python -m src.bank_statement_separator.main input.pdf --verbose

# Dry run mode (no files written)
uv run python -m src.bank_statement_separator.main input.pdf --dry-run

Advanced Options

# Specify LLM model
uv run python -m src.bank_statement_separator.main input.pdf --model gpt-4o

# Set custom processing limits
uv run python -m src.bank_statement_separator.main input.pdf --max-pages 50

# Enable debug mode
uv run python -m src.bank_statement_separator.main input.pdf --debug
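
For scripted or batch runs, the invocations above can be assembled from Python. The following is a minimal sketch that only builds the argument list from the flags shown in this section; `build_command` is a hypothetical helper and is not part of the package.

```python
from pathlib import Path
from typing import Optional


def build_command(
    pdf: Path,
    output_dir: Optional[Path] = None,
    model: Optional[str] = None,
    max_pages: Optional[int] = None,
    dry_run: bool = False,
    verbose: bool = False,
) -> list[str]:
    """Assemble the documented CLI invocation for one PDF (hypothetical helper)."""
    cmd = ["uv", "run", "python", "-m", "src.bank_statement_separator.main", str(pdf)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if model is not None:
        cmd += ["--model", model]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if dry_run:
        cmd.append("--dry-run")
    if verbose:
        cmd.append("--verbose")
    return cmd


# Example: a dry run with an explicit model
print(" ".join(build_command(Path("input.pdf"), model="gpt-4o", dry_run=True)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the dry-run output looks right.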

Configuration

The application uses environment variables for configuration. Key settings include:

  • OPENAI_API_KEY: Your OpenAI API key
  • OLLAMA_BASE_URL: Ollama server URL (for local models)
  • LOG_LEVEL: Logging verbosity (DEBUG, INFO, WARNING, ERROR)
  • MAX_PAGES_PER_STATEMENT: Maximum number of pages processed per statement
  • OUTPUT_DIR: Default output directory
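
To illustrate how these variables might be consumed, here is a minimal sketch using only the standard library. The variable names come from the list above; the `Settings` class, the `load_settings` function, the integer type of `MAX_PAGES_PER_STATEMENT`, and every default value are assumptions, not the project's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    ollama_base_url: Optional[str]
    log_level: str
    max_pages_per_statement: int
    output_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables; all defaults here are assumptions."""
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in {"DEBUG", "INFO", "WARNING", "ERROR"}:
        raise ValueError(f"Unsupported LOG_LEVEL: {log_level}")
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY"),
        ollama_base_url=env.get("OLLAMA_BASE_URL"),
        log_level=log_level,
        max_pages_per_statement=int(env.get("MAX_PAGES_PER_STATEMENT", "50")),
        output_dir=env.get("OUTPUT_DIR", "./output"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.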

Error Detection Configuration (v0.3.0+)

# Enable error detection and tagging
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS="error:processing,needs:review"
PAPERLESS_ERROR_SEVERITY_LEVELS="medium,high,critical"

See the Configuration Guide for complete details.
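
The tag and severity settings above are comma-separated strings. A minimal sketch of how such values could be parsed is shown below; `parse_csv`, `parse_bool`, and `load_error_tagging` are hypothetical helpers, and the accepted boolean spellings are an assumption.

```python
from typing import Any, Mapping


def parse_csv(value: str) -> list[str]:
    """Split a comma-separated setting, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings (assumed, not confirmed by the project)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_error_tagging(env: Mapping[str, str]) -> dict[str, Any]:
    """Collect the three documented error-detection settings into one dict."""
    return {
        "enabled": parse_bool(env.get("PAPERLESS_ERROR_DETECTION_ENABLED", "false")),
        "tags": parse_csv(env.get("PAPERLESS_ERROR_TAGS", "")),
        "severity_levels": parse_csv(env.get("PAPERLESS_ERROR_SEVERITY_LEVELS", "")),
    }
```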

🏗 Architecture

The system consists of several key components:

  • Workflow Engine: LangGraph-based state machine for processing steps
  • LLM Analyzer: AI-powered boundary detection and metadata extraction
  • PDF Processor: Document manipulation and text extraction
  • Error Handler: Comprehensive error management and recovery
  • Error Detection System: Automatic identification and tagging of processing issues (v0.3.0+)
  • Rate Limiter: API usage controls and backoff mechanisms
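
As an illustration of the backoff idea behind the Rate Limiter component, the sketch below retries a call with exponentially growing, jittered delays. It is a generic pattern, not the project's implementation; `call_with_backoff` and all of its parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry func, doubling the wait after each failure and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be checked in tests without actually waiting.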

Processing Pipeline

  1. PDF Ingestion: Load and validate input documents
  2. Document Analysis: Extract text and structural information
  3. Statement Detection: AI boundary detection using LLM analysis
  4. Metadata Extraction: Account and period information extraction
  5. PDF Generation: Create individual statement files
  6. File Organization: Apply naming conventions and organization
  7. Paperless Upload: Optional document management integration
  8. Error Detection: Automatic identification and tagging of processing issues (v0.3.0+)
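
The pipeline above passes a shared state through an ordered series of steps. The sketch below mimics that flow in plain Python rather than LangGraph, with three placeholder steps standing in for the real ones; every function name, state key, and filename format here is an assumption made for illustration.

```python
from typing import Any, Callable

State = dict[str, Any]
Step = Callable[[State], State]


def ingest(state: State) -> State:
    # Placeholder validation: only the file extension is checked.
    if not state["source"].lower().endswith(".pdf"):
        raise ValueError("input must be a PDF")
    return state


def detect_statements(state: State) -> State:
    # Placeholder: treat the whole document as a single statement.
    return {**state, "boundaries": [(0, state["page_count"] - 1)]}


def organize(state: State) -> State:
    # Name each output after its 1-based page range.
    names = [
        f"statement_{i + 1}_pages_{start + 1}-{end + 1}.pdf"
        for i, (start, end) in enumerate(state["boundaries"])
    ]
    return {**state, "outputs": names}


def run_pipeline(state: State, steps: list[Step]) -> State:
    """Run steps in order; record the first failure and stop."""
    state = {**state, "errors": []}
    for step in steps:
        try:
            state = step(state)
        except Exception as exc:
            state["errors"].append(f"{step.__name__}: {exc}")
            break
    return state
```

Recording failures in the state instead of raising lets a later stage, such as error detection, inspect what went wrong.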

🧪 Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src

# Run specific test categories
uv run pytest tests/unit/
uv run pytest tests/integration/

# Test error detection and tagging (v0.3.0+)
uv run pytest tests/unit/test_error_tagging*.py -v
uv run python tests/manual/test_error_tagging_e2e.py

🤝 Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Install development dependencies: uv sync --group dev
  4. Make your changes
  5. Run tests: uv run pytest
  6. Format code: uv run ruff format .
  7. Check linting: uv run ruff check .
  8. Commit your changes with a descriptive message
  9. Push to your fork and create a pull request

Code Quality

  • Follow PEP 8 style guidelines
  • Use type hints for all function parameters and return values
  • Write comprehensive docstrings for public APIs
  • Add tests for new features and bug fixes
  • Keep functions focused and small
  • Use descriptive variable and function names

Pull Request Process

  1. Ensure all tests pass
  2. Update documentation if needed
  3. Add appropriate commit trailers (see below)
  4. Request review from maintainers

Commit Guidelines

For commits fixing bugs or adding features based on user reports:

git commit --trailer "Reported-by:<name>"

For commits related to a GitHub issue:

git commit --trailer "Github-Issue:#<number>"

📚 Documentation

Complete documentation is available online and in the docs/ directory.

Build and preview the documentation locally:

uv run mkdocs serve

🔍 Error Detection & Tagging (v0.3.0+)

The system includes comprehensive error detection that automatically identifies processing issues and applies configurable tags for manual review:

Error Categories

  • LLM Analysis Failures: API errors, model failures, fallback usage
  • Boundary Detection Issues: Low confidence boundaries, suspicious patterns
  • PDF Processing Errors: File corruption, access issues, format problems
  • Metadata Extraction Failures: Missing account data, date parsing issues
  • Validation Failures: Content validation, integrity checks
  • File Output Issues: Write failures, permissions, disk space

Error Detection Setup

# Basic error detection setup
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS="error:processing,needs:review"
PAPERLESS_ERROR_SEVERITY_LEVELS="medium,high,critical"

# Advanced configuration
PAPERLESS_ERROR_TAG_THRESHOLD=0.7
PAPERLESS_ERROR_BATCH_TAGGING=true

See the Paperless Integration Guide for complete configuration details.
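
To make the interaction between severity levels, the threshold, and tags concrete, here is a minimal sketch of one possible decision rule. The semantics are assumed: this README does not define what PAPERLESS_ERROR_TAG_THRESHOLD measures, so the `score` field, the `DetectedError` class, and the `tags_for` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class DetectedError:
    category: str    # e.g. "boundary_detection" (illustrative label)
    severity: str    # one of the configured severity levels
    score: float     # assumed 0..1 confidence that the issue is real


def tags_for(
    errors: Iterable[DetectedError],
    tags: Sequence[str],
    severity_levels: Sequence[str],
    threshold: float,
) -> list[str]:
    """Return the configured tags if any error qualifies, else no tags."""
    qualifying = [
        e for e in errors
        if e.severity in severity_levels and e.score >= threshold
    ]
    return list(tags) if qualifying else []
```

Under this assumed rule, an error is tagged only when its severity is in the configured list and its score meets the threshold.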

📦 Dependencies

Core Dependencies

  • langchain: LLM integration framework
  • langgraph: Stateful workflow orchestration
  • pymupdf: PDF processing
  • pydantic: Data validation
  • rich: Terminal formatting
  • python-dotenv: Environment management

Development Dependencies

  • pytest: Testing framework
  • ruff: Code formatting and linting
  • pyright: Type checking
  • mkdocs: Documentation generation

See pyproject.toml for the complete dependency list.

🔒 Security

  • API keys are managed through environment variables
  • Input validation on all user-provided data
  • Rate limiting for external API calls
  • Comprehensive logging for audit trails
  • No sensitive data stored in application logs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with LangChain and LangGraph
  • PDF processing powered by PyMuPDF
  • Inspired by the need for automated document processing in financial workflows

🐛 Issues & Support


Note: This tool requires an OpenAI API key for AI functionality. It falls back to pattern matching if the LLM is unavailable.
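
A pattern-matching fallback of the kind mentioned in the note could, for example, look for "Page 1 of N" markers to find where each statement starts. The sketch below shows that idea only; the regular expression and the `detect_boundaries` function are assumptions and do not reflect the patterns the tool actually uses.

```python
import re

# Hypothetical heuristic: a "Page 1 of N" marker starts a new statement.
PAGE_ONE = re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE)


def detect_boundaries(pages: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) page indexes (0-based, inclusive) per statement."""
    if not pages:
        return []
    starts = [i for i, text in enumerate(pages) if PAGE_ONE.search(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # the first page always opens a statement
    # Each statement ends on the page before the next one starts.
    ends = [s - 1 for s in starts[1:]] + [len(pages) - 1]
    return list(zip(starts, ends))
```

When no marker is found, the whole document is treated as one statement, which keeps the fallback conservative.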
