
AI-powered tool for separating multi-statement PDF files using LangChain and LangGraph


Bank Statement Separator


An AI-powered tool that automatically processes PDF files containing multiple bank statements and separates them into individual files. Built with LangChain and LangGraph for robust stateful AI processing.

🚀 Features

  • AI-Powered Analysis: Uses large language models (LLMs) to detect where one statement ends and the next begins
  • Multiple LLM Support: Compatible with OpenAI GPT models and Ollama local models
  • PDF Processing: Efficient document manipulation using PyMuPDF
  • Metadata Extraction: Automatically extracts account numbers, dates, and bank information
  • File Organization: Generates meaningful filenames following configurable patterns
  • Error Handling: Comprehensive logging and audit trails
  • Security Controls: Built-in safeguards for production use
  • Paperless Integration: Optional integration with Paperless-ngx for document management
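
As an illustration of the File Organization feature, here is a minimal sketch of pattern-based filename generation. The pattern string, its placeholders (`bank`, `account_last4`, `statement_date`), and the `build_filename` helper are all hypothetical; the tool's actual pattern syntax may differ.

```python
import re
from datetime import date

# Hypothetical pattern; the tool's real placeholders may differ.
DEFAULT_PATTERN = "{bank}-{account_last4}-{statement_date}.pdf"


def build_filename(
    bank: str,
    account_number: str,
    statement_date: date,
    pattern: str = DEFAULT_PATTERN,
) -> str:
    """Render a filesystem-safe filename from extracted statement metadata."""
    # Keep only the last four digits so the full account number
    # never appears in a filename.
    digits = re.sub(r"\D", "", account_number)
    name = pattern.format(
        bank=bank.strip().lower(),
        account_last4=digits[-4:],
        statement_date=statement_date.isoformat(),
    )
    # Replace anything outside a conservative character set with "-".
    return re.sub(r"[^A-Za-z0-9._-]+", "-", name)
```

Truncating the account number to its last four digits is a design choice of this sketch, made so that filenames do not leak full account numbers.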

📋 Requirements

  • Python 3.9+
  • OpenAI API key (for LLM functionality)
  • UV package manager

🛠 Installation

1. Clone the Repository

git clone https://github.com/madeinoz67/bank-statement-separator.git
cd bank-statement-separator

2. Install Dependencies

# Install with UV
uv sync

# Install with dev dependencies
uv sync --group dev

3. Configure Environment

Copy the example environment file and configure your settings:

cp .env.example .env

Edit .env to set your OpenAI API key:

OPENAI_API_KEY=your_api_key_here

📖 Usage

Basic Usage

# Process a single PDF file
uv run python -m src.bank_statement_separator.main input.pdf

# Specify output directory
uv run python -m src.bank_statement_separator.main input.pdf -o ./output

# Use verbose logging
uv run python -m src.bank_statement_separator.main input.pdf --verbose

# Dry run mode (no files written)
uv run python -m src.bank_statement_separator.main input.pdf --dry-run

Advanced Options

# Specify LLM model
uv run python -m src.bank_statement_separator.main input.pdf --model gpt-4o

# Set custom processing limits
uv run python -m src.bank_statement_separator.main input.pdf --max-pages 50

# Enable debug mode
uv run python -m src.bank_statement_separator.main input.pdf --debug

Configuration

The application uses environment variables for configuration. Key settings include:

  • OPENAI_API_KEY: Your OpenAI API key
  • OLLAMA_BASE_URL: Ollama server URL (for local models)
  • LOG_LEVEL: Logging verbosity (DEBUG, INFO, WARNING, ERROR)
  • MAX_PAGES_PER_STATEMENT: Maximum number of pages processed per statement
  • OUTPUT_DIR: Default output directory

See the Configuration Guide for complete details.
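
To show how these variables might be read and validated, here is a minimal standard-library sketch. The `Settings` class, the default values (`INFO`, `50`, `./output`), and the validation rules are assumptions made for illustration, not the tool's actual configuration code.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

_VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the environment variables listed above."""

    openai_api_key: Optional[str]
    ollama_base_url: Optional[str]
    log_level: str
    max_pages_per_statement: int
    output_dir: Path

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        log_level = env.get("LOG_LEVEL", "INFO").upper()  # assumed default
        if log_level not in _VALID_LOG_LEVELS:
            raise ValueError(f"Unsupported LOG_LEVEL: {log_level!r}")

        max_pages = int(env.get("MAX_PAGES_PER_STATEMENT", "50"))  # assumed default
        if max_pages <= 0:
            raise ValueError("MAX_PAGES_PER_STATEMENT must be positive")

        return cls(
            # Treat empty strings as "not set".
            openai_api_key=env.get("OPENAI_API_KEY") or None,
            ollama_base_url=env.get("OLLAMA_BASE_URL") or None,
            log_level=log_level,
            max_pages_per_statement=max_pages,
            output_dir=Path(env.get("OUTPUT_DIR", "./output")),  # assumed default
        )
```

Calling `Settings.from_env()` with no arguments reads the process environment; passing a plain dictionary instead makes the loader easy to test.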

🏗 Architecture

The system consists of several key components:

  • Workflow Engine: LangGraph-based state machine for processing steps
  • LLM Analyzer: AI-powered boundary detection and metadata extraction
  • PDF Processor: Document manipulation and text extraction
  • Error Handler: Comprehensive error management and recovery
  • Rate Limiter: API usage controls and backoff mechanisms
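
The backoff behaviour named under Rate Limiter can be sketched as a retry loop with exponential delays and a little jitter. This is a generic illustration, not the project's implementation: `call_with_backoff` and its parameters are hypothetical, and a real caller would catch only the provider's rate-limit errors rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on failure with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: narrow this to rate-limit errors in practice
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter keeps concurrent clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the delays can be observed or skipped in tests.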

Processing Pipeline

  1. PDF Ingestion: Load and validate input documents
  2. Document Analysis: Extract text and structural information
  3. Statement Detection: AI boundary detection using LLM analysis
  4. Metadata Extraction: Account and period information extraction
  5. PDF Generation: Create individual statement files
  6. File Organization: Apply naming conventions and organization
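
The six steps above can be pictured as functions that each take a shared state and return an updated copy. The sketch below uses plain Python rather than LangGraph, and the stage names, `State` shape, and `run_pipeline` helper are hypothetical; the stages are placeholders that only record that they ran.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(state: State) -> State:
        # Return a new dict so earlier states are never mutated.
        return {**state, "completed": [*state.get("completed", []), name]}

    return stage


# One placeholder per step of the processing pipeline, in order.
PIPELINE: List[Stage] = [
    make_stage(name)
    for name in (
        "pdf_ingestion",
        "document_analysis",
        "statement_detection",
        "metadata_extraction",
        "pdf_generation",
        "file_organization",
    )
]


def run_pipeline(input_path: str, stages: List[Stage] = PIPELINE) -> State:
    """Thread a state dict through each stage in sequence."""
    state: State = {"input_path": input_path, "completed": []}
    for stage in stages:
        state = stage(state)
    return state
```

A graph-based workflow engine adds branching and error routes on top of this linear shape, but the idea of passing state from step to step is the same.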

🧪 Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src

# Run specific test categories
uv run pytest tests/unit/
uv run pytest tests/integration/

🤝 Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Install development dependencies: uv sync --group dev
  4. Make your changes
  5. Run tests: uv run pytest
  6. Format code: uv run ruff format .
  7. Check linting: uv run ruff check .
  8. Commit your changes with a descriptive message
  9. Push to your fork and create a pull request

Code Quality

  • Follow PEP 8 style guidelines
  • Use type hints for all function parameters and return values
  • Write comprehensive docstrings for public APIs
  • Add tests for new features and bug fixes
  • Keep functions focused and small
  • Use descriptive variable and function names

Pull Request Process

  1. Ensure all tests pass
  2. Update documentation if needed
  3. Add appropriate commit trailers (see below)
  4. Request review from maintainers

Commit Guidelines

For commits fixing bugs or adding features based on user reports:

git commit --trailer "Reported-by:<name>"

For commits related to a GitHub issue:

git commit --trailer "Github-Issue:#<number>"

📚 Documentation

📖 Read the full documentation online

Complete documentation is available in the docs/ directory.

Build and preview the documentation locally:

uv run mkdocs serve

📦 Dependencies

Core Dependencies

  • langchain: LLM integration framework
  • langgraph: Stateful workflow orchestration
  • pymupdf: PDF processing
  • pydantic: Data validation
  • rich: Terminal formatting
  • python-dotenv: Environment management

Development Dependencies

  • pytest: Testing framework
  • ruff: Code formatting and linting
  • pyright: Type checking
  • mkdocs: Documentation generation

See pyproject.toml for the complete dependency list.

🔒 Security

  • API keys are managed through environment variables
  • Input validation on all user-provided data
  • Rate limiting for external API calls
  • Comprehensive logging for audit trails
  • No sensitive data stored in application logs
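
One way to keep sensitive data out of logs is a logging filter that masks long digit runs before a record is written. The sketch below is an assumption about how this could be done, not the project's code: the `RedactAccountNumbers` class and the "eight or more digits" rule are hypothetical, and the rule will also mask other long numbers such as compact dates.

```python
import logging
import re

# Digit runs of 8 or more; the last four digits are captured and kept.
_ACCOUNT_RE = re.compile(r"\b\d{4,}(\d{4})\b")


class RedactAccountNumbers(logging.Filter):
    """Mask account-like numbers in log messages, keeping the last 4 digits."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as %-style args are covered too.
        record.msg = _ACCOUNT_RE.sub(r"****\1", record.getMessage())
        record.args = None
        return True  # never drop the record, only rewrite it


# Attach to the handler so records from every logger pass through it.
handler = logging.StreamHandler()
handler.addFilter(RedactAccountNumbers())
logging.getLogger("statement_separator_sketch").addHandler(handler)
```

Attaching the filter to a handler rather than a logger matters because logger-level filters are not applied to records propagated from child loggers.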

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with LangChain and LangGraph
  • PDF processing powered by PyMuPDF
  • Inspired by the need for automated document processing in financial workflows

🐛 Issues & Support


Note: AI functionality requires an OpenAI API key. If the LLM is unavailable, the tool falls back to pattern matching.
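
To give a sense of what a pattern-matching fallback could look like, the sketch below splits a document wherever a page carries a "Page 1 of N" marker. The heuristic, the regular expression, and the `detect_boundaries` helper are assumptions for illustration; the tool's actual fallback rules are not described here.

```python
import re
from typing import List, Tuple

# Hypothetical heuristic: a "Page 1 of N" marker starts a new statement.
_FIRST_PAGE_RE = re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE)


def detect_boundaries(page_texts: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) page-index ranges, with end exclusive."""
    if not page_texts:
        return []
    starts = [i for i, text in enumerate(page_texts) if _FIRST_PAGE_RE.search(text)]
    # Treat page 0 as a start even without a marker so no pages are dropped.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    # Each statement ends where the next one starts; the last ends at the final page.
    ends = starts[1:] + [len(page_texts)]
    return list(zip(starts, ends))
```

Given the extracted text of each page, the function returns the page ranges that would become individual statement files.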

