AI-powered tool for separating multi-statement PDF files using LangChain and LangGraph
Bank Statement Separator
An AI-powered tool that automatically processes PDF files containing multiple bank statements and separates them into individual files. Built with LangChain and LangGraph for robust stateful AI processing.
🚀 Features
- AI-Powered Analysis: Uses advanced language models to detect statement boundaries
- Multiple LLM Support: Compatible with OpenAI GPT models and Ollama local models
- PDF Processing: Efficient document manipulation using PyMuPDF
- Metadata Extraction: Automatically extracts account numbers, dates, and bank information
- File Organization: Generates meaningful filenames following configurable patterns
- Error Handling: Comprehensive logging and audit trails
- Error Detection & Tagging: Automatic identification and tagging of processing issues (v0.3.0+)
- Security Controls: Built-in safeguards for production use
- Paperless Integration: Optional integration with Paperless-ngx for document management
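The File Organization feature above mentions configurable filename patterns. A minimal sketch of how such a pattern could be applied follows. The `StatementMetadata` fields, the `build_filename` helper, and the default template are all hypothetical and are not taken from the project.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class StatementMetadata:
    """Hypothetical metadata fields; the project's real schema may differ."""
    bank: str
    account_number: str
    period_end: date


def build_filename(
    meta: StatementMetadata,
    pattern: str = "{bank}-{account_last4}-{period_end}.pdf",  # assumed template
) -> str:
    """Fill a filename template and replace characters that are unsafe in file names."""
    name = pattern.format(
        bank=meta.bank.lower(),
        account_last4=meta.account_number[-4:],  # keep the full number out of file names
        period_end=meta.period_end.isoformat(),
    )
    # Collapse whitespace and anything outside a conservative allow-list into "-".
    return re.sub(r"[^A-Za-z0-9._-]+", "-", name)


meta = StatementMetadata("Example Bank", "123456789012", date(2024, 3, 31))
print(build_filename(meta))  # example-bank-9012-2024-03-31.pdf
```

Using only the last four digits of the account number is a design choice in this sketch, so that generated file names do not leak full account numbers.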
📋 Requirements
- Python 3.11+
- OpenAI API key (for LLM functionality)
- UV package manager
- Optional: Paperless-ngx instance (for document management and error tagging)
🛠 Installation
1. Clone the Repository
git clone https://github.com/madeinoz67/bank-statement-separator.git
cd bank-statement-separator
2. Install Dependencies
# Install with UV
uv sync
# Install with dev dependencies
uv sync --group dev
3. Configure Environment
Copy the example environment file and configure your settings:
cp .env.example .env
Edit .env to set your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
📖 Usage
Basic Usage
# Process a single PDF file
uv run python -m src.bank_statement_separator.main input.pdf
# Specify output directory
uv run python -m src.bank_statement_separator.main input.pdf -o ./output
# Use verbose logging
uv run python -m src.bank_statement_separator.main input.pdf --verbose
# Dry run mode (no files written)
uv run python -m src.bank_statement_separator.main input.pdf --dry-run
Advanced Options
# Specify LLM model
uv run python -m src.bank_statement_separator.main input.pdf --model gpt-4o
# Set custom processing limits
uv run python -m src.bank_statement_separator.main input.pdf --max-pages 50
# Enable debug mode
uv run python -m src.bank_statement_separator.main input.pdf --debug
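For readers who want to script around these flags, the following sketch mirrors them with `argparse` from the standard library. It is an illustration only: the project's actual CLI may be built with a different library, and the `--output` long form, the `prog` name, and every default value here are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags shown above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="bank-statement-separator")
    parser.add_argument("input_pdf", help="PDF containing multiple statements")
    # "-o" appears in the usage examples; the "--output" long form is assumed.
    parser.add_argument("-o", "--output", default="./output", help="output directory")
    parser.add_argument("--model", default=None, help="LLM model name, e.g. gpt-4o")
    parser.add_argument("--max-pages", type=int, default=None, help="processing limit")
    parser.add_argument("--verbose", action="store_true", help="verbose logging")
    parser.add_argument("--debug", action="store_true", help="debug mode")
    parser.add_argument("--dry-run", action="store_true", help="do not write files")
    return parser


args = build_parser().parse_args(
    ["input.pdf", "-o", "./out", "--max-pages", "50", "--dry-run"]
)
print(args.output, args.max_pages, args.dry_run)  # ./out 50 True
```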
Configuration
The application uses environment variables for configuration. Key settings include:
- OPENAI_API_KEY: Your OpenAI API key
- OLLAMA_BASE_URL: Ollama server URL (for local models)
- LOG_LEVEL: Logging verbosity (DEBUG, INFO, WARNING, ERROR)
- MAX_PAGES_PER_STATEMENT: Processing limits
- OUTPUT_DIR: Default output directory
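As a rough illustration of how these variables might be consumed, the sketch below reads them into a typed settings object using only the standard library. The `Settings` class, the `load_settings` helper, and the default values (`INFO`, `50`, `./output`) are assumptions made for this example, not the project's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    openai_api_key: Optional[str]
    ollama_base_url: Optional[str]
    log_level: str
    max_pages_per_statement: int
    output_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping; all defaults are placeholders."""
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in _VALID_LOG_LEVELS:
        raise ValueError(f"Unsupported LOG_LEVEL: {log_level}")
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        ollama_base_url=env.get("OLLAMA_BASE_URL") or None,
        log_level=log_level,
        max_pages_per_statement=int(env.get("MAX_PAGES_PER_STATEMENT", "50")),
        output_dir=env.get("OUTPUT_DIR", "./output"),
    )


settings = load_settings({"OPENAI_API_KEY": "sk-example", "LOG_LEVEL": "debug"})
print(settings.log_level, settings.max_pages_per_statement)  # DEBUG 50
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.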
Error Detection Configuration (v0.3.0+)
# Enable error detection and tagging
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS="error:processing,needs:review"
PAPERLESS_ERROR_SEVERITY_LEVELS="medium,high,critical"
See Configuration Guide for complete details.
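The error detection variables hold a boolean flag and comma-separated lists. A minimal sketch of parsing such values is shown below. The `parse_bool` and `parse_csv` helpers are hypothetical, and the set of strings accepted as "true" is an assumption.

```python
def parse_bool(value: str) -> bool:
    """Interpret common truthy strings; the accepted spellings are an assumption."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def parse_csv(value: str) -> list[str]:
    """Split a comma-separated value, dropping empty items and stray whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Values as they would look after a .env loader has stripped the quotes.
env = {
    "PAPERLESS_ERROR_DETECTION_ENABLED": "true",
    "PAPERLESS_ERROR_TAGS": "error:processing,needs:review",
    "PAPERLESS_ERROR_SEVERITY_LEVELS": "medium,high,critical",
}

enabled = parse_bool(env["PAPERLESS_ERROR_DETECTION_ENABLED"])
tags = parse_csv(env["PAPERLESS_ERROR_TAGS"])
levels = set(parse_csv(env["PAPERLESS_ERROR_SEVERITY_LEVELS"]))
print(enabled, tags, sorted(levels))
```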
🏗 Architecture
The system consists of several key components:
- Workflow Engine: LangGraph-based state machine for processing steps
- LLM Analyzer: AI-powered boundary detection and metadata extraction
- PDF Processor: Document manipulation and text extraction
- Error Handler: Comprehensive error management and recovery
- Error Detection System: Automatic identification and tagging of processing issues (v0.3.0+)
- Rate Limiter: API usage controls and backoff mechanisms
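The Rate Limiter component is described as providing backoff mechanisms. One common way to do this is exponential backoff with jitter, sketched below with the standard library. The `call_with_backoff` helper, its parameters, and its default values are hypothetical and do not describe the project's implementation.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponentially growing delays when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # real code would catch only rate-limit or network errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    """Simulated API call that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"


delays: list[float] = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays, do not wait
print(result, len(delays))  # ok 2
```

Injecting `sleep` as a parameter lets the example run instantly while still showing which delays would have been applied.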
Processing Pipeline
- PDF Ingestion: Load and validate input documents
- Document Analysis: Extract text and structural information
- Statement Detection: AI boundary detection using LLM analysis
- Metadata Extraction: Account and period information extraction
- PDF Generation: Create individual statement files
- File Organization: Apply naming conventions and organization
- Paperless Upload: Optional document management integration
- Error Detection: Automatic identification and tagging of processing issues (v0.3.0+)
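The pipeline above can be pictured as a series of steps that each receive and return a shared state. The project builds this with LangGraph. The sketch below deliberately uses plain Python functions instead, so that it stays dependency-free, and it covers only the first three stages. The `WorkflowState` fields, the step names, and the placeholder data are all hypothetical.

```python
from typing import Callable, TypedDict


class WorkflowState(TypedDict, total=False):
    """Hypothetical state passed between steps."""
    input_path: str
    page_texts: list[str]
    boundaries: list[tuple[int, int]]
    errors: list[str]


Step = Callable[[WorkflowState], WorkflowState]


def ingest(state: WorkflowState) -> WorkflowState:
    # Placeholder validation: a real step would open and inspect the PDF.
    if not state["input_path"].lower().endswith(".pdf"):
        return {**state, "errors": state.get("errors", []) + ["not a PDF"]}
    return state


def analyze(state: WorkflowState) -> WorkflowState:
    # Placeholder text: a real step would extract text from each page.
    pages = ["Statement A p1", "Statement A p2", "Statement B p1"]
    return {**state, "page_texts": pages}


def detect(state: WorkflowState) -> WorkflowState:
    # Placeholder boundaries: a real step would ask the LLM for page ranges.
    return {**state, "boundaries": [(0, 1), (2, 2)]}


def run_pipeline(state: WorkflowState, steps: list[Step]) -> WorkflowState:
    """Run steps in order, stopping as soon as one records an error."""
    for step in steps:
        state = step(state)
        if state.get("errors"):
            break
    return state


final = run_pipeline({"input_path": "input.pdf"}, [ingest, analyze, detect])
print(final["boundaries"])  # [(0, 1), (2, 2)]
```

Each step returns a new state rather than mutating the old one, which makes it straightforward to inspect what every stage contributed.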
🧪 Testing
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src
# Run specific test categories
uv run pytest tests/unit/
uv run pytest tests/integration/
# Test error detection and tagging (v0.3.0+)
uv run pytest tests/unit/test_error_tagging*.py -v
uv run python tests/manual/test_error_tagging_e2e.py
🤝 Contributing
We welcome contributions! Please follow these guidelines:
Development Setup
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Install development dependencies: uv sync --group dev
- Make your changes
- Run tests: uv run pytest
- Format code: uv run ruff format .
- Check linting: uv run ruff check .
- Commit your changes with a descriptive message
- Push to your fork and create a pull request
Code Quality
- Follow PEP 8 style guidelines
- Use type hints for all function parameters and return values
- Write comprehensive docstrings for public APIs
- Add tests for new features and bug fixes
- Keep functions focused and small
- Use descriptive variable and function names
Pull Request Process
- Ensure all tests pass
- Update documentation if needed
- Add appropriate commit trailers (see below)
- Request review from maintainers
Commit Guidelines
For commits fixing bugs or adding features based on user reports:
git commit --trailer "Reported-by:<name>"
For commits related to a GitHub issue:
git commit --trailer "Github-Issue:#<number>"
📚 Documentation
📖 Read the full documentation online
Complete documentation is available in the docs/ directory.
Build documentation locally:
uv run mkdocs serve
🔍 Error Detection & Tagging (v0.3.0+)
The system includes comprehensive error detection that automatically identifies processing issues and applies configurable tags for manual review:
Error Categories
- LLM Analysis Failures: API errors, model failures, fallback usage
- Boundary Detection Issues: Low confidence boundaries, suspicious patterns
- PDF Processing Errors: File corruption, access issues, format problems
- Metadata Extraction Failures: Missing account data, date parsing issues
- Validation Failures: Content validation, integrity checks
- File Output Issues: Write failures, permissions, disk space
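To make the categories above concrete, the sketch below models them as an enum and decides whether a document should be tagged based on configured severity levels. The `ErrorCategory` values, the `ProcessingError` class, the `tags_for` helper, and the `low` severity are assumptions for illustration and are not the project's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(Enum):
    """One member per category listed above; the string values are assumed."""
    LLM_ANALYSIS = "llm_analysis"
    BOUNDARY_DETECTION = "boundary_detection"
    PDF_PROCESSING = "pdf_processing"
    METADATA_EXTRACTION = "metadata_extraction"
    VALIDATION = "validation"
    FILE_OUTPUT = "file_output"


@dataclass
class ProcessingError:
    category: ErrorCategory
    severity: str  # "low" is assumed; "medium", "high", "critical" appear in the config
    message: str


def tags_for(
    errors: list[ProcessingError], tag_levels: set[str], base_tags: list[str]
) -> list[str]:
    """Return the configured tags if any error has a severity that triggers tagging."""
    if any(error.severity in tag_levels for error in errors):
        return list(base_tags)
    return []


errors = [
    ProcessingError(ErrorCategory.BOUNDARY_DETECTION, "low", "low confidence boundary"),
    ProcessingError(ErrorCategory.METADATA_EXTRACTION, "high", "missing account number"),
]
configured_levels = {"medium", "high", "critical"}
configured_tags = ["error:processing", "needs:review"]
print(tags_for(errors, configured_levels, configured_tags))
```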
Error Detection Setup
# Basic error detection setup
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS="error:processing,needs:review"
PAPERLESS_ERROR_SEVERITY_LEVELS="medium,high,critical"
# Advanced configuration
PAPERLESS_ERROR_TAG_THRESHOLD=0.7
PAPERLESS_ERROR_BATCH_TAGGING=true
See the Paperless Integration Guide for complete configuration details.
📦 Dependencies
Core Dependencies
- langchain: LLM integration framework
- langgraph: Stateful workflow orchestration
- pymupdf: PDF processing
- pydantic: Data validation
- rich: Terminal formatting
- python-dotenv: Environment management
Development Dependencies
- pytest: Testing framework
- ruff: Code formatting and linting
- pyright: Type checking
- mkdocs: Documentation generation
See pyproject.toml for complete dependency list.
🔒 Security
- API keys are managed through environment variables
- Input validation on all user-provided data
- Rate limiting for external API calls
- Comprehensive logging for audit trails
- No sensitive data stored in application logs
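One way to keep sensitive data out of logs is a `logging.Filter` that masks account numbers before a record is emitted. The sketch below shows the idea. The `RedactAccountNumbers` class and the 8 to 16 digit pattern are assumptions for this example, and the project may handle redaction differently.

```python
import logging
import re

# Assumed shape of an account number: a standalone run of 8 to 16 digits.
_ACCOUNT_RE = re.compile(r"\b\d{8,16}\b")


def _mask(match: re.Match) -> str:
    """Replace all but the last four digits with asterisks."""
    digits = match.group()
    return "*" * (len(digits) - 4) + digits[-4:]


class RedactAccountNumbers(logging.Filter):
    """Mask long digit runs in log messages before they reach any handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then mask the result.
        record.msg = _ACCOUNT_RE.sub(_mask, record.getMessage())
        record.args = ()
        return True  # keep the record, only its text is changed


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("separator-demo")
logger.addFilter(RedactAccountNumbers())
logger.info("Processing account %s", "123456789012")
# The emitted message reads: Processing account ********9012
```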
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built with LangChain and LangGraph
- PDF processing powered by PyMuPDF
- Inspired by the need for automated document processing in financial workflows
🐛 Issues & Support
- Report bugs via GitHub Issues
- Check the Troubleshooting Guide for common issues
- Review Known Issues for current limitations
Note: This tool requires an OpenAI API key for AI functionality. If the LLM is unavailable, it falls back to pattern matching.
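The pattern-matching fallback is not described in detail here. As an illustration of what such a fallback could look like, the sketch below starts a new statement wherever a page contains a "Page 1 of N" marker. The `detect_boundaries` helper and the marker regex are assumptions, and the project's real heuristics may be quite different.

```python
import re

# Assumed marker for the first page of a statement, e.g. "Page 1 of 3".
_FIRST_PAGE_RE = re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE)


def detect_boundaries(page_texts: list[str]) -> list[tuple[int, int]]:
    """Return inclusive (start, end) page index ranges, one per detected statement."""
    if not page_texts:
        return []
    starts = [i for i, text in enumerate(page_texts) if _FIRST_PAGE_RE.search(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # pages before the first marker form their own statement
    # Each statement ends on the page just before the next one starts.
    ends = [start - 1 for start in starts[1:]] + [len(page_texts) - 1]
    return list(zip(starts, ends))


pages = [
    "Example Bank  Page 1 of 2",
    "Example Bank  Page 2 of 2",
    "Other Bank  Page 1 of 1",
]
print(detect_boundaries(pages))  # [(0, 1), (2, 2)]
```

When no marker is found, the sketch treats the whole document as a single statement, which is a conservative default that avoids splitting a file incorrectly.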