Pure ETL pipeline for financial document processing - extracts data without analytical assumptions
Pure Financial Document ETL Pipeline
A data-engineering-focused ETL pipeline that extracts data from unstructured financial documents (PDF, spreadsheet, doc) without making analytical assumptions or classifications.
Pure ETL Philosophy
This pipeline follows the separation-of-concerns principle:
- ETL Layer: Extract → Structure → Export (no analysis)
- Analysis Layer: Separate downstream processing for ML/NLP
Installation
From PyPI
pip install boe-etl
From Source
git clone https://github.com/daleparr/boe-etl.git
cd boe-etl
pip install -e .
With Optional Dependencies
# For web frontend
pip install boe-etl[frontend]
# For development
pip install boe-etl[dev]
# All dependencies
pip install boe-etl[all]
Quick Start
Installation
pip install boe-etl
Launch Web Interface
boe-etl frontend
Access Application
Open your browser to: http://localhost:8501
What This ETL Does
Raw Data Extraction
- PDF Documents: Text extraction from earnings reports
- Excel Files: Multi-sheet data extraction
- Text Files: Direct text processing
- Sentence Segmentation: Clean, structured sentences
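The sentence segmentation step can be sketched as follows. This is a minimal illustration only: the function name `segment_sentences` and the splitting rule are assumptions, since this README does not show how boe-etl segments text internally.

```python
import re


def segment_sentences(text: str) -> list[str]:
    """Split raw text into trimmed sentences (hypothetical helper, not the package API)."""
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    normalized = re.sub(r"\s+", " ", text).strip()
    if not normalized:
        return []
    # Naive rule: break after ., ! or ? when the next token starts with a capital letter.
    # Decimals such as "$2.5" are not split because no whitespace follows the dot.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", normalized)
    return [part for part in parts if part]


if __name__ == "__main__":
    for sentence in segment_sentences("Revenue grew 15%. Costs fell.\nOutlook is stable!"):
        print(sentence)
```

A rule this simple will mis-split abbreviations such as "Inc. Revenue"; a production segmenter would need an abbreviation list or a dedicated library.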
Raw Feature Extraction
- all_financial_terms: Financial vocabulary found (no classification)
- financial_figures: Numbers and amounts extracted (no interpretation)
- temporal_indicators: Time-related language (no actual/projection labels)
- speaker_raw: Speaker patterns identified (no role classification)
Factual Boolean Flags
- has_financial_terms: Terms present or not
- has_financial_figures: Figures present or not
- has_temporal_language: Temporal words present or not
- has_speaker_identified: Speaker pattern found or not
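To make the distinction between raw extraction and analysis concrete, here is a minimal sketch of how the raw fields and their boolean flags could be derived from one sentence. The term list, the regular expressions, and the function name `extract_raw_features` are all assumptions chosen for illustration; the package's real vocabularies and patterns are not documented here.

```python
import re

# Hypothetical vocabulary and patterns, for illustration only.
FINANCIAL_TERMS = ("revenue", "profit", "margin", "capital")
FIGURE_PATTERN = re.compile(
    r"\$\d+(?:\.\d+)?(?:\s(?:million|billion|trillion))?|\d+(?:\.\d+)?%"
)
TEMPORAL_PATTERN = re.compile(
    r"\b(?:reported|this quarter|last year|next year)\b", re.IGNORECASE
)
SPEAKER_PATTERN = re.compile(r"^([A-Z][A-Za-z .'-]{1,40}):\s")


def extract_raw_features(sentence: str) -> dict:
    """Record what is present in the sentence without interpreting it."""
    lowered = sentence.lower()
    terms = [t for t in FINANCIAL_TERMS if re.search(rf"\b{re.escape(t)}\b", lowered)]
    figures = FIGURE_PATTERN.findall(sentence)
    temporal = [m.group(0).lower() for m in TEMPORAL_PATTERN.finditer(sentence)]
    speaker = SPEAKER_PATTERN.match(sentence)
    speaker_raw = speaker.group(1) if speaker else ""
    return {
        "all_financial_terms": "|".join(terms),
        "financial_figures": "|".join(figures),
        "temporal_indicators": "|".join(temporal),
        "speaker_raw": speaker_raw,
        # Flags only state presence or absence; they carry no judgement.
        "has_financial_terms": bool(terms),
        "has_financial_figures": bool(figures),
        "has_temporal_language": bool(temporal),
        "has_speaker_identified": bool(speaker_raw),
    }


if __name__ == "__main__":
    print(extract_raw_features("Jane Doe: Revenue rose 3%."))
```

Note that nothing in the sketch labels a sentence as actual or projected, or maps a speaker to a role; those decisions are left to downstream analysis.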
What This ETL Does NOT Do
No Analytical Assumptions
- No topic classification (Revenue & Growth, Risk Management, etc.)
- No actual vs projection classification
- No financial content relevance scoring
- No speaker role interpretation (CEO, CFO, Analyst)
No Machine Learning
- No topic modeling
- No sentiment analysis
- No content classification
- No predictive features
Output Schema
Core Fields
source_file,institution,quarter,sentence_id,speaker_raw,text,source_type
Raw Extraction Fields
all_financial_terms,financial_figures,financial_figures_text,temporal_indicators
Factual Flags
has_financial_terms,has_financial_figures,has_temporal_language,has_speaker_identified
Metadata
word_count,char_count,processing_date,extraction_timestamp
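The four field groups above can be combined into a single ordered column list. The sketch below writes rows with that header using the standard library; the constant names and the helper `rows_to_csv` are hypothetical, and the column order across groups is an assumption (the README lists the groups but not the final CSV order).

```python
import csv
import io

# Field names are taken from the schema above; grouping constants are illustrative.
CORE_FIELDS = ["source_file", "institution", "quarter", "sentence_id",
               "speaker_raw", "text", "source_type"]
RAW_EXTRACTION_FIELDS = ["all_financial_terms", "financial_figures",
                         "financial_figures_text", "temporal_indicators"]
FACTUAL_FLAGS = ["has_financial_terms", "has_financial_figures",
                 "has_temporal_language", "has_speaker_identified"]
METADATA_FIELDS = ["word_count", "char_count", "processing_date",
                   "extraction_timestamp"]
OUTPUT_COLUMNS = CORE_FIELDS + RAW_EXTRACTION_FIELDS + FACTUAL_FLAGS + METADATA_FIELDS


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize records to CSV text; unknown keys raise, missing keys become ''."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=OUTPUT_COLUMNS,
                            restval="", extrasaction="raise")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv([{"source_file": "report.pdf", "sentence_id": 1,
                        "text": "Revenue rose."}]))
```

Using `extrasaction="raise"` means a record carrying an analytical column that is not in the schema fails loudly instead of silently widening the output.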
Example Processing
Input Document
"We reported revenue of $2.5 billion this quarter, up 15% from last year."
Pure ETL Output
all_financial_terms: "revenue"
financial_figures: "$2.5 billion|15%"
temporal_indicators: "reported|this quarter|last year"
has_financial_terms: True
has_financial_figures: True
has_temporal_language: True
No Analysis Applied
- No classification as "actual" vs "projection"
- No topic assignment like "Revenue & Growth"
- No sentiment or relevance scoring
Architecture
Pure ETL Pipeline
Documents → Extract Text → Segment Sentences → Extract Raw Features → Export CSV
Downstream Analysis (Separate)
Raw CSV → Topic Modeling → Classification → Feature Engineering → ML Models
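The ETL chain above is a straight composition of stages that ends at the CSV boundary. The following toy sketch shows that shape with stand-in stages; every function name and the three-column output are assumptions for illustration, not the layout of `core.py`.

```python
import csv
import io


def extract_text(document: str) -> str:
    # Stand-in for a real PDF/Excel/text parser: just normalizes whitespace.
    return " ".join(document.split())


def segment_sentences(text: str) -> list[str]:
    # Naive split on full stops; it would wrongly split decimals such as "2.5".
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def extract_raw_features(sentences: list[str]) -> list[dict]:
    # Only factual, countable attributes are recorded.
    return [
        {"sentence_id": i, "text": s, "word_count": len(s.split())}
        for i, s in enumerate(sentences, start=1)
    ]


def export_csv(rows: list[dict]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sentence_id", "text", "word_count"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_etl(document: str) -> str:
    """Documents -> text -> sentences -> raw features -> CSV, with no analysis step."""
    return export_csv(extract_raw_features(segment_sentences(extract_text(document))))


if __name__ == "__main__":
    print(run_etl("Revenue rose.  Costs fell."))
```

Downstream topic modeling or classification would start from the CSV text returned by `run_etl`, so it can change without touching any stage shown here.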
Technical Features
Missing Value Handling
- No null values in output
- Consistent defaults for all fields
- NLP-ready datasets
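One way to guarantee null-free output is to pass every record through a defaulting step before export. The sketch below assumes empty strings for text fields, `False` for flags, and `0` for counts; those default values and the helper name `fill_defaults` are illustrative assumptions, as the README does not state which defaults the package uses. Only a subset of the schema is covered.

```python
# Field names come from the output schema; the defaults are assumed.
TEXT_FIELDS = ("source_file", "institution", "quarter", "speaker_raw", "text",
               "source_type", "all_financial_terms", "financial_figures",
               "financial_figures_text", "temporal_indicators")
FLAG_FIELDS = ("has_financial_terms", "has_financial_figures",
               "has_temporal_language", "has_speaker_identified")
COUNT_FIELDS = ("word_count", "char_count")


def fill_defaults(record: dict) -> dict:
    """Return a copy of the record in which no covered field is missing or None."""
    filled = {}
    for name in TEXT_FIELDS:
        value = record.get(name)
        filled[name] = "" if value is None else str(value)
    for name in FLAG_FIELDS:
        # A missing or None flag is treated as "not present".
        filled[name] = bool(record.get(name))
    for name in COUNT_FIELDS:
        value = record.get(name)
        filled[name] = 0 if value is None else int(value)
    return filled


if __name__ == "__main__":
    print(fill_defaults({"text": "Revenue rose.", "speaker_raw": None, "word_count": 2}))
```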
Multi-Format Support
- PDF processing with PyPDF2
- Excel processing with openpyxl
- Text file processing
- Graceful error handling
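A format dispatcher with graceful error handling could look roughly like this. The reader functions, the extension map, and the `{"text", "error"}` return shape are assumptions for illustration; only the use of PyPDF2 (`PdfReader`, available in PyPDF2 2.0 and later) and openpyxl (`load_workbook`) comes from the dependency list. The third-party imports are deferred so that plain-text processing works even when they are not installed.

```python
from pathlib import Path


def read_pdf(path: Path) -> str:
    from PyPDF2 import PdfReader  # lazy import: only needed for PDF input
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_excel(path: Path) -> str:
    from openpyxl import load_workbook  # lazy import: only needed for Excel input
    workbook = load_workbook(str(path), read_only=True, data_only=True)
    lines = []
    for sheet in workbook.worksheets:  # multi-sheet extraction
        for row in sheet.iter_rows(values_only=True):
            cells = [str(cell) for cell in row if cell is not None]
            if cells:
                lines.append(" ".join(cells))
    return "\n".join(lines)


def read_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical extension map; the formats the package accepts may differ.
READERS = {".pdf": read_pdf, ".xlsx": read_excel, ".txt": read_text}


def extract_text(path: str) -> dict:
    """Return {'text': ..., 'error': ...}; failures are reported, never raised."""
    file_path = Path(path)
    reader = READERS.get(file_path.suffix.lower())
    if reader is None:
        return {"text": "", "error": f"unsupported file type: {file_path.suffix or '(none)'}"}
    try:
        return {"text": reader(file_path), "error": ""}
    except Exception as exc:  # one bad file should not abort a batch
        return {"text": "", "error": f"{type(exc).__name__}: {exc}"}
```

Returning the error as data lets a batch run continue and record which files failed, which fits the processing-history feature of the web interface.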
Web Interface
- Drag-and-drop file upload
- Multi-institution processing
- Progress tracking
- Processing history
- CSV download
Use Cases
Financial Institutions
- Earnings call transcript processing
- Financial report data extraction
- Regulatory document structuring
- Multi-quarter analysis preparation
Research & Analytics
- Academic financial research
- Market analysis preparation
- NLP model training data
- Time series analysis datasets
Compliance & Audit
- Document processing audit trails
- Structured data for compliance
- Historical document analysis
- Regulatory reporting preparation
Development
Project Structure
boe-etl/
├── boe_etl/
│   ├── core.py          # Main ETL pipeline
│   ├── parsers/         # Document parsers
│   ├── schema.py        # Data standardization
│   └── frontend.py      # Web interface
├── setup.py             # Package configuration
├── requirements.txt     # Dependencies
└── README.md            # This file
Dependencies
- streamlit: Web interface framework
- pandas: Data manipulation
- PyPDF2: PDF text extraction
- openpyxl: Excel file processing
Naming Convention
Output files follow the standardized format:
{Institution}_{Quarter}_{Year}_PureETL_{User}_{Timestamp}
Example: JPMorgan_Q1_2025_PureETL_JohnSmith_20250526_143022.csv
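The naming format can be reproduced with a small helper. The function name `build_output_filename` and its parameters are hypothetical; the timestamp layout `YYYYMMDD_HHMMSS` and the `.csv` extension are inferred from the example above.

```python
from datetime import datetime


def build_output_filename(institution: str, quarter: str, year: int,
                          user: str, timestamp: datetime) -> str:
    """Assemble {Institution}_{Quarter}_{Year}_PureETL_{User}_{Timestamp}.csv."""
    stamp = timestamp.strftime("%Y%m%d_%H%M%S")
    return f"{institution}_{quarter}_{year}_PureETL_{user}_{stamp}.csv"


if __name__ == "__main__":
    name = build_output_filename("JPMorgan", "Q1", 2025, "JohnSmith",
                                 datetime(2025, 5, 26, 14, 30, 22))
    print(name)  # JPMorgan_Q1_2025_PureETL_JohnSmith_20250526_143022.csv
```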
License
MIT License - see LICENSE file for details.
Contributing
This is a pure ETL pipeline focused on data engineering principles. Contributions should maintain the separation between extraction and analysis.
Guidelines
- Keep ETL layer free of analytical assumptions
- Maintain raw data extraction focus
- Preserve downstream analysis flexibility
- Follow data engineering best practices
Support
- Issues: GitHub Issues
- Documentation: GitHub Wiki
- Email: etl-team@bankofengland.co.uk
Philosophy: Extract, Don't Analyze
This pipeline embodies the principle that ETL should extract and structure data, not make analytical decisions. Analysis belongs in separate, specialized pipelines that can evolve independently.
Pure ETL = Maximum Flexibility for Downstream Analysis
Version: 1.0.0
Author: Bank of England ETL Team
Repository: https://github.com/daleparr/boe-etl
File details
Details for the file boe_etl-1.0.0.tar.gz.
File metadata
- Download URL: boe_etl-1.0.0.tar.gz
- Upload date:
- Size: 74.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e7e49fb370fb0589310e2eaa59b1ab43140400e6cc0f28f8d58868e7e15f7f6c |
| MD5 | 45c499f964710a52cc2bd7d408131f56 |
| BLAKE2b-256 | 98a7375f07511b4f10eafded931c73e3dda724c500ec75e4903cf4058e8494d8 |
File details
Details for the file boe_etl-1.0.0-py3-none-any.whl.
File metadata
- Download URL: boe_etl-1.0.0-py3-none-any.whl
- Upload date:
- Size: 89.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3e31c7c99cd3572206e942cd2a67712ef5d54d4c165b022c824df282e1ed82df |
| MD5 | c1ef8d9eb7bddb045d3420ee6c403a23 |
| BLAKE2b-256 | 4e4303e8a637a54f0d214f9451161f82219b9819b9436cde5c6ce1155a9ffe1e |