Pure ETL pipeline for financial document processing - extracts data without analytical assumptions

🔧 Pure Financial Document ETL Pipeline

A data-engineering-focused ETL pipeline that extracts structured data from unstructured financial documents (PDFs, spreadsheets, text documents) without making analytical assumptions or classifications.

🎯 Pure ETL Philosophy

This pipeline follows the separation-of-concerns principle:

ETL Layer: Extract → Structure → Export (no analysis)
Analysis Layer: Separate downstream processing for ML/NLP

📦 Installation

From PyPI

pip install boe-etl

From Source

git clone https://github.com/daleparr/boe-etl.git
cd boe-etl
pip install -e .

With Optional Dependencies

# For web frontend
pip install boe-etl[frontend]

# For development
pip install boe-etl[dev]

# All dependencies
pip install boe-etl[all]

🚀 Quick Start

Installation

pip install boe-etl

Launch Web Interface

boe-etl frontend

Access Application

Open your browser to: http://localhost:8501

📊 What This ETL Does

✅ Raw Data Extraction

PDF Documents: Text extraction from earnings reports
Excel Files: Multi-sheet data extraction
Text Files: Direct text processing
Sentence Segmentation: Clean, structured sentences
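
The README does not say how sentences are segmented. As a rough illustration only, the sketch below assumes a naive regex split on terminal punctuation; `segment_sentences` is a hypothetical helper, not part of the boe-etl API.

```python
import re

# Assumed heuristic: split on whitespace that follows ., ! or ? and precedes
# an uppercase letter, a digit, or an opening quote.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'])')


def segment_sentences(text: str) -> list[str]:
    """Return cleaned sentences from raw extracted text (hypothetical helper)."""
    # Collapse line breaks and repeated spaces left over from extraction.
    normalized = re.sub(r"\s+", " ", text).strip()
    if not normalized:
        return []
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(normalized) if s.strip()]


sentences = segment_sentences(
    "We reported revenue of $2.5 billion this quarter.\nCosts fell 3%.  Outlook is stable."
)
```

The decimal point in `$2.5` is not treated as a boundary because no whitespace follows it. Abbreviations such as "Inc." would still trip this heuristic, so a production pipeline would likely need something more robust.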

✅ Raw Feature Extraction

all_financial_terms: Financial vocabulary found (no classification)
financial_figures: Numbers and amounts extracted (no interpretation)
temporal_indicators: Time-related language (no actual/projection labels)
speaker_raw: Speaker patterns identified (no role classification)

✅ Factual Boolean Flags

has_financial_terms: Terms present or not
has_financial_figures: Figures present or not
has_temporal_language: Temporal words present or not
has_speaker_identified: Speaker pattern found or not

🚫 What This ETL Does NOT Do

❌ No Analytical Assumptions

No topic classification (Revenue & Growth, Risk Management, etc.)
No actual vs projection classification
No financial content relevance scoring
No speaker role interpretation (CEO, CFO, Analyst)

❌ No Machine Learning

No topic modeling
No sentiment analysis
No content classification
No predictive features

📈 Output Schema

Core Fields

source_file,institution,quarter,sentence_id,speaker_raw,text,source_type

Raw Extraction Fields

all_financial_terms,financial_figures,financial_figures_text,temporal_indicators

Factual Flags

has_financial_terms,has_financial_figures,has_temporal_language,has_speaker_identified

Metadata

word_count,char_count,processing_date,extraction_timestamp
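
To make the schema concrete, here is a minimal sketch that writes rows with these columns using Python's standard `csv` module. The field names come from the lists above; the order of the four groups, the helper name `rows_to_csv`, and the empty-string fill value are assumptions for illustration.

```python
import csv
import io

CORE_FIELDS = ["source_file", "institution", "quarter", "sentence_id",
               "speaker_raw", "text", "source_type"]
RAW_FIELDS = ["all_financial_terms", "financial_figures",
              "financial_figures_text", "temporal_indicators"]
FLAG_FIELDS = ["has_financial_terms", "has_financial_figures",
               "has_temporal_language", "has_speaker_identified"]
METADATA_FIELDS = ["word_count", "char_count", "processing_date",
                   "extraction_timestamp"]
# Assumed column order: the four groups in the order they are listed above.
ALL_FIELDS = CORE_FIELDS + RAW_FIELDS + FLAG_FIELDS + METADATA_FIELDS


def rows_to_csv(rows):
    """Serialize row dicts to CSV text with a fixed header (hypothetical helper)."""
    buffer = io.StringIO()
    # extrasaction="raise" rejects keys outside the schema;
    # restval fills keys that a row does not provide.
    writer = csv.DictWriter(buffer, fieldnames=ALL_FIELDS, restval="",
                            extrasaction="raise")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv([{
    "source_file": "example.pdf",
    "sentence_id": 1,
    "text": "We reported revenue of $2.5 billion this quarter, up 15% from last year.",
    "all_financial_terms": "revenue",
    "has_financial_terms": True,
}])
```

Rejecting unknown keys keeps analytical columns (for example a topic label) from slipping into the ETL output by accident.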

🔄 Example Processing

Input Document

"We reported revenue of $2.5 billion this quarter, up 15% from last year."

Pure ETL Output

all_financial_terms: "revenue"
financial_figures: "$2.5 billion|15%"
temporal_indicators: "reported|this quarter|last year"
has_financial_terms: True
has_financial_figures: True
has_temporal_language: True
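
The output above can be reproduced with simple pattern matching. The sketch below is an illustration under stated assumptions, not the package's implementation: the vocabularies `FINANCIAL_TERMS` and `TEMPORAL_PHRASES`, the regex `FIGURE_PATTERN`, and the function `extract_raw_features` are all hypothetical.

```python
import re

# Hypothetical vocabularies; the package's real term lists are not shown here.
FINANCIAL_TERMS = ["revenue", "profit", "margin", "capital"]
TEMPORAL_PHRASES = ["reported", "this quarter", "last year", "next year", "expect"]

# Dollar amounts with an optional scale word, or bare percentages.
FIGURE_PATTERN = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:thousand|million|billion|trillion))?"
    r"|\d+(?:\.\d+)?%",
    re.IGNORECASE,
)


def _find_phrases(text, phrases):
    """Return phrases found in text as whole words, in order of appearance."""
    hits = []
    for phrase in phrases:
        match = re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE)
        if match:
            hits.append((match.start(), phrase))
    return [phrase for _, phrase in sorted(hits)]


def extract_raw_features(sentence):
    """Collect raw matches and presence flags; no labels or scores are assigned."""
    terms = _find_phrases(sentence, FINANCIAL_TERMS)
    figures = FIGURE_PATTERN.findall(sentence)
    temporal = _find_phrases(sentence, TEMPORAL_PHRASES)
    return {
        "all_financial_terms": "|".join(terms),
        "financial_figures": "|".join(figures),
        "temporal_indicators": "|".join(temporal),
        "has_financial_terms": bool(terms),
        "has_financial_figures": bool(figures),
        "has_temporal_language": bool(temporal),
    }


features = extract_raw_features(
    "We reported revenue of $2.5 billion this quarter, up 15% from last year."
)
```

Each flag is derived purely from whether the corresponding list is empty, which keeps the flags factual rather than interpretive.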

No Analysis Applied

❌ No classification as "actual" vs "projection"
❌ No topic assignment like "Revenue & Growth"
❌ No sentiment or relevance scoring

🏗️ Architecture

Pure ETL Pipeline

Documents → Extract Text → Segment Sentences → Extract Raw Features → Export CSV

Downstream Analysis (Separate)

Raw CSV → Topic Modeling → Classification → Feature Engineering → ML Models
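
As an illustration of the hand-off between the two layers, the sketch below shows how a downstream job might load the raw CSV and unpack its fields before any analysis. It assumes multi-value fields are pipe-delimited and booleans are serialized as `True`/`False`, as in the example output; `load_for_analysis` and the sample data are hypothetical.

```python
import csv
import io

# Tiny sample with a subset of the output columns (illustrative data only).
SAMPLE_CSV = (
    "sentence_id,text,financial_figures,has_financial_figures\n"
    '1,"We reported revenue of $2.5 billion this quarter, up 15% from last year.",'
    "$2.5 billion|15%,True\n"
    "2,Thank you all for joining.,,False\n"
)


def load_for_analysis(csv_text):
    """Parse raw ETL output into Python types for a separate analysis step."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Split the pipe-delimited field; an empty cell becomes an empty list.
        row["financial_figures"] = [v for v in row["financial_figures"].split("|") if v]
        row["has_financial_figures"] = row["has_financial_figures"] == "True"
        rows.append(row)
    return rows


rows = load_for_analysis(SAMPLE_CSV)
```

Topic modeling, classification, and feature engineering would start from `rows`, entirely outside the ETL package.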

🛠️ Technical Features

✅ Missing Value Handling

No null values in output
Consistent defaults for all fields
NLP-ready datasets
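
The README does not list the default values it uses. The sketch below shows one way the "no null values" guarantee could be enforced; the `FIELD_DEFAULTS` mapping (including the `"UNKNOWN"` speaker placeholder) and the `fill_missing` helper are assumptions for illustration.

```python
import math

# Hypothetical defaults, chosen for illustration only.
FIELD_DEFAULTS = {
    "speaker_raw": "UNKNOWN",
    "all_financial_terms": "",
    "financial_figures": "",
    "temporal_indicators": "",
    "has_financial_terms": False,
    "has_financial_figures": False,
    "word_count": 0,
}


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(record, defaults=FIELD_DEFAULTS):
    """Return a copy of record with every defaulted field present and non-null."""
    filled = dict(record)
    for field, default in defaults.items():
        if _is_missing(filled.get(field)):
            filled[field] = default
    return filled


clean = fill_missing({"text": "Hi", "speaker_raw": None, "word_count": float("nan")})
```

Applying the same mapping to every record is what makes the defaults consistent across files and institutions.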

✅ Multi-Format Support

PDF processing with PyPDF2
Excel processing with openpyxl
Text file processing
Graceful error handling
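
A minimal sketch of format dispatch with graceful error handling follows. It uses the documented `PdfReader` and `load_workbook` calls from PyPDF2 and openpyxl, but the helper names, the extension mapping, and the `(text, error)` return convention are assumptions and may differ from the package's own parsers.

```python
from pathlib import Path


def _read_pdf(path):
    from PyPDF2 import PdfReader  # lazy import: only needed for PDF input
    reader = PdfReader(str(path))
    # extract_text() may return an empty value for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def _read_excel(path):
    from openpyxl import load_workbook  # lazy import: only needed for Excel input
    workbook = load_workbook(str(path), read_only=True, data_only=True)
    lines = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            cells = [str(cell) for cell in row if cell is not None]
            if cells:
                lines.append(f"{sheet.title}: " + " ".join(cells))
    workbook.close()
    return "\n".join(lines)


def _read_text(path):
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Assumed mapping from file extension to reader.
READERS = {".pdf": _read_pdf, ".xlsx": _read_excel, ".txt": _read_text}


def extract_text(path):
    """Return (text, error); error is None on success."""
    path = Path(path)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        return "", f"unsupported file type: {path.suffix or '(none)'}"
    try:
        return reader(path), None
    except Exception as exc:  # report the failure instead of aborting a batch
        return "", f"{type(exc).__name__}: {exc}"
```

Returning an error string rather than raising lets a multi-file run continue and record which documents failed.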

✅ Web Interface

Drag-and-drop file upload
Multi-institution processing
Progress tracking
Processing history
CSV download

📚 Use Cases

Financial Institutions

Earnings call transcript processing
Financial report data extraction
Regulatory document structuring
Multi-quarter analysis preparation

Research & Analytics

Academic financial research
Market analysis preparation
NLP model training data
Time series analysis datasets

Compliance & Audit

Document processing audit trails
Structured data for compliance
Historical document analysis
Regulatory reporting preparation

🔧 Development

Project Structure

boe-etl/
├── boe_etl/
│   ├── core.py               # Main ETL pipeline
│   ├── parsers/              # Document parsers
│   ├── schema.py             # Data standardization
│   └── frontend.py           # Web interface
├── setup.py                  # Package configuration
├── requirements.txt          # Dependencies
└── README.md                 # This file

Dependencies

streamlit: Web interface framework
pandas: Data manipulation
PyPDF2: PDF text extraction
openpyxl: Excel file processing

🏷️ Naming Convention

Output files follow the standardized format:

{Institution}_{Quarter}_{Year}_PureETL_{User}_{Timestamp}.csv

Example: JPMorgan_Q1_2025_PureETL_JohnSmith_20250526_143022.csv
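
The convention can be sketched as a small helper. The `YYYYMMDD_HHMMSS` timestamp layout is inferred from the example file name, and `build_output_filename` is a hypothetical function, not a documented part of the package.

```python
from datetime import datetime


def build_output_filename(institution, quarter, year, user, timestamp=None):
    """Build an output file name following the convention above (hypothetical helper)."""
    timestamp = timestamp or datetime.now()
    # Assumed layout, inferred from the example: date, underscore, time.
    stamp = timestamp.strftime("%Y%m%d_%H%M%S")
    return f"{institution}_{quarter}_{year}_PureETL_{user}_{stamp}.csv"


name = build_output_filename(
    "JPMorgan", "Q1", 2025, "JohnSmith", datetime(2025, 5, 26, 14, 30, 22)
)
```

With the arguments shown, `name` matches the example file name above.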

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

This is a pure ETL pipeline focused on data engineering principles. Contributions should maintain the separation between extraction and analysis.

Guidelines

Keep ETL layer free of analytical assumptions
Maintain raw data extraction focus
Preserve downstream analysis flexibility
Follow data engineering best practices

🆘 Support

Issues: GitHub Issues
Documentation: GitHub Wiki
Email: etl-team@bankofengland.co.uk

🎯 Philosophy: Extract, Don't Analyze

This pipeline embodies the principle that ETL should extract and structure data, not make analytical decisions. Analysis belongs in separate, specialized pipelines that can evolve independently.

Pure ETL = Maximum Flexibility for Downstream Analysis 🔧

Version: 1.0.0
Author: Bank of England ETL Team
Repository: https://github.com/daleparr/boe-etl

Release date: May 29, 2025 (version 1.0.0)
