Pure ETL pipeline for financial document processing - extracts data without analytical assumptions

🔧 Pure Financial Document ETL Pipeline

A data-engineering-focused ETL pipeline that extracts data from unstructured financial documents (PDFs, spreadsheets, text documents) without making analytical assumptions or classifications.

🎯 Pure ETL Philosophy

This pipeline follows separation of concerns principles:

  • ETL Layer: Extract → Structure → Export (no analysis)
  • Analysis Layer: Separate downstream processing for ML/NLP

📦 Installation

From PyPI

pip install boe-etl

From Source

git clone https://github.com/daleparr/boe-etl.git
cd boe-etl
pip install -e .

With Optional Dependencies

# For web frontend
pip install boe-etl[frontend]

# For development
pip install boe-etl[dev]

# All dependencies
pip install boe-etl[all]

🚀 Quick Start

Installation

pip install boe-etl

Launch Web Interface

boe-etl frontend

Access Application

Open your browser to: http://localhost:8501

📊 What This ETL Does

✅ Raw Data Extraction

  • PDF Documents: Text extraction from earnings reports
  • Excel Files: Multi-sheet data extraction
  • Text Files: Direct text processing
  • Sentence Segmentation: Clean, structured sentences

✅ Raw Feature Extraction

  • all_financial_terms: Financial vocabulary found (no classification)
  • financial_figures: Numbers and amounts extracted (no interpretation)
  • temporal_indicators: Time-related language (no actual/projection labels)
  • speaker_raw: Speaker patterns identified (no role classification)

✅ Factual Boolean Flags

  • has_financial_terms: Terms present or not
  • has_financial_figures: Figures present or not
  • has_temporal_language: Temporal words present or not
  • has_speaker_identified: Speaker pattern found or not

🚫 What This ETL Does NOT Do

❌ No Analytical Assumptions

  • No topic classification (Revenue & Growth, Risk Management, etc.)
  • No actual vs projection classification
  • No financial content relevance scoring
  • No speaker role interpretation (CEO, CFO, Analyst)

❌ No Machine Learning

  • No topic modeling
  • No sentiment analysis
  • No content classification
  • No predictive features

📈 Output Schema

Core Fields

source_file,institution,quarter,sentence_id,speaker_raw,text,source_type

Raw Extraction Fields

all_financial_terms,financial_figures,financial_figures_text,temporal_indicators

Factual Flags

has_financial_terms,has_financial_figures,has_temporal_language,has_speaker_identified

Metadata

word_count,char_count,processing_date,extraction_timestamp
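The four field groups above can be combined into a single CSV header. The following is a minimal sketch using only the standard library; the column order, the `write_rows` helper, and the empty-string fallback for missing keys are assumptions for illustration, not the package's confirmed behaviour.

```python
import csv
import io

# Field groups copied from the schema above.
CORE_FIELDS = ["source_file", "institution", "quarter", "sentence_id",
               "speaker_raw", "text", "source_type"]
RAW_FIELDS = ["all_financial_terms", "financial_figures",
              "financial_figures_text", "temporal_indicators"]
FLAG_FIELDS = ["has_financial_terms", "has_financial_figures",
               "has_temporal_language", "has_speaker_identified"]
METADATA_FIELDS = ["word_count", "char_count", "processing_date",
                   "extraction_timestamp"]

# Assumption: groups are concatenated in the order they are documented.
ALL_FIELDS = CORE_FIELDS + RAW_FIELDS + FLAG_FIELDS + METADATA_FIELDS


def write_rows(rows, handle):
    """Write dict rows as CSV (hypothetical helper).

    Keys missing from a row are written as empty strings; keys that are
    not in the schema raise ValueError (csv.DictWriter's default).
    """
    writer = csv.DictWriter(handle, fieldnames=ALL_FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_rows([{"source_file": "report.pdf", "sentence_id": 1, "text": "Hello."}], buffer)
header = buffer.getvalue().splitlines()[0].split(",")
```

Rejecting unknown keys keeps analytical columns from slipping into the ETL output unnoticed.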

🔄 Example Processing

Input Document

"We reported revenue of $2.5 billion this quarter, up 15% from last year."

Pure ETL Output

all_financial_terms: "revenue"
financial_figures: "$2.5 billion|15%"
temporal_indicators: "reported|this quarter|last year"
has_financial_terms: True
has_financial_figures: True
has_temporal_language: True

No Analysis Applied

  • ❌ No classification as "actual" vs "projection"
  • ❌ No topic assignment like "Revenue & Growth"
  • ❌ No sentiment or relevance scoring
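The example above can be approximated with plain pattern matching. This is a sketch only: the vocabularies `FINANCIAL_TERMS` and `TEMPORAL_PHRASES`, the regular expression, and the function names are hypothetical stand-ins, since the package's actual term lists and patterns are not shown here. It illustrates that the output records what was found and assigns no labels.

```python
import re

# Hypothetical vocabularies for illustration only.
FINANCIAL_TERMS = ("revenue", "profit", "margin", "capital")
TEMPORAL_PHRASES = ("reported", "this quarter", "last year", "next year")

# Dollar amounts with an optional scale word, or percentages.
FIGURE_RE = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion|trillion))?|\d+(?:\.\d+)?%"
)


def find_phrases(text, phrases):
    """Return the phrases present in text, ordered by first occurrence."""
    lowered = text.lower()
    hits = []
    for phrase in phrases:
        match = re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
        if match:
            hits.append((match.start(), phrase))
    return [phrase for _, phrase in sorted(hits)]


def extract_raw_features(sentence):
    """Collect raw matches and presence flags; no interpretation is applied."""
    terms = find_phrases(sentence, FINANCIAL_TERMS)
    figures = FIGURE_RE.findall(sentence)
    temporal = find_phrases(sentence, TEMPORAL_PHRASES)
    return {
        "all_financial_terms": "|".join(terms),
        "financial_figures": "|".join(figures),
        "temporal_indicators": "|".join(temporal),
        "has_financial_terms": bool(terms),
        "has_financial_figures": bool(figures),
        "has_temporal_language": bool(temporal),
    }


features = extract_raw_features(
    "We reported revenue of $2.5 billion this quarter, up 15% from last year."
)
```

With these assumed vocabularies, `features` holds the same six values as the Pure ETL Output listing above.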

🏗️ Architecture

Pure ETL Pipeline

Documents → Extract Text → Segment Sentences → Extract Raw Features → Export CSV
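The segmentation step in this flow might look like the sketch below. The naive regex splitter and the `run_pipeline` function are assumptions made for illustration; the package may well use a different segmentation strategy. Only a subset of the schema fields is populated here.

```python
import re
from datetime import datetime, timezone


def segment_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace
    and a capital letter or quote. Decimals such as 2.5 are left intact."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z\"'])", text.strip())
    return [part.strip() for part in parts if part.strip()]


def run_pipeline(text, source_file, institution, quarter):
    """Turn already-extracted text into one row per sentence (hypothetical)."""
    stamp = datetime.now(timezone.utc).isoformat()
    rows = []
    for index, sentence in enumerate(segment_sentences(text), start=1):
        rows.append({
            "source_file": source_file,
            "institution": institution,
            "quarter": quarter,
            "sentence_id": index,  # assumption: 1-based integer IDs
            "text": sentence,
            "word_count": len(sentence.split()),
            "char_count": len(sentence),
            "extraction_timestamp": stamp,
        })
    return rows


rows = run_pipeline(
    "Revenue rose 15%. Costs fell. We expect growth.",
    source_file="example.txt",
    institution="ExampleBank",
    quarter="Q1_2025",
)
```

Each stage takes plain data and returns plain data, so any step can be swapped without touching the others.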

Downstream Analysis (Separate)

Raw CSV → Topic Modeling → Classification → Feature Engineering → ML Models
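A downstream consumer would typically start by loading the raw CSV and decoding its fields. The sketch below assumes that multi-valued fields are pipe-delimited (as in the example output) and that flags are serialized as the strings `True` and `False`; both the serialization and the `load_raw_rows` helper are assumptions for illustration.

```python
import csv
import io

LIST_FIELDS = ("all_financial_terms", "financial_figures", "temporal_indicators")
FLAG_FIELDS = ("has_financial_terms", "has_financial_figures",
               "has_temporal_language", "has_speaker_identified")


def load_raw_rows(handle):
    """Read ETL output and decode list and flag columns (hypothetical helper)."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for field in LIST_FIELDS:
            value = row.get(field) or ""
            row[field] = [item for item in value.split("|") if item]
        for field in FLAG_FIELDS:
            # Assumption: flags are written as "True"/"False" text.
            row[field] = (row.get(field) or "").strip().lower() == "true"
        rows.append(row)
    return rows


sample = io.StringIO(
    "text,financial_figures,has_financial_figures\n"
    "Revenue was $2.5 billion up 15%.,$2.5 billion|15%,True\n"
    "Good morning.,,False\n"
)
loaded = load_raw_rows(sample)
with_figures = [row for row in loaded if row["has_financial_figures"]]
```

Any classification or modeling then operates on `loaded`, leaving the ETL output itself unchanged.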

🛠️ Technical Features

✅ Missing Value Handling

  • No null values in output
  • Consistent defaults for all fields
  • NLP-ready datasets
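One way to guarantee null-free output is to pass every row through a default-filling step before export. The sentinel `"NONE"`, the per-type defaults, and the helper names below are hypothetical; the actual default values used by the package are not documented here.

```python
import math

TEXT_DEFAULT = "NONE"  # hypothetical sentinel for missing text values


def default_for(field):
    """Pick a default by field kind (assumed convention)."""
    if field.startswith("has_"):
        return False
    if field in ("word_count", "char_count"):
        return 0
    return TEXT_DEFAULT


def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def fill_defaults(row, fields):
    """Return a new row containing every field, with no missing values."""
    return {
        field: default_for(field) if is_missing(row.get(field)) else row[field]
        for field in fields
    }


filled = fill_defaults(
    {"speaker_raw": None, "word_count": float("nan"), "text": "Hi", "financial_figures": "  "},
    ["speaker_raw", "word_count", "has_speaker_identified", "text", "financial_figures"],
)
```

Because the function iterates over the schema rather than the row, absent keys are filled as well as null ones.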

✅ Multi-Format Support

  • PDF processing with PyPDF2
  • Excel processing with openpyxl
  • Text file processing
  • Graceful error handling
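Format dispatch with graceful error handling could be sketched as follows. The function names, the extension map, and the `(text, error)` return convention are assumptions for illustration; only `PdfReader`, `extract_text`, `load_workbook`, and `iter_rows` are real PyPDF2 and openpyxl calls. The third-party imports are deferred so that plain-text processing works without them installed.

```python
from pathlib import Path


def extract_pdf(path):
    from PyPDF2 import PdfReader  # imported lazily
    reader = PdfReader(str(path))
    # extract_text() can return None or "" for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_excel(path):
    from openpyxl import load_workbook  # imported lazily
    workbook = load_workbook(str(path), read_only=True, data_only=True)
    lines = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            cells = [str(cell) for cell in row if cell is not None]
            if cells:
                lines.append(" ".join(cells))
    workbook.close()
    return "\n".join(lines)


def extract_text_file(path):
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical extension map.
EXTRACTORS = {".pdf": extract_pdf, ".xlsx": extract_excel, ".txt": extract_text_file}


def extract_text(path):
    """Return (text, error). error is None on success; failures never raise."""
    path = Path(path)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        return "", f"unsupported file type: {path.suffix or '(none)'}"
    try:
        return extractor(path), None
    except Exception as exc:  # report the failure instead of aborting a batch
        return "", f"{type(exc).__name__}: {exc}"
```

Returning the error as data lets a batch run continue past one corrupt file and report it afterwards.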

✅ Web Interface

  • Drag-and-drop file upload
  • Multi-institution processing
  • Progress tracking
  • Processing history
  • CSV download

📚 Use Cases

Financial Institutions

  • Earnings call transcript processing
  • Financial report data extraction
  • Regulatory document structuring
  • Multi-quarter analysis preparation

Research & Analytics

  • Academic financial research
  • Market analysis preparation
  • NLP model training data
  • Time series analysis datasets

Compliance & Audit

  • Document processing audit trails
  • Structured data for compliance
  • Historical document analysis
  • Regulatory reporting preparation

🔧 Development

Project Structure

boe-etl/
├── boe_etl/
│   ├── core.py                # Main ETL pipeline
│   ├── parsers/               # Document parsers
│   ├── schema.py              # Data standardization
│   └── frontend.py            # Web interface
├── setup.py                   # Package configuration
├── requirements.txt           # Dependencies
└── README.md                  # This file

Dependencies

  • streamlit: Web interface framework
  • pandas: Data manipulation
  • PyPDF2: PDF text extraction
  • openpyxl: Excel file processing

🏷️ Naming Convention

Output files follow the standardized format:

{Institution}_{Quarter}_{Year}_PureETL_{User}_{Timestamp}

Example: JPMorgan_Q1_2025_PureETL_JohnSmith_20250526_143022.csv
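A filename in this format can be assembled as in the sketch below. The `build_output_name` function and the rule of stripping non-alphanumeric characters from each part are assumptions; the README specifies only the overall pattern and the example.

```python
import re
from datetime import datetime


def build_output_name(institution, quarter, year, user, when):
    """Build {Institution}_{Quarter}_{Year}_PureETL_{User}_{Timestamp}.csv."""

    def clean(part):
        # Assumption: drop spaces and punctuation so "_" stays a reliable separator.
        return re.sub(r"[^A-Za-z0-9]", "", part)

    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{clean(institution)}_{clean(quarter)}_{year}_PureETL_{clean(user)}_{stamp}.csv"


name = build_output_name("JPMorgan", "Q1", 2025, "John Smith", datetime(2025, 5, 26, 14, 30, 22))
```

With these inputs, `name` equals the example filename shown above.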

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

This is a pure ETL pipeline focused on data engineering principles. Contributions should maintain the separation between extraction and analysis.

Guidelines

  • Keep ETL layer free of analytical assumptions
  • Maintain raw data extraction focus
  • Preserve downstream analysis flexibility
  • Follow data engineering best practices

🆘 Support

🎯 Philosophy: Extract, Don't Analyze

This pipeline embodies the principle that ETL should extract and structure data, not make analytical decisions. Analysis belongs in separate, specialized pipelines that can evolve independently.

Pure ETL = Maximum Flexibility for Downstream Analysis 🔧


Version: 1.0.0
Author: Bank of England ETL Team
Repository: https://github.com/daleparr/boe-etl
