Extract quarterly EPS estimates from FactSet Earnings Insight reports using OCR
EPS Estimates Collector
A unified Python package for extracting quarterly EPS (Earnings Per Share) estimates from FactSet Earnings Insight reports using OCR and image processing techniques.
⚠️ Disclaimer: This package is for educational and research purposes only. For production use, please use FactSet's official API. This package processes publicly available PDF reports and is not affiliated with or endorsed by FactSet.
Overview
This project processes chart images containing S&P 500 quarterly EPS data and extracts quarter labels (e.g., Q1'14, Q2'15) and corresponding EPS values. The extracted data is saved in CSV format for further analysis.
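As a rough illustration of the quarter labels mentioned above, the sketch below parses labels such as Q1'14 into a sortable (year, quarter) pair. It is not the package's own parser.py; the names `QUARTER_RE` and `parse_quarter` are hypothetical, and mapping two-digit years to the 2000s is an assumption.

```python
import re

# Matches labels such as Q1'14 or Q4'13. The typographic apostrophe is also
# accepted because OCR engines sometimes emit it instead of the ASCII one.
QUARTER_RE = re.compile(r"Q([1-4])['\u2019](\d{2})")


def parse_quarter(label: str) -> tuple[int, int]:
    """Return (year, quarter) for a label like "Q1'14"; raise ValueError otherwise."""
    match = QUARTER_RE.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"not a quarter label: {label!r}")
    quarter = int(match.group(1))
    two_digit_year = int(match.group(2))
    # Assumption: every label refers to a year in the 2000s.
    return 2000 + two_digit_year, quarter


labels = ["Q2'15", "Q4'13", "Q1'14"]
# Sorting by (year, quarter) puts the labels in chronological order.
ordered = sorted(labels, key=parse_quarter)
print(ordered)
```

Sorting on the parsed tuple rather than on the raw string avoids the problem that "Q1'14" sorts before "Q4'13" alphabetically.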
Motivation
Financial data providers (FactSet, Bloomberg, Investing.com, etc.) typically publish historical EPS data as actual values: once a quarter's earnings are reported, the estimate is overwritten with the actual figure. This creates a challenge for backtesting predictive models. Using such historical data means testing against information that was already reflected in stock prices at the time, which makes it difficult to evaluate the true predictive power of EPS estimates.
To address this, this project extracts point-in-time EPS estimates from historical FactSet Earnings Insight reports. By preserving the estimates as they appeared at each report date (before actual earnings were announced), a dataset can be built that accurately reflects what was known and expected at each point in time, enabling more meaningful backtesting and predictive analysis.
Project Structure
eps-estimates-collector/
├── src/eps_estimates_collector/
│   ├── core/                        # Data collection
│   │   ├── downloader.py            # PDF download
│   │   ├── extractor.py             # Chart extraction
│   │   └── ocr/                     # OCR processing
│   │       ├── processor.py         # Main pipeline
│   │       ├── google_vision_processor.py
│   │       ├── parser.py
│   │       ├── bar_classifier.py
│   │       └── coordinate_matcher.py
│   ├── analysis/                    # P/E ratio calculation
│   │   └── pe_ratio.py
│   └── utils/                       # Cloud storage
│       ├── cloudflare.py            # R2 operations
│       └── csv_storage.py           # CSV I/O
├── scripts/data_collection/         # CLI scripts
├── actions/workflow.py              # GitHub Actions
└── pyproject.toml
Installation
Option 1: Install from Git (Recommended)
# Install with uv
uv pip install git+https://github.com/seung-gu/eps-estimates-collector.git
# Or with pip
pip install git+https://github.com/seung-gu/eps-estimates-collector.git
Option 2: Local Development
# Clone repository
git clone https://github.com/seung-gu/eps-estimates-collector.git
cd eps-estimates-collector
# Install with uv
uv sync
# Or install in editable mode
uv pip install -e .
Requirements
- Google Cloud Vision API (Required):
  - Create a service account and download the JSON key
  - Set the GOOGLE_APPLICATION_CREDENTIALS environment variable (see the Setup Guide)
- Cloudflare R2 (Optional, CI/CD only):
  - For the GitHub Actions workflow only
  - Install: uv sync --extra r2
Usage
Python API
from eps_estimates_collector import calculate_pe_ratio
# Calculate P/E ratios (auto-loads CSV from public URL)
pe_df = calculate_pe_ratio(
    price_data={'2024-01-15': 150.5, '2024-02-15': 152.3},
    type='forward'
)
print(pe_df)
P/E Types:
- forward: Q[1:5] - Next 4 quarters (skip current)
- mix: Q[0:4] - Current + next 3 quarters
- trailing-like: Q[-3:1] - Last 3 + current quarter
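The three windows can be read as end-exclusive offsets from the current quarter (offset 0). The sketch below illustrates that reading only. It assumes the P/E ratio is the price divided by the sum of the four selected quarterly EPS values, which the text above does not state explicitly; `WINDOWS`, `window_eps`, and the sample numbers are hypothetical.

```python
# Offsets relative to the current quarter (offset 0), end-exclusive like the
# Q[start:stop] notation used for the P/E types.
WINDOWS = {
    "forward": (1, 5),         # next 4 quarters, skipping the current one
    "mix": (0, 4),             # current quarter plus the next 3
    "trailing-like": (-3, 1),  # last 3 quarters plus the current one
}


def window_eps(eps_by_offset: dict[int, float], pe_type: str) -> float:
    """Sum the four quarterly EPS values selected by the given P/E type."""
    start, stop = WINDOWS[pe_type]
    return sum(eps_by_offset[offset] for offset in range(start, stop))


# Made-up quarterly EPS values keyed by offset from the current quarter.
eps = {-3: 50.0, -2: 51.0, -1: 52.0, 0: 53.0, 1: 54.0, 2: 55.0, 3: 56.0, 4: 57.0}
price = 4440.0  # made-up index level

for pe_type in WINDOWS:
    total = window_eps(eps, pe_type)
    # Assumption: P/E = price / sum of the four selected quarterly EPS values.
    print(pe_type, total, round(price / total, 2))
```

Every window covers exactly four quarters, so the three types differ only in how far the window is shifted relative to the report date.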
Architecture
Overview
┌─────────────────────────────────────────────────────────────────┐
│ 📦 Storage Structure │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 📦 Public Bucket (R2_PUBLIC_BUCKET_NAME) │
│ ├── extracted_estimates.csv ← Public URL (no auth) │
│ └── extracted_estimates_confidence.csv │
│ │
│ 🔒 Private Bucket (R2_BUCKET_NAME) │
│ ├── reports/*.pdf ← API key required │
│ └── estimates/*.png ← API key required │
└─────────────────────────────────────────────────────────────────┘
User Flow 1: API Users (Read-only)
┌──────────────────────────────────────────────────────────────────┐
│ Python Script │
├──────────────────────────────────────────────────────────────────┤
│ │
│ from eps_estimates_collector import calculate_pe_ratio │
│ │
│ pe_df = calculate_pe_ratio( │
│ price_data={'2024-01-15': 150.5}, │
│ type='forward' │
│ ) │
│ │ │
│ ├─ read_csv_from_cloud("extracted_estimates.csv") │
│ │ │ │
│ │ └─ GET https://pub-xxx.r2.dev/extracted_estimates.csv │
│ │ ↑ │
│ │ └─ ✅ No API key needed (public URL) │
│ │ │
│ └─ Calculate P/E ratios → Return DataFrame │
└──────────────────────────────────────────────────────────────────┘
Features:
- ✅ No API keys required
- ✅ Always loads latest data
- ✅ No local files needed
User Flow 2: GitHub Actions Workflow (Read/Write)
┌─────────────────────────────────────────────────────────────────┐
│ Workflow Steps │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Check last date │
│ read_csv_from_cloud("extracted_estimates.csv") │
│ → GET public URL │
│ → Get last Report_Date │
│ │
│ Step 2: Download new PDFs │
│ download_pdfs(start_date=last_date) │
│ → FactSet website │
│ → Save to local (temp) │
│ │
│ Step 3: Extract charts │
│ extract_charts(pdfs) │
│ → PDF → PNG │
│ → Save to local (temp) │
│ │
│ Step 4: Process images │
│ process_images(directory) │
│ ├─ read_csv_from_cloud() ← Load existing CSV │
│ ├─ OCR processing │
│ ├─ Merge existing + new data │
│ └─ Return DataFrame (don't save locally) │
│ │
│ Step 5: Upload results │
│ ├─ write_csv_to_cloud(df, "extracted_estimates.csv") │
│ │ → PUT to public bucket (with API key) │
│ │ → Accessible via public URL │
│ │ │
│ └─ upload_to_cloud(pdfs/pngs) │
│ → PUT to private bucket (with API key) │
│ → Only accessible with API key │
└─────────────────────────────────────────────────────────────────┘
Features:
- ✅ Reads from public URL (existing data)
- ✅ Writes to public bucket (CSV) with API key
- ✅ Writes to private bucket (PDF/PNG) with API key
- ✅ Appends new data (no overwrite)
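The append-only behaviour listed above can be sketched as a merge keyed on Report_Date. This is a minimal illustration using plain dictionaries, not the package's implementation; `merge_estimates` is a hypothetical helper, and treating an existing report date as authoritative is an assumption drawn from the "no overwrite" note.

```python
def merge_estimates(existing: list[dict], new: list[dict]) -> list[dict]:
    """Append rows whose Report_Date is not already present, sorted by date."""
    seen = {row["Report_Date"] for row in existing}
    merged = list(existing)
    for row in new:
        if row["Report_Date"] not in seen:
            merged.append(row)
            seen.add(row["Report_Date"])
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(merged, key=lambda row: row["Report_Date"])


existing = [{"Report_Date": "2016-12-09", "Q1'14": 26.23}]
new = [
    {"Report_Date": "2016-12-09", "Q1'14": 99.99},  # date already stored: skipped
    {"Report_Date": "2016-12-16", "Q1'14": 26.25},  # new date: appended
]
merged = merge_estimates(existing, new)
for row in merged:
    print(row)
```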
Environment Variables
# API Users
# → No setup needed (public URL hardcoded)
# GitHub Actions Workflow
R2_BUCKET_NAME=factset-data # 🔒 Private bucket
R2_PUBLIC_BUCKET_NAME=factset-public # 📦 Public bucket
R2_ACCOUNT_ID=xxx
R2_ACCESS_KEY_ID=xxx
R2_SECRET_ACCESS_KEY=xxx
CI=true
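A workflow script might validate these variables before doing any work. The sketch below is one way to do that with the standard library. `load_r2_settings` is a hypothetical helper and is not part of the package; the endpoint format is Cloudflare's S3-compatible URL for an account.

```python
import os
from collections.abc import Mapping

REQUIRED = (
    "R2_BUCKET_NAME",
    "R2_PUBLIC_BUCKET_NAME",
    "R2_ACCOUNT_ID",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
)


def load_r2_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect the R2 settings, failing early if any variable is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    # Cloudflare R2 exposes an S3-compatible endpoint per account ID.
    settings["endpoint_url"] = f"https://{env['R2_ACCOUNT_ID']}.r2.cloudflarestorage.com"
    return settings


# Placeholder values for demonstration only.
demo_env = {
    "R2_BUCKET_NAME": "factset-data",
    "R2_PUBLIC_BUCKET_NAME": "factset-public",
    "R2_ACCOUNT_ID": "abc123",
    "R2_ACCESS_KEY_ID": "key",
    "R2_SECRET_ACCESS_KEY": "secret",
}
settings = load_r2_settings(demo_env)
print(settings["endpoint_url"])
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.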
Data Format
Main CSV (extracted_estimates.csv)
| Report_Date | Q4'13 | Q1'14 | Q2'14 | ... |
|---|---|---|---|---|
| 2016-12-09 | 24.89 | 26.23 | 27.45 | ... |
| 2016-12-16 | 24.89 | 26.25 | 27.48 | ... |
- Report_Date: FactSet report date (YYYY-MM-DD)
- Quarters: EPS estimates in dollars
- Public URL: https://pub-62707afd3ebb422aae744c63c49d36a0.r2.dev/extracted_estimates.csv
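Because each row is a snapshot of the estimates on one report date, a point-in-time lookup amounts to finding the latest row on or before a given date. The sketch below shows this with the standard library on an inline sample copied from the table above. It assumes the file is comma-separated with the header shown; `load_rows` and `estimates_as_of` are hypothetical helpers.

```python
import csv
import io
from bisect import bisect_right

# Inline sample mirroring the first rows and columns of the table above.
SAMPLE = """Report_Date,Q4'13,Q1'14,Q2'14
2016-12-09,24.89,26.23,27.45
2016-12-16,24.89,26.25,27.48
"""


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into rows sorted by Report_Date."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return sorted(rows, key=lambda row: row["Report_Date"])


def estimates_as_of(rows: list[dict[str, str]], date: str):
    """Return the latest report on or before `date`, or None if there is none."""
    dates = [row["Report_Date"] for row in rows]
    index = bisect_right(dates, date)
    return rows[index - 1] if index else None


rows = load_rows(SAMPLE)
snapshot = estimates_as_of(rows, "2016-12-12")
print(snapshot["Report_Date"], float(snapshot["Q1'14"]))
```

The last element of `rows` is also the most recent report, which is the value the workflow's "check last date" step needs.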
Confidence CSV
Same structure, contains OCR confidence scores (0-1).
API Reference
calculate_pe_ratio(price_data, type='forward', output_csv=None)
Calculate P/E ratios from EPS estimates.
Parameters:
- price_data (DataFrame | dict | None):
  - DataFrame: columns Date, Price
  - Dict: {'2024-01-15': 150.5, ...}
  - None: Returns template
- type (str): 'forward', 'mix', or 'trailing-like'
- output_csv (Path, optional): Save results
Returns: DataFrame with P/E ratios
Example:
from eps_estimates_collector import calculate_pe_ratio
pe_df = calculate_pe_ratio(
    price_data={'2024-01-15': 150.5},
    type='forward',
    output_csv='pe_ratios.csv'
)
GitHub Actions
Setup Secrets
Settings → Secrets → Actions:
GOOGLE_APPLICATION_CREDENTIALS_JSON
R2_BUCKET_NAME
R2_PUBLIC_BUCKET_NAME
R2_ACCOUNT_ID
R2_ACCESS_KEY_ID
R2_SECRET_ACCESS_KEY
Workflow
- Schedule: Every Monday 00:00 UTC
- Manual: GitHub Actions tab
- Steps:
  1. Check last report date (public URL)
  2. Download new PDFs
  3. Extract charts → Process with OCR
  4. Upload to cloud (PDFs/PNGs → private, CSVs → public)
Recent Updates
v0.3.0 (2025-11-19) - Cloud-First Architecture
- ✅ Cloud-first design: CSV data always from public URL
- ✅ Two-bucket strategy: Private (PDF/PNG) + Public (CSV)
- ✅ Simplified codebase: Removed local file logic
- ✅ Code cleanup: 45% reduction in csv_storage.py
- ✅ Better organization: Split functions by responsibility
- ✅ API-focused: Optimized for package users
v0.2.0 (2025-11-19)
- Unified package structure
- Code reduction (33%)
- P/E ratio calculation module
Technical Details
- OCR: Google Cloud Vision API (149 regions/image)
- Text Matching: Coordinate-based spatial algorithm
- Bar Classification: 3-method ensemble (100% agreement)
- Confidence Score: Bar classification (0.5) + consistency (0.5)
See DEVELOPMENT_LOG.md for detailed technical documentation.
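One plausible reading of the confidence score above is an equally weighted sum of two components. The sketch below implements that reading only. It assumes both components are already normalised to the range 0 to 1 and that the "(0.5)" figures are weights; `confidence_score` and its parameter names are hypothetical, and DEVELOPMENT_LOG.md remains the authority on the actual calculation.

```python
def confidence_score(bar_classification: float, consistency: float) -> float:
    """Combine two component scores in [0, 1] with equal weights of 0.5."""
    components = (
        ("bar_classification", bar_classification),
        ("consistency", consistency),
    )
    for name, value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Assumption: the final score is the equally weighted sum of both parts.
    return 0.5 * bar_classification + 0.5 * consistency


print(confidence_score(1.0, 1.0))
print(confidence_score(1.0, 0.0))
```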
Legal Disclaimer
This package is provided for educational and research purposes only.
- This package processes publicly available PDF reports from FactSet's website
- The data extraction and processing methods are implemented for academic research
- This package is NOT affiliated with, endorsed by, or sponsored by FactSet
- For production use, please use FactSet's official API
No Warranty: This software is provided "as is" without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement.
Limitation of Liability: In no event shall the authors or copyright holders be liable for any claim, damages, or other liability arising from the use of this software.
Data Usage: Users are responsible for ensuring compliance with FactSet's terms of service and any applicable data usage agreements when using this package.
License
MIT License - See LICENSE file for details.
Acknowledgments
- FactSet (Earnings Insight reports) - Official FactSet API
- Google Cloud Vision API
- Cloudflare R2