EPS Estimates Collector

A unified Python package for extracting quarterly EPS (Earnings Per Share) estimates from FactSet Earnings Insight reports using OCR and image processing techniques.

⚠️ Disclaimer: This package is for educational and research purposes only. For production use, please use FactSet's official API. This package processes publicly available PDF reports and is not affiliated with or endorsed by FactSet.

Overview

This project processes chart images containing S&P 500 quarterly EPS data and extracts quarter labels (e.g., Q1'14, Q2'15) and corresponding EPS values. The extracted data is saved in CSV format for further analysis.
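The quarter labels on the chart follow a Q<n>'<yy> pattern. As a rough illustration of that format (not the package's internal parser), a label can be split into quarter and year like this:

import re

def parse_quarter_label(label: str) -> tuple[int, int]:
    """Split a chart label such as "Q1'14" into (quarter, year). Illustrative only."""
    match = re.fullmatch(r"Q([1-4])'(\d{2})", label.strip())
    if match is None:
        raise ValueError(f"Unrecognized quarter label: {label!r}")
    quarter, yy = int(match.group(1)), int(match.group(2))
    return quarter, 2000 + yy  # the reports cover 20xx quarters

print(parse_quarter_label("Q2'15"))  # (2, 2015)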

Motivation

Financial data providers (FactSet, Bloomberg, Investing.com, etc.) typically offer historical EPS data as actual values—once a quarter's earnings are reported, the estimate is overwritten with the actual figure. This creates a challenge for backtesting predictive models: using historical data means testing against information that was already reflected in stock prices at the time, making it difficult to evaluate the true predictive power of EPS estimates.

To address this, this project extracts point-in-time EPS estimates from historical FactSet Earnings Insight reports. By preserving the estimates as they appeared at each report date (before actual earnings were announced), a dataset can be built that accurately reflects what was known and expected at each point in time, enabling more meaningful backtesting and predictive analysis.

Project Structure

eps-estimates-collector/
├── src/eps_estimates_collector/
│   ├── core/                        # Data collection
│   │   ├── downloader.py            # PDF download
│   │   ├── extractor.py             # Chart extraction
│   │   └── ocr/                     # OCR processing
│   │       ├── processor.py         # Main pipeline
│   │       ├── google_vision_processor.py
│   │       ├── parser.py
│   │       ├── bar_classifier.py
│   │       └── coordinate_matcher.py
│   ├── analysis/                    # P/E ratio calculation
│   │   └── pe_ratio.py
│   └── utils/                       # Cloud storage
│       ├── cloudflare.py            # R2 operations
│       └── csv_storage.py           # CSV I/O
├── scripts/data_collection/         # CLI scripts
├── actions/workflow.py              # GitHub Actions
└── pyproject.toml

Installation

Option 1: Install from Git (Recommended)

# Install with uv
uv pip install git+https://github.com/seung-gu/eps-estimates-collector.git

# Or with pip
pip install git+https://github.com/seung-gu/eps-estimates-collector.git

Option 2: Local Development

# Clone repository
git clone https://github.com/seung-gu/eps-estimates-collector.git
cd eps-estimates-collector

# Install with uv
uv sync

# Or install in editable mode
uv pip install -e .

Requirements

  • Google Cloud Vision API (Required):

    • Create service account and download JSON key
    • Set GOOGLE_APPLICATION_CREDENTIALS environment variable
    • Setup Guide (a quick credential check is sketched after this list)
  • Cloudflare R2 (Optional - CI/CD only):

    • For GitHub Actions workflow only
    • Install: uv sync --extra r2
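Before running the OCR pipeline, it can help to confirm the Vision credentials are picked up. A minimal check, assuming the google-cloud-vision client library is installed (client construction fails if the key file is missing or invalid):

import os
from google.cloud import vision

# Construction raises if GOOGLE_APPLICATION_CREDENTIALS is unset or points to a bad key
assert os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"), "set GOOGLE_APPLICATION_CREDENTIALS first"
client = vision.ImageAnnotatorClient()
print("Google Cloud Vision client initialized")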

Usage

Python API

from eps_estimates_collector import calculate_pe_ratio

# Calculate P/E ratios (auto-loads CSV from public URL)
pe_df = calculate_pe_ratio(
    price_data={'2024-01-15': 150.5, '2024-02-15': 152.3},
    type='forward'
)
print(pe_df)

P/E Types (see the sketch after this list):

  • forward: Q[1:5] - Next 4 quarters (skip current)
  • mix: Q[0:4] - Current + next 3 quarters
  • trailing-like: Q[-3:1] - Last 3 + current quarter
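Each type simply selects a different window of quarters relative to the current quarter of the report. A simplified sketch of that windowing, assuming eps_by_offset maps a quarter offset (0 = current quarter, negative = already reported, positive = future) to its estimate; the real calculate_pe_ratio works on the Report_Date-indexed CSV instead:

def quarter_offsets(pe_type: str) -> range:
    """Quarter offsets summed as the EPS denominator (0 = current quarter)."""
    windows = {
        'forward': range(1, 5),         # Q[1:5]  next 4 quarters, skip current
        'mix': range(0, 4),             # Q[0:4]  current + next 3 quarters
        'trailing-like': range(-3, 1),  # Q[-3:1] last 3 + current quarter
    }
    return windows[pe_type]

def pe_ratio(price: float, eps_by_offset: dict[int, float], pe_type: str = 'forward') -> float:
    """Price divided by the sum of the selected four quarterly EPS estimates."""
    return price / sum(eps_by_offset[o] for o in quarter_offsets(pe_type))

# Hypothetical estimates around one report date
eps = {-3: 55.1, -2: 56.0, -1: 57.2, 0: 58.4, 1: 59.9, 2: 61.3, 3: 62.8, 4: 64.0}
print(round(pe_ratio(5000.0, eps, 'forward'), 2))  # 5000 / 248.0 = 20.16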

Architecture

Overview

┌─────────────────────────────────────────────────────────────────┐
│                      📦 Storage Structure                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  📦 Public Bucket (R2_PUBLIC_BUCKET_NAME)                       │
│     ├── extracted_estimates.csv          ← Public URL (no auth) │
│     └── extracted_estimates_confidence.csv                      │
│                                                                 │
│  🔒 Private Bucket (R2_BUCKET_NAME)                             │
│     ├── reports/*.pdf                    ← API key required     │
│     └── estimates/*.png                  ← API key required     │
└─────────────────────────────────────────────────────────────────┘

User Flow 1: API Users (Read-only)

┌──────────────────────────────────────────────────────────────────┐
│  Python Script                                                   │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  from eps_estimates_collector import calculate_pe_ratio           │
│                                                                  │
│  pe_df = calculate_pe_ratio(                                     │
│      price_data={'2024-01-15': 150.5},                           │
│      type='forward'                                              │
│  )                                                               │
│     │                                                            │
│     ├─ read_csv_from_cloud("extracted_estimates.csv")            │
│     │      │                                                     │
│     │      └─ GET https://pub-xxx.r2.dev/extracted_estimates.csv │
│     │            ↑                                               │
│     │            └─ ✅ No API key needed (public URL)            │
│     │                                                            │
│     └─ Calculate P/E ratios → Return DataFrame                   │
└──────────────────────────────────────────────────────────────────┘

Features:

  • ✅ No API keys required
  • ✅ Always loads latest data
  • ✅ No local files needed
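Under the hood this is just an anonymous HTTP GET of the published CSV. A minimal equivalent with pandas, using the public URL listed in the Data Format section below:

import pandas as pd

# Public bucket URL (see Data Format); no credentials required
PUBLIC_CSV_URL = "https://pub-62707afd3ebb422aae744c63c49d36a0.r2.dev/extracted_estimates.csv"

estimates = pd.read_csv(PUBLIC_CSV_URL, parse_dates=["Report_Date"])
print(estimates.tail())  # most recent report dates and their point-in-time estimates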

User Flow 2: GitHub Actions Workflow (Read/Write)

┌─────────────────────────────────────────────────────────────────┐
│  Workflow Steps                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Step 1: Check last date                                        │
│     read_csv_from_cloud("extracted_estimates.csv")              │
│        → GET public URL                                         │
│        → Get last Report_Date                                   │
│                                                                 │
│  Step 2: Download new PDFs                                      │
│     download_pdfs(start_date=last_date)                         │
│        → FactSet website                                        │
│        → Save to local (temp)                                   │
│                                                                 │
│  Step 3: Extract charts                                         │
│     extract_charts(pdfs)                                        │
│        → PDF → PNG                                              │
│        → Save to local (temp)                                   │
│                                                                 │
│  Step 4: Process images                                         │
│     process_images(directory)                                   │
│        ├─ read_csv_from_cloud() ← Load existing CSV             │
│        ├─ OCR processing                                        │
│        ├─ Merge existing + new data                             │
│        └─ Return DataFrame (don't save locally)                 │
│                                                                 │
│  Step 5: Upload results                                         │
│     ├─ write_csv_to_cloud(df, "extracted_estimates.csv")        │
│     │     → PUT to public bucket (with API key)                 │
│     │     → Accessible via public URL                           │
│     │                                                           │
│     └─ upload_to_cloud(pdfs/pngs)                               │
│           → PUT to private bucket (with API key)                │
│           → Only accessible with API key                        │
└─────────────────────────────────────────────────────────────────┘

Features:

  • ✅ Reads from public URL (existing data)
  • ✅ Writes to public bucket (CSV) with API key
  • ✅ Writes to private bucket (PDF/PNG) with API key
  • ✅ Appends new data (no overwrite)
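Put together, actions/workflow.py runs roughly the sequence below. This is a hedged sketch reusing the function names and call shapes from the diagram above; exact signatures, return values, and import paths inside the package may differ.

def run_weekly_update():
    # Step 1: read the existing public CSV (no auth) and find the last report date
    existing = read_csv_from_cloud("extracted_estimates.csv")
    last_date = existing["Report_Date"].max()

    # Steps 2-3: fetch new FactSet PDFs and render their charts to PNG (temp files)
    pdfs = download_pdfs(start_date=last_date)
    chart_dir = extract_charts(pdfs)  # return value assumed for this sketch

    # Step 4: OCR the chart images and merge the new rows with the existing data
    df = process_images(chart_dir)

    # Step 5: publish the merged CSV and archive the raw artifacts (API key required)
    write_csv_to_cloud(df, "extracted_estimates.csv")
    upload_to_cloud(pdfs)
    upload_to_cloud(chart_dir)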

Environment Variables

# API Users
# → No setup needed (public URL hardcoded)

# GitHub Actions Workflow
R2_BUCKET_NAME=factset-data          # 🔒 Private bucket
R2_PUBLIC_BUCKET_NAME=factset-public # 📦 Public bucket
R2_ACCOUNT_ID=xxx
R2_ACCESS_KEY_ID=xxx
R2_SECRET_ACCESS_KEY=xxx
CI=true

Data Format

Main CSV (extracted_estimates.csv)

Report_Date   Q4'13   Q1'14   Q2'14   ...
2016-12-09    24.89   26.23   27.45   ...
2016-12-16    24.89   26.25   27.48   ...
  • Report_Date: FactSet report date (YYYY-MM-DD)
  • Quarters: EPS estimates in dollars
  • Public URL: https://pub-62707afd3ebb422aae744c63c49d36a0.r2.dev/extracted_estimates.csv

Confidence CSV

Same structure as the main CSV; each cell holds the OCR confidence score (0-1) for the corresponding estimate.
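Because the two files share the same layout, the confidence CSV can be used as a mask over the main CSV. A small sketch with pandas, assuming the confidence file is published at the same public base URL (as the bucket layout above suggests); the 0.8 cut-off is an arbitrary threshold chosen for illustration:

import pandas as pd

BASE_URL = "https://pub-62707afd3ebb422aae744c63c49d36a0.r2.dev"
values = pd.read_csv(f"{BASE_URL}/extracted_estimates.csv", index_col="Report_Date")
confidence = pd.read_csv(f"{BASE_URL}/extracted_estimates_confidence.csv", index_col="Report_Date")

# Set any estimate whose OCR confidence falls below the chosen threshold to NaN
reliable = values.where(confidence >= 0.8)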

API Reference

calculate_pe_ratio(price_data, type='forward', output_csv=None)

Calculate P/E ratios from EPS estimates.

Parameters:

  • price_data (DataFrame | dict | None):
    • DataFrame: columns Date, Price
    • Dict: {'2024-01-15': 150.5, ...}
    • None: Returns template
  • type (str): 'forward', 'mix', or 'trailing-like'
  • output_csv (Path, optional): Save results

Returns: DataFrame with P/E ratios

Example:

from eps_estimates_collector import calculate_pe_ratio

pe_df = calculate_pe_ratio(
    price_data={'2024-01-15': 150.5},
    type='forward',
    output_csv='pe_ratios.csv'
)

GitHub Actions

Setup Secrets

Settings → Secrets → Actions:

GOOGLE_APPLICATION_CREDENTIALS_JSON
R2_BUCKET_NAME
R2_PUBLIC_BUCKET_NAME
R2_ACCOUNT_ID
R2_ACCESS_KEY_ID
R2_SECRET_ACCESS_KEY

Workflow

  • Schedule: Every Monday 00:00 UTC
  • Manual: GitHub Actions tab
  • Steps:
    1. Check last report date (public URL)
    2. Download new PDFs
    3. Extract charts → Process with OCR
    4. Upload to cloud (PDFs/PNGs → private, CSVs → public)

Recent Updates

v0.3.0 (2025-11-19) - Cloud-First Architecture

  • Cloud-first design: CSV data always from public URL
  • Two-bucket strategy: Private (PDF/PNG) + Public (CSV)
  • Simplified codebase: Removed local file logic
  • Code cleanup: 45% reduction in csv_storage.py
  • Better organization: Split functions by responsibility
  • API-focused: Optimized for package users

v0.2.0 (2025-11-19)

  • Unified package structure
  • Code reduction (33%)
  • P/E ratio calculation module

Technical Details

  • OCR: Google Cloud Vision API (149 regions/image)
  • Text Matching: Coordinate-based spatial algorithm
  • Bar Classification: 3-method ensemble (100% agreement)
  • Confidence Score: Bar classification (0.5) + consistency (0.5)
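Read as a formula, the per-cell confidence is an equal-weight blend of the two component scores named in the last bullet (the component values below are illustrative):

def combined_confidence(bar_score: float, consistency_score: float) -> float:
    """Equal-weight blend of bar-classification and consistency scores, each in [0, 1]."""
    return 0.5 * bar_score + 0.5 * consistency_score

print(combined_confidence(1.0, 0.9))  # 0.95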

See DEVELOPMENT_LOG.md for detailed technical documentation.

Legal Disclaimer

This package is provided for educational and research purposes only.

  • This package processes publicly available PDF reports from FactSet's website
  • The data extraction and processing methods are implemented for academic research
  • This package is NOT affiliated with, endorsed by, or sponsored by FactSet
  • For production use, please use FactSet's official API

No Warranty: This software is provided "as is" without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement.

Limitation of Liability: In no event shall the authors or copyright holders be liable for any claim, damages, or other liability arising from the use of this software.

Data Usage: Users are responsible for ensuring compliance with FactSet's terms of service and any applicable data usage agreements when using this package.

License

MIT License - See LICENSE file for details.

Acknowledgments
