SEC filings and earnings call transcripts data

SEC-filings-Markdown

Configuration

Settings are loaded via Pydantic Settings from environment variables or a .env file:

Variable	Description	Default
`SEC_API_ORGANIZATION`	Organization name for SEC API User-Agent	`Your-Organization`
`SEC_API_EMAIL`	Contact email for SEC API User-Agent	`your-email@example.com`
`OLMOCR_SERVER`	vLLM server URL for olmOCR	`http://localhost:8000/v1`
`OLMOCR_MODEL`	Model name for olmOCR	`allenai/olmOCR-2-7B-1025-FP8`
`OLMOCR_WORKSPACE`	Workspace directory for OCR output	`./localworkspace`
`EARNINGS_TRANSCRIPTS_DIR`	Directory for fetched transcript Markdown files	`earnings_transcripts_data`
`EMBEDDING_SERVER`	OpenAI-compatible embedding API (e.g. vLLM pooling)	`http://127.0.0.1:8888/v1`
`EMBEDDING_MODEL`	Model id passed to the embedding server	`Qwen/Qwen3-Embedding-0.6B`
`CHROMA_PERSIST_DIR`	ChromaDB persistence directory	`./chroma_db`
`MCP_HOST`	Bind address for the MCP HTTP server	`127.0.0.1`
`MCP_PORT`	Listen port for the MCP HTTP server	`8069`
`MCP_NGROK_ALLOWED_HOSTS`	JSON list of extra `Host` values allowed through the tunnel (see MCP section)	(see `finance_data/settings.py`)
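
The lookup order described above can be illustrated with a standard-library sketch. This is not the project's implementation (which uses Pydantic Settings); `DEFAULTS`, `parse_dotenv`, and `load_settings` are hypothetical names, and the sketch assumes the usual precedence in which real environment variables win over `.env` values, which in turn win over defaults.

```python
import os

# Defaults copied from the table above (a subset, for brevity).
DEFAULTS = {
    "OLMOCR_SERVER": "http://localhost:8000/v1",
    "EMBEDDING_SERVER": "http://127.0.0.1:8888/v1",
    "EMBEDDING_MODEL": "Qwen/Qwen3-Embedding-0.6B",
    "CHROMA_PERSIST_DIR": "./chroma_db",
    "MCP_HOST": "127.0.0.1",
    "MCP_PORT": "8069",
}


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes such as '...' or "...".
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_settings(env=None, dotenv_text: str = "") -> dict:
    """Resolve each setting: environment first, then .env, then default."""
    env = os.environ if env is None else env
    dotenv = parse_dotenv(dotenv_text)
    return {
        key: env.get(key, dotenv.get(key, default))
        for key, default in DEFAULTS.items()
    }
```

Calling `load_settings()` with no arguments reads the current process environment; passing `env` and `dotenv_text` explicitly makes the precedence easy to check in isolation.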

MCP server

mcp_server.py exposes SEC filing and earnings-transcript workflows over MCP (fetch/OCR, embed, semantic search) using the same backends as the REST API: olmOCR and an OpenAI-compatible embedding endpoint backed by vLLM.

1. Start the vLLM backends

The MCP tools need both servers running before you start the MCP process.

Terminal A — olmOCR (vision / markdown pipeline) — must match OLMOCR_SERVER (default http://localhost:8000/v1):

make vllm-olmocr-serve

Terminal B — embeddings (pooling runner) — must match EMBEDDING_SERVER (default http://127.0.0.1:8888/v1):

make vllm-embd-serve

If you change PORT / EMBD_PORT in the Makefile or your environment, set OLMOCR_SERVER and EMBEDDING_SERVER in .env so they point at the same hosts and ports.
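
A quick preflight check can confirm that both URLs point at running servers before you start the MCP process. The sketch below assumes each backend answers the standard model-listing route of the OpenAI-compatible API (`GET <base>/models`); `models_url` and `is_up` are hypothetical helper names, not part of this package.

```python
import http.client
import urllib.request


def models_url(base_url: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base>/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False
```

For example, `is_up("http://localhost:8000/v1")` and `is_up("http://127.0.0.1:8888/v1")` should both return `True` once the two `make` targets above are serving on their default ports.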

2. Install dependencies and run the MCP server

Chroma, OpenAI client, and OCR-related imports require the ocr-md group in addition to mcp:

uv sync --group ocr-md --group mcp
uv run --group ocr-md --group mcp python mcp_server.py

The server listens on MCP_HOST / MCP_PORT (defaults 127.0.0.1:8069) using the streamable HTTP transport. The HTTP endpoint path is /mcp (FastMCP default), so locally that is http://127.0.0.1:8069/mcp.

3. Expose with ngrok and connect a client

To use the MCP server from another machine or from a hosted MCP client, tunnel the MCP port with ngrok (or a similar HTTPS reverse proxy).

Install and log in to ngrok (ngrok config add-authtoken …).
With mcp_server.py still running, forward the MCP port (replace 8069 if you changed MCP_PORT):
```
ngrok http 8069
```
Note the public HTTPS hostname ngrok assigns (for example https://random-name.ngrok-free.app or *.ngrok-free.dev).
Add that hostname to MCP_NGROK_ALLOWED_HOSTS so DNS rebinding protection accepts the tunnel’s Host header. In .env, use a JSON array, for example:
```
MCP_NGROK_ALLOWED_HOSTS='["random-name.ngrok-free.app"]'
```
Restart mcp_server.py after changing this.
Point your MCP client at the tunneled URL including /mcp, for example:

https://random-name.ngrok-free.app/mcp

Use your client’s documented configuration for Streamable HTTP / URL-based MCP servers. If the tunnel hostname changes each time you run ngrok, update MCP_NGROK_ALLOWED_HOSTS and restart the MCP process.
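
Because the hostname has to be copied into two places (the `.env` entry and the client URL), a small helper can derive both from the URL ngrok prints. This is a sketch only; `tunnel_config` is a hypothetical name, and it assumes the setting is a plain JSON array of hostnames as in the example above.

```python
import json
from urllib.parse import urlparse


def tunnel_config(public_url: str, existing_hosts=()):
    """Derive the .env line and the client URL for an ngrok public URL."""
    host = urlparse(public_url).hostname
    if not host:
        raise ValueError(f"no hostname in {public_url!r}")
    # Append the new host, dropping duplicates while keeping order.
    hosts = list(dict.fromkeys([*existing_hosts, host]))
    env_line = f"MCP_NGROK_ALLOWED_HOSTS='{json.dumps(hosts)}'"
    client_url = f"https://{host}/mcp"
    return env_line, client_url
```

Calling `tunnel_config("https://random-name.ngrok-free.app")` yields the `.env` line shown earlier and the `/mcp` URL to give to the client.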

Tools and resources

Tools (representative):

company_name_to_ticker_tool, list_resources_tool
sec_main_to_markdown_and_embed_tool, earnings_transcript_for_quarter_tool
search_sec_filings_tool, search_transcripts_tool

For an interactive walkthrough of how to use the MCP server, open the linked ChatGPT chats.

Resources (URI catalogs under resource://sec-filings-data/...): combined SEC + transcript file listings and per-root trees.
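
On the wire, MCP tool invocations are JSON-RPC 2.0 messages with the method `tools/call`. The sketch below only builds such a message. The argument names (`ticker`, `year`, `query`, `top_k`) are an assumption borrowed from the REST search body later in this README, not confirmed tool parameters, and `tool_call` is a hypothetical helper. A real client also performs the `initialize` handshake first, which MCP client libraries handle for you.

```python
import itertools
import json

# Monotonic request ids, as JSON-RPC expects a unique id per request.
_ids = itertools.count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_ids),
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


# Hypothetical arguments; check the tool's published schema before use.
message = tool_call(
    "search_transcripts_tool",
    {"ticker": "AMZN", "year": "2025", "query": "AWS revenue growth", "top_k": 5},
)
```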

Docker

Build

docker build -t sec-filings-md .

The image uses the CUDA runtime base to keep its footprint small and preinstalls Playwright Chromium for scraping. To skip the Playwright browser installation and reduce the image size further, build with:

docker build --build-arg INSTALL_PLAYWRIGHT_BROWSER=0 -t sec-filings-md .

Or via Makefile:

make docker-build

Run

GPU_DEVICE=${GPU_DEVICE:-3}
docker run --gpus device=${GPU_DEVICE} \
  -e SEC_API_ORGANIZATION="Your-Organization" \
  -e SEC_API_EMAIL="your-email@example.com" \
  -v ./sec_data:/app/sec_data \
  -v ./localworkspace:/app/localworkspace \
  -p 8081:8081 \
  sec-filings-md

Or via Makefile (build + run in one step):

make docker-start

Makefile overrides:

Variable	Description	Default
`IMAGE_NAME`	Docker image name	`sec-filings-md`
`GPU_DEVICE`	GPU device index	`0`
`API_PORT`	Host port for API	`8081`
`SEC_API_ORGANIZATION`	SEC API User-Agent org	`Your-Organization`
`SEC_API_EMAIL`	SEC API contact email	`your-email@example.com`

Example with overrides:

make docker-start GPU_DEVICE=3 SEC_API_EMAIL="you@example.com"

The two volumes persist data across container restarts:

Volume	Container path	Purpose
`sec_data`	`/app/sec_data`	Downloaded SEC filing PDFs
`localworkspace`	`/app/localworkspace`	OCR workspace and output markdown

Override the workspace path at runtime with -e OLMOCR_WORKSPACE=/custom/path.

Installation

uv sync
playwright install chromium

Install OCR/markdown + embedding stack dependencies when you need those pipelines:

uv sync --group ocr-md

Install the published package from PyPI:

pip install finance_data_llm

Use package functions directly from Python (no server process required):

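import asyncio

from finance_data.filings.sec_data import sec_main
from finance_data.filings.utils import company_to_ticker

# Resolve the ticker from the company name; fall back to "AMZN" if the lookup returns nothing.
ticker = company_to_ticker("Amazon") or "AMZN"

# sec_main is a coroutine, so run it to completion with asyncio.run.
sec_result, pdf_path = asyncio.run(
    sec_main(ticker=ticker, year="2025", filing_type="10-K")
)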

If you do want to run the API, use the packaged console script:

finance-data-llm-server

Usage

Start the vLLM server for olmOCR:

make vllm-olmocr-serve

Benchmark vLLM with guidellm (start the vLLM server first, then in another terminal):

make guidellm-benchmark

Fetch SEC filings:

uv run python -m finance_data.filings.sec_data --ticker AMZN --year 2025

Run OCR pipeline:

uv run python -m finance_data.ocr.olmocr_pipeline --pdf-dir sec_data/AMZN-2025

Earnings call transcripts

Transcripts are scraped from discountingcashflows.com (Playwright + Chromium). Each quarter is saved as one Markdown file under {EARNINGS_TRANSCRIPTS_DIR}/{TICKER}/{year}/Q{n}_{YYYY-MM-DD}.md (date may be unknown-date when unavailable).
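
The file layout above can be expressed as a small path builder. This is an illustrative sketch: `transcript_path` is a hypothetical helper, and uppercasing the ticker is an assumption based on the `{TICKER}` placeholder.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def transcript_path(
    root: str,
    ticker: str,
    year: int,
    quarter: int,
    call_date: Optional[date] = None,
) -> Path:
    """Build {root}/{TICKER}/{year}/Q{n}_{YYYY-MM-DD}.md for one quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    # Fall back to the literal "unknown-date" when no call date is available.
    stamp = call_date.isoformat() if call_date else "unknown-date"
    return Path(root) / ticker.upper() / str(year) / f"Q{quarter}_{stamp}.md"


# The date below is made up purely to show the file name format.
example = transcript_path("earnings_transcripts_data", "AMZN", 2025, 1, date(2025, 5, 1))
```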

1. Fetch transcripts

CLI (writes files under earnings_transcripts_data by default):

uv run python -m finance_data.earnings_transcripts.transcripts AMZN 2025

Optional: --max-concurrency (default 4) to limit parallel quarter fetches.

HTTP (same fetch + persist, with the API running):

curl -s -X POST "http://127.0.0.1:8081/earnings_transcripts/for_year" \
  -H "Content-Type: application/json" \
  -d '{"ticker":"AMZN","year":2025}'

Response body is a JSON array of transcript objects (ticker, year, quarter_num, date, speaker_texts, …).
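
The same request can be issued from Python with the standard library. This is a sketch of the curl call above, not a client shipped with the package: `API_BASE`, `build_post`, and `fetch_transcripts` are hypothetical names, and the long timeout is an assumption that scraping four quarters may take a while.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8081"  # default API address used in this README


def build_post(path: str, payload: dict, base: str = API_BASE) -> urllib.request.Request:
    """Build a JSON POST request for one of the API endpoints."""
    return urllib.request.Request(
        base + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_transcripts(ticker: str, year: int, timeout: float = 600.0) -> list:
    """Fetch and persist all quarters for a ticker/year; return the JSON array."""
    req = build_post("/earnings_transcripts/for_year", {"ticker": ticker, "year": year})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the API running, `fetch_transcripts("AMZN", 2025)` returns the list of transcript objects described above.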

2. Start embedding server and API

Transcript chunks are embedded with the same OpenAI-compatible embedding endpoint as SEC filings (EMBEDDING_SERVER / EMBEDDING_MODEL). In one terminal:

make vllm-embd-serve

In another:

make start-server

(Adjust API_PORT / EMBD_PORT in the Makefile or your environment if needed.)

3. Index transcripts in Chroma

curl -s -X POST "http://127.0.0.1:8081/vector_store/embed_transcripts" \
  -H "Content-Type: application/json" \
  -d '{"ticker":"AMZN","year":"2025","force":false}'

Use "force": true to replace existing vectors for those quarters. In the index, each transcript's filing type is its quarter label (Q1–Q4).

4. Search across indexed quarters

Search merges hits from all transcript quarters present for that ticker/year:

curl -s -X POST "http://127.0.0.1:8081/vector_store/search_transcripts" \
  -H "Content-Type: application/json" \
  -d '{"ticker":"AMZN","year":"2025","query":"AWS revenue growth","top_k":5}'

Each result includes filing_type (Q1, …) so you can see which call the chunk came from.
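
Since hits from all quarters arrive merged, it can help to regroup them by call. The sketch below assumes the response decodes to a list of objects that each carry a `filing_type` key. The other fields in the sample (`text`, `score`) are made up for illustration, and `group_by_quarter` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_quarter(results: list) -> dict:
    """Group search hits by their filing_type (Q1..Q4), sorted by quarter."""
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit["filing_type"]].append(hit)
    return dict(sorted(grouped.items()))


# Made-up sample hits; real responses may carry different fields.
sample = [
    {"filing_type": "Q3", "text": "placeholder chunk A", "score": 0.9},
    {"filing_type": "Q1", "text": "placeholder chunk B", "score": 0.8},
    {"filing_type": "Q3", "text": "placeholder chunk C", "score": 0.7},
]
by_quarter = group_by_quarter(sample)
```

Within each quarter the hits keep the order in which the server returned them.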
