SEC-Analyzer
Extract structured data from SEC filings using LLM + Pydantic presets.
Turn any SEC filing (10-K, 10-Q, 20-F, DEF 14A, ...) into structured JSON — define a Pydantic model, and the library does the rest.
Installation · Quick Start · Custom Presets · API Reference · CLI
Why This Library?
SEC filings contain invaluable data — supply chains, revenue concentration, executive compensation, risk factors — but every filing has a different format. Traditional parsing breaks constantly.
This library uses LLM structured output (Gemini) to extract exactly the data you define in a Pydantic model. The LLM reads the filing and fills in your schema. No regex, no HTML parsing, no breakage.
```python
from sec_analyzer import extract
from sec_analyzer.presets import SupplyChain

result = extract("NVDA", preset=SupplyChain)
print(result["data"]["suppliers"])
# [{'entity': 'Taiwan Semiconductor Manufacturing Company Limited',
#   'relationship': 'foundry for semiconductor wafers',
#   'context': 'We utilize foundries, such as TSMC and Samsung...'}, ...]
```
Installation
```bash
pip install sec-analyzer
```
Requires Python 3.10+ and a Google AI API key.
Quick Start
1. Set your API key
```bash
export GOOGLE_API_KEY="your-key-here"
export EDGAR_IDENTITY="YourApp/1.0 your@email.com"
```

Or create a `.env` file:

```
GOOGLE_API_KEY=your-key-here
EDGAR_IDENTITY=YourApp/1.0 your@email.com
```
2. Extract data
```python
from sec_analyzer import extract
from sec_analyzer.presets import SupplyChain

# Latest 10-K
result = extract("NVDA", preset=SupplyChain)

# Specific form
result = extract("TSM", preset=SupplyChain, form="20-F")

# Specific filing date
result = extract("AAPL", preset=SupplyChain, filing_date="2025-10-30")
```
3. Use the result
```python
filing = result["filing"]
# {'form': '10-K', 'filing_date': '2026-02-25', 'accession_number': '...', 'filing_url': '...'}

data = result["data"]
print(f"Suppliers: {len(data['suppliers'])}")
print(f"Customers: {len(data['customers'])}")
print(f"Single-source deps: {len(data['single_source_dependencies'])}")
```
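Because the result is a plain dict, it can be serialized straight to disk for caching or downstream analysis. A minimal sketch, using hypothetical sample data in place of a live `extract()` call:

```python
import json

# Hypothetical sample shaped like the dict returned by extract()
result = {
    "filing": {"form": "10-K", "filing_date": "2026-02-25"},
    "data": {
        "suppliers": [{"entity": "TSMC", "relationship": "foundry"}],
        "customers": [],
        "single_source_dependencies": [],
    },
}

# Persist the whole result; it round-trips cleanly because it is plain dicts/lists
with open("nvda_supply_chain.json", "w") as f:
    json.dump(result, f, indent=2)

# Quick count of every list-valued category
summary = {k: len(v) for k, v in result["data"].items() if isinstance(v, list)}
print(summary)  # {'suppliers': 1, 'customers': 0, 'single_source_dependencies': 0}
```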
Custom Presets
The real power: define your own Pydantic model to extract anything.
Basic custom preset
```python
from pydantic import BaseModel, Field
from sec_analyzer import extract

class RiskFactors(BaseModel):
    regulatory_risks: list[dict] = Field(
        default_factory=list,
        description="Government regulations that could impact the business",
    )
    litigation: list[dict] = Field(
        default_factory=list,
        description="Pending lawsuits and legal proceedings",
    )
    cybersecurity_risks: list[dict] = Field(
        default_factory=list,
        description="Data breach and cybersecurity threats",
    )

result = extract("META", preset=RiskFactors)
```
When no `__prompt__` is defined, the library auto-generates a prompt from your field descriptions.
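The auto-generated prompt is internal to the library, but conceptually it stitches your field names and descriptions into extraction instructions. A rough, stdlib-only approximation (the template wording here is invented, not the library's actual prompt):

```python
# Hypothetical field descriptions, as they would come from a Pydantic model
fields = {
    "regulatory_risks": "Government regulations that could impact the business",
    "litigation": "Pending lawsuits and legal proceedings",
}

def build_prompt(company_name: str, fields: dict[str, str]) -> str:
    # One bullet per schema field, with its description as extraction guidance
    lines = [f"- {name}: {desc}" for name, desc in fields.items()]
    return (
        f"Extract the following from the {company_name} filing:\n"
        + "\n".join(lines)
    )

prompt = build_prompt("META", fields)
print(prompt)
```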
Advanced: custom prompt
For full control over the prompt, add a `__prompt__` class variable:
```python
from typing import ClassVar

from pydantic import BaseModel, Field
from sec_analyzer import extract

class ExecutiveComp(BaseModel):
    __prompt__: ClassVar[str] = """\
You are analyzing a DEF 14A proxy statement for {company_name}.

Extract executive compensation data from the Summary Compensation Table
and related disclosure sections.

Rules:
1. Include only Named Executive Officers (NEOs)
2. All dollar amounts in exact figures from the filing
3. Include stock awards, option awards, and non-equity incentive plan separately

Filing text:
{filing_text}
"""

    executives: list[dict] = Field(description="NEO compensation details")
    equity_awards: list[dict] = Field(description="Stock and option grant details")

result = extract("AAPL", preset=ExecutiveComp, form="DEF 14A")
```
The `{company_name}` and `{filing_text}` placeholders are filled automatically.
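The substitution itself is plain Python formatting; assuming the library uses `str.format`-style templates (which the `{...}` syntax suggests), the fill step looks like this, shown with dummy values:

```python
template = """\
You are analyzing a DEF 14A proxy statement for {company_name}.

Filing text:
{filing_text}
"""

# The library supplies these values at extraction time; dummy data here
prompt = template.format(company_name="Apple Inc.", filing_text="(filing markdown here)")
print(prompt)
```

One practical consequence of `str.format` templates: any literal braces in a custom prompt (e.g. an inline JSON example) would need doubling as `{{` and `}}`.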
Built-in Presets
SupplyChain
Extracts 11 categories of supply chain intelligence from 10-K/10-Q/20-F filings:
| Category | Description |
|---|---|
| `suppliers` | Companies supplying products/materials/services |
| `customers` | Companies purchasing products/services |
| `single_source_dependencies` | Components with sole-source suppliers |
| `geographic_concentration` | Manufacturing/sourcing location concentration |
| `capacity_constraints` | Production limitations and lead times |
| `supply_chain_risks` | Disruption risks (tariffs, shortages, geopolitical) |
| `revenue_concentration` | Customer/segment revenue % from Notes |
| `geographic_revenue` | Revenue by country/region from Notes |
| `purchase_obligations` | Commitments and take-or-pay contracts |
| `market_risk_disclosures` | Commodity/FX/interest rate exposures (Item 7A) |
| `inventory_composition` | Raw materials/WIP/finished goods breakdown |
API Reference
```python
extract(symbol, preset, form="10-K", filing_date=None, max_chars=2_000_000, api_key=None, model=None)
```
| Parameter | Type | Description |
|---|---|---|
| `symbol` | `str` | Ticker symbol (e.g., `"NVDA"`) |
| `preset` | `BaseModel` class | Pydantic model defining the extraction schema |
| `form` | `str` | Filing type. Auto-fallback 10-K → 20-F |
| `filing_date` | `str` | Specific date (`YYYY-MM-DD`). `None` = latest |
| `max_chars` | `int` | Max filing markdown length |
| `api_key` | `str` | Google API key (fallback: `GOOGLE_API_KEY` env) |
| `model` | `str` | Gemini model (fallback: `GOOGLE_MODEL` env, default: `gemini-2.5-flash`) |
Returns `{"filing": {...}, "data": {...}}`.
CLI
```bash
# Supply chain extraction (default)
sec-analyzer NVDA

# Specific form
sec-analyzer TSM --form 20-F

# Compact JSON
sec-analyzer NVDA --json

# Specific filing date
sec-analyzer AAPL --filing-date 2025-10-30
```
How It Works
1. edgartools finds the filing on SEC EDGAR
2. Filing converted to markdown (tables preserved)
3. Full markdown + Pydantic schema sent to Gemini
4. Gemini returns structured JSON matching the schema
5. Pydantic validates and returns typed data
The key insight: Gemini's structured output mode forces the response to match your Pydantic schema exactly. No post-processing, no regex, no parsing.
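The final validation step can be illustrated without an API call: once structured output guarantees the response matches the schema, parsing reduces to `json.loads` plus a shape check. A toy sketch with a hand-written "model response" standing in for Gemini (no network involved):

```python
import json

# Keys the Pydantic schema would require
schema_keys = {"suppliers", "customers"}

# Stand-in for a structured-output response from the model
response_text = '{"suppliers": [{"entity": "TSMC"}], "customers": []}'

data = json.loads(response_text)
assert set(data) == schema_keys  # structured output guarantees this shape
print(data["suppliers"][0]["entity"])  # TSMC
```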
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `GOOGLE_API_KEY` | Yes | - | Google AI API key |
| `EDGAR_IDENTITY` | No | `SECAnalyzer/1.0 user@example.com` | SEC EDGAR User-Agent |
| `GOOGLE_MODEL` | No | `gemini-2.5-flash` | Gemini model ID |
Disclaimer
This project is not affiliated with the SEC, EDGAR, or Google. Filing data comes from SEC EDGAR (public). LLM extraction may contain errors — always verify critical data against the original filing.
This tool is for research and educational purposes only. It is not financial advice.
License
MIT