SEC-Analyzer

Extract structured data from SEC filings using LLM + Pydantic presets.

Turn any SEC filing (10-K, 10-Q, 20-F, DEF 14A, ...) into structured JSON — define a Pydantic model, and the library does the rest.


Installation · Quick Start · Custom Presets · API Reference · CLI


Why This Library?

SEC filings contain invaluable data — supply chains, revenue concentration, executive compensation, risk factors — but every filing has a different format. Traditional parsing breaks constantly.

This library uses LLM structured output (Gemini) to extract exactly the data you define in a Pydantic model. The LLM reads the filing and fills in your schema. No regex, no HTML parsing, no breakage.

from sec_analyzer import extract
from sec_analyzer.presets import SupplyChain

result = extract("NVDA", preset=SupplyChain)
print(result["data"]["suppliers"])
# [{'entity': 'Taiwan Semiconductor Manufacturing Company Limited',
#   'relationship': 'foundry for semiconductor wafers',
#   'context': 'We utilize foundries, such as TSMC and Samsung...'}, ...]

Installation

pip install sec-analyzer

Requires Python 3.10+ and a Google AI API key.


Quick Start

1. Set your API key

export GOOGLE_API_KEY="your-key-here"
export EDGAR_IDENTITY="YourApp/1.0 your@email.com"

Or create a .env file:

GOOGLE_API_KEY=your-key-here
EDGAR_IDENTITY=YourApp/1.0 your@email.com

2. Extract data

from sec_analyzer import extract
from sec_analyzer.presets import SupplyChain

# Latest 10-K
result = extract("NVDA", preset=SupplyChain)

# Specific form
result = extract("TSM", preset=SupplyChain, form="20-F")

# Specific filing date
result = extract("AAPL", preset=SupplyChain, filing_date="2025-10-30")

3. Use the result

filing = result["filing"]
# {'form': '10-K', 'filing_date': '2026-02-25', 'accession_number': '...', 'filing_url': '...'}

data = result["data"]
print(f"Suppliers: {len(data['suppliers'])}")
print(f"Customers: {len(data['customers'])}")
print(f"Single-source deps: {len(data['single_source_dependencies'])}")

Custom Presets

The real power: define your own Pydantic model to extract anything.

Basic custom preset

from pydantic import BaseModel, Field
from sec_analyzer import extract

class RiskFactors(BaseModel):
    regulatory_risks: list[dict] = Field(
        default_factory=list,
        description="Government regulations that could impact the business"
    )
    litigation: list[dict] = Field(
        default_factory=list,
        description="Pending lawsuits and legal proceedings"
    )
    cybersecurity_risks: list[dict] = Field(
        default_factory=list,
        description="Data breach and cybersecurity threats"
    )

result = extract("META", preset=RiskFactors)

When no __prompt__ is defined, the library auto-generates a prompt from your field descriptions.
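To see how a prompt could be assembled from field descriptions, here is a minimal sketch; the build_prompt helper is hypothetical and illustrates the idea, not the library's actual implementation:

```python
from pydantic import BaseModel, Field


class RiskFactors(BaseModel):
    regulatory_risks: list[dict] = Field(
        default_factory=list,
        description="Government regulations that could impact the business",
    )
    litigation: list[dict] = Field(
        default_factory=list,
        description="Pending lawsuits and legal proceedings",
    )


def build_prompt(preset: type[BaseModel]) -> str:
    # Hypothetical: one instruction line per field, taken from its description.
    lines = [
        f"- {name}: {field.description}"
        for name, field in preset.model_fields.items()
    ]
    return "Extract the following from the filing:\n" + "\n".join(lines)


print(build_prompt(RiskFactors))
# Extract the following from the filing:
# - regulatory_risks: Government regulations that could impact the business
# - litigation: Pending lawsuits and legal proceedings
```

This is why good field descriptions matter: they become the extraction instructions.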

Advanced: custom prompt

For expert-level control, add a __prompt__ class variable:

from typing import ClassVar
from pydantic import BaseModel, Field

class ExecutiveComp(BaseModel):
    __prompt__: ClassVar[str] = """\
You are analyzing a DEF 14A proxy statement for {company_name}.
Extract executive compensation data from the Summary Compensation Table
and related disclosure sections.

Rules:
1. Include only Named Executive Officers (NEOs)
2. All dollar amounts in exact figures from the filing
3. Include stock awards, option awards, and non-equity incentive plan separately

Filing text:
{filing_text}
"""

    executives: list[dict] = Field(description="NEO compensation details")
    equity_awards: list[dict] = Field(description="Stock and option grant details")

result = extract("AAPL", preset=ExecutiveComp, form="DEF 14A")

The {company_name} and {filing_text} placeholders are filled automatically.


Built-in Presets

SupplyChain

Extracts 11 categories of supply chain intelligence from 10-K/10-Q/20-F filings:

| Category | Description |
|---|---|
| suppliers | Companies supplying products/materials/services |
| customers | Companies purchasing products/services |
| single_source_dependencies | Components with sole-source suppliers |
| geographic_concentration | Manufacturing/sourcing location concentration |
| capacity_constraints | Production limitations and lead times |
| supply_chain_risks | Disruption risks (tariffs, shortages, geopolitical) |
| revenue_concentration | Customer/segment revenue % from Notes |
| geographic_revenue | Revenue by country/region from Notes |
| purchase_obligations | Commitments and take-or-pay contracts |
| market_risk_disclosures | Commodity/FX/interest rate exposures (Item 7A) |
| inventory_composition | Raw materials/WIP/finished goods breakdown |

API Reference

extract(symbol, preset, form="10-K", filing_date=None, max_chars=2_000_000, api_key=None, model=None)

| Parameter | Type | Description |
|---|---|---|
| symbol | str | Ticker symbol (e.g., "NVDA") |
| preset | BaseModel class | Pydantic model defining the extraction schema |
| form | str | Filing type. Auto-fallback 10-K → 20-F |
| filing_date | str | Specific date (YYYY-MM-DD). None = latest |
| max_chars | int | Max filing markdown length |
| api_key | str | Google API key (fallback: GOOGLE_API_KEY env) |
| model | str | Gemini model (fallback: GOOGLE_MODEL env, default: gemini-2.5-flash) |

Returns {"filing": {...}, "data": {...}}
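The return value is a plain dict; a small sketch of consuming it (the literal below is fabricated for illustration and stands in for what extract() returns):

```python
# Illustrative result shape; real values come from extract().
result = {
    "filing": {
        "form": "10-K",
        "filing_date": "2026-02-25",
        "accession_number": "...",
        "filing_url": "...",
    },
    "data": {
        "suppliers": [
            {"entity": "TSMC", "relationship": "foundry", "context": "..."},
        ],
        "customers": [],
    },
}

print(result["filing"]["form"], len(result["data"]["suppliers"]))
# 10-K 1
```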


CLI

# Supply chain extraction (default)
sec-analyzer NVDA

# Specific form
sec-analyzer TSM --form 20-F

# Compact JSON
sec-analyzer NVDA --json

# Specific filing date
sec-analyzer AAPL --filing-date 2025-10-30

How It Works

1. edgartools finds the filing on SEC EDGAR
2. Filing converted to markdown (tables preserved)
3. Full markdown + Pydantic schema sent to Gemini
4. Gemini returns structured JSON matching the schema
5. Pydantic validates and returns typed data

The key insight: Gemini's structured output mode forces the response to match your Pydantic schema exactly. No post-processing, no regex, no parsing.
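Steps 4-5 boil down to validating the model's JSON against the preset. A self-contained sketch of that final step, using a simplified stand-in class (not the library's actual SupplyChain preset) and a hand-written JSON string in place of Gemini's response:

```python
from pydantic import BaseModel, Field


class SupplyChain(BaseModel):
    # Simplified stand-in for the built-in preset.
    suppliers: list[dict] = Field(default_factory=list)
    customers: list[dict] = Field(default_factory=list)


# Stand-in for the structured JSON returned by Gemini.
raw = '{"suppliers": [{"entity": "TSMC", "relationship": "foundry"}], "customers": []}'

# Raises pydantic.ValidationError if the JSON doesn't match the schema.
parsed = SupplyChain.model_validate_json(raw)
print(parsed.suppliers[0]["entity"])
# TSMC
```

Because the schema is enforced on the model side and re-validated on the client side, a response that doesn't fit your preset fails loudly instead of silently degrading.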


Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| GOOGLE_API_KEY | Yes | - | Google AI API key |
| EDGAR_IDENTITY | No | SECAnalyzer/1.0 user@example.com | SEC EDGAR User-Agent |
| GOOGLE_MODEL | No | gemini-2.5-flash | Gemini model ID |

Disclaimer

This project is not affiliated with the SEC, EDGAR, or Google. Filing data comes from SEC EDGAR (public). LLM extraction may contain errors — always verify critical data against the original filing.

This tool is for research and educational purposes only. It is not financial advice.


License

MIT
