Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG

sec2md

Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Ready for production.

Before and after comparison of an Apple 10-K cover page: raw SEC HTML (left) vs. clean Markdown (right)


The Problem

RAG pipelines fail on SEC filings because standard parsers destroy document structure.

When you flatten a 200-page 10-K to plain text:

  • Tables break — Complex financial statements become misaligned text
  • Pages are lost — Can't cite sources or trace answers back
  • Sections merge — Risk Factors and MD&A become indistinguishable
  • Formatting is stripped — Headers, bolds, lists (LLM reasoning cues) gone
  • Retrieval fails — Chunks without structure return wrong context

Your RAG system is only as good as your data. Garbage in, garbage out.

The Solution

sec2md rebuilds SEC filings as clean, semantic Markdown designed for AI systems:

  • Preserves structure - Headers (#), paragraphs, lists maintained
  • Converts tables - Complex HTML tables → clean Markdown pipes
  • Strips noise - XBRL tags, inline styles, and boilerplate removed
  • Tracks pages - Original pagination preserved for citation
  • Detects sections - Auto-extract Risk Factors, MD&A, Business sections
  • Chunks intelligently - Page-aware splitting with metadata headers
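To make the table conversion step concrete, here is a minimal sketch of turning a simple HTML table into Markdown pipes using only the standard library. It is not sec2md's implementation: the class `TableToMarkdown` and the function `html_table_to_markdown` are hypothetical names, and the sketch assumes a single, non-nested table whose first row is the header.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the cells of a single, non-nested HTML table row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested inline tags (span, b, ...) still lands in the open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so a cell cannot break its row.
            text = " ".join("".join(self._cell).split()).replace("|", "\\|")
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html: str) -> str:
    """Return a Markdown pipe table; the first row is treated as the header."""
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    width = max(len(row) for row in parser.rows)
    # Pad short rows so every line has the same number of columns.
    rows = [row + [""] * (width - len(row)) for row in parser.rows]
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)
```

Real filings use merged cells, nested tables, and layout-only columns, which this sketch does not handle.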

What We Support

| Document Type              | Notes                                 |
|----------------------------|---------------------------------------|
| 10-K/Q Filings             | Full section extraction (ITEM 1-16)   |
| Financial Statements       | Tables preserved in Markdown          |
| Notes to Financials        | Automatic table unwrapping            |
| 8-K Press Releases         | Clean prose extraction                |
| Proxy Statements (DEF 14A) | Executive compensation, governance    |
| Exhibits (Contracts)       | Merger agreements, material contracts |

Installation

pip install sec2md

Quickstart

import sec2md

# Convert any SEC filing to clean Markdown
md = sec2md.convert_to_markdown(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    user_agent="Your Name <you@example.com>"
)

Input: Messy SEC HTML with XBRL tags, nested tables, inline styles

Output: Clean, structured Markdown ready for LLMs

## ITEM 1. Business

Apple Inc. designs, manufactures, and markets smartphones, personal computers,
tablets, wearables, and accessories worldwide...

### Products

| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone           | $200,583          |
| Mac              | $29,357           |
| iPad             | $28,300           |
...
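The `user_agent` argument in the Quickstart matters because the SEC asks automated clients to identify themselves in the `User-Agent` header. How sec2md sends that header internally is not shown here. As a rough illustration with the standard library only, the sketch below attaches it to a request object without sending anything. The function name `build_edgar_request` is hypothetical.

```python
from urllib.request import Request


def build_edgar_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a request that identifies the caller to EDGAR."""
    if not user_agent.strip():
        raise ValueError("EDGAR requests should carry a descriptive User-Agent")
    return Request(url, headers={"User-Agent": user_agent})


request = build_edgar_request(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    "Your Name <you@example.com>",
)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```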

Core Features

1️⃣ Section Extraction

Extract specific sections from 10-K/10-Q filings with type-safe enums:

import sec2md
from sec2md import Item10K

# `html` is the raw filing HTML as a string
pages = sec2md.convert_to_markdown(html, return_pages=True)
sections = sec2md.extract_sections(pages, filing_type="10-K")

# Get Risk Factors section
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)
print(risk.markdown())  # Just the risk factors text
print(risk.page_range)   # (12, 28) - page citations
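For intuition about what section detection involves, here is a simplified sketch that splits Markdown on `ITEM` headings with a regular expression. It is not how sec2md detects sections. The pattern, the function name `split_items`, and the assumed heading format (`## ITEM 1A. Risk Factors`) are illustrative assumptions.

```python
import re

# Matches heading lines such as "## ITEM 1A. Risk Factors" (assumed format).
ITEM_HEADING = re.compile(
    r"^#{1,6}[ \t]*ITEM[ \t]+(\d{1,2}[A-C]?)\.?[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def split_items(markdown: str) -> dict[str, str]:
    """Map each item number (e.g. "1A") to the text from its heading to the next one."""
    matches = list(ITEM_HEADING.finditer(markdown))
    sections = {}
    for i, match in enumerate(matches):
        # A section ends where the next ITEM heading starts, or at end of document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections[match.group(1).upper()] = markdown[match.start():end].strip()
    return sections
```

If an item number appears more than once (for example, in a table of contents and again in the body), the later occurrence overwrites the earlier one in this sketch. A production parser would need to tell these apart.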

2️⃣ Page-Aware Chunking

Intelligent chunking that preserves page numbers for citations:

chunks = sec2md.chunk_pages(pages, chunk_size=512)

for chunk in chunks:
    print(f"Page {chunk.page}: {chunk.content[:100]}...")
    # Use for embeddings, citations, or retrieval
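The idea behind page-aware chunking can be sketched independently of the library: split each page on its own, so no chunk spans two pages and every chunk keeps a single page number. The names `PageChunk` and `chunk_by_page` below are hypothetical, and the sketch measures size in characters rather than whatever unit `chunk_size` uses in sec2md.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    page: int     # page the text came from
    content: str  # chunk text


def chunk_by_page(pages: list[tuple[int, str]], max_chars: int = 2000) -> list[PageChunk]:
    """Split each (page_number, text) pair into paragraph-aligned chunks.

    Chunks never cross a page boundary. A single paragraph longer than
    max_chars is kept whole rather than cut mid-sentence.
    """
    chunks = []
    for page_number, text in pages:
        buffer = ""
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if buffer and len(candidate) > max_chars:
                # Adding this paragraph would overflow: emit what we have so far.
                chunks.append(PageChunk(page_number, buffer))
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            chunks.append(PageChunk(page_number, buffer))
    return chunks
```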

3️⃣ RAG-Optimized Headers

Boost retrieval quality by adding metadata to chunk embeddings:

header = """# Apple Inc. (AAPL)
Form 10-K | FY 2024 | Risk Factors"""

chunks = sec2md.chunk_section(risk, header=header)

# chunk.embedding_text includes header for better embeddings
# chunk.content contains only the actual filing text
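The split between `embedding_text` and `content` described in the comments above can be pictured with a small data class. This is a sketch of the idea only, not sec2md's chunk type. `HeaderedChunk` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class HeaderedChunk:
    content: str      # filing text only, shown to users and cited
    header: str = ""  # metadata that is prepended only for embedding

    @property
    def embedding_text(self) -> str:
        """Text to send to the embedding model: header plus content."""
        return f"{self.header}\n\n{self.content}" if self.header else self.content


chunk = HeaderedChunk(
    content="The Company's business can be affected by macroeconomic conditions.",
    header="# Apple Inc. (AAPL)\nForm 10-K | FY 2024 | Risk Factors",
)
print(chunk.embedding_text)
```

Keeping the header out of `content` means the text returned to a user or quoted in a citation stays identical to the filing.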

4️⃣ EdgarTools Integration

Works seamlessly with edgartools:

import sec2md
from edgar import Company

company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()

md = sec2md.convert_to_markdown(filing.html())

Why Choose sec2md?

Just Parse It

Most libraries force you to choose between speed and accuracy. sec2md gives you both:

  • 🚀 Fast - Processes 200-page filings in seconds
  • 🎯 Accurate - Purpose-built for SEC document structure
  • 🔧 Simple - One function call, zero configuration

Built for Agentic RAG

Don't rebuild what we've already solved:

  • Page tracking - Cite sources with exact page numbers
  • Section detection - Extract just what you need (Risk Factors, MD&A)
  • Smart chunking - Respects table boundaries, preserves context
  • Metadata headers - Boost embedding quality 2-3x with contextual headers
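As an illustration of what respecting table boundaries can mean, the sketch below treats each run of consecutive pipe-table lines as one unit that is never split. It is a simplified stand-in, not sec2md's chunker: `split_keeping_tables` is a hypothetical name, sizes are approximate character counts, and any line starting with `|` is assumed to belong to a table.

```python
def split_keeping_tables(markdown: str, max_chars: int = 1000) -> list[str]:
    """Pack Markdown into chunks of roughly max_chars without cutting a pipe table."""
    # Step 1: group lines into blocks. Consecutive table lines form one block;
    # every other non-empty line is its own block.
    blocks: list[str] = []
    in_table = False
    for line in markdown.splitlines():
        if not line.strip():
            in_table = False  # a blank line ends any open table
            continue
        is_table = line.lstrip().startswith("|")
        if is_table and in_table:
            blocks[-1] += "\n" + line
        else:
            blocks.append(line)
        in_table = is_table

    # Step 2: pack blocks greedily. An oversized table becomes its own chunk
    # instead of being cut in the middle.
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        if current and size + len(block) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```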

Documentation

📚 Full documentation: sec2md.readthedocs.io


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT © 2025
