Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG

sec2md

Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Ready for production.

Figure: Before and after comparison of the Apple 10-K cover page. Raw SEC HTML (left) vs. clean Markdown (right).


The Problem

RAG pipelines fail on SEC filings because standard parsers destroy document structure.

When you flatten a 200-page 10-K to plain text:

  • Tables break - Complex financial statements become misaligned text
  • Pages are lost - You can't cite sources or trace answers back
  • Sections merge - Risk Factors and MD&A become indistinguishable
  • Formatting is stripped - Headers, bold text, and lists (cues for LLM reasoning) are gone
  • Retrieval fails - Chunks without structure return the wrong context

Your RAG system is only as good as your data. Garbage in, garbage out.

The Solution

sec2md rebuilds SEC filings as clean, semantic Markdown designed for AI systems:

  • Preserves structure - Headers (#), paragraphs, lists maintained
  • Converts tables - Complex HTML tables → clean Markdown pipes
  • Strips noise - XBRL tags, inline styles, and boilerplate removed
  • Tracks pages - Original pagination preserved for citation
  • Detects sections - Auto-extract Risk Factors, MD&A, Business sections
  • Chunks intelligently - Page-aware splitting with metadata headers

What We Support

| Document Type | Status | Notes |
|---------------|--------|-------|
| 10-K/Q Filings | | Full section extraction (ITEM 1-16) |
| Financial Statements | | Tables preserved in Markdown |
| Notes to Financials | | Automatic table unwrapping |
| 8-K Press Releases | | Clean prose extraction |
| Proxy Statements (DEF 14A) | | Executive compensation, governance |
| Exhibits (Contracts) | | Merger agreements, material contracts |

Installation

pip install sec2md

Quickstart

import sec2md

# Convert any SEC filing to clean Markdown
md = sec2md.convert_to_markdown(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    user_agent="Your Name <you@example.com>"
)

Input: Messy SEC HTML with XBRL tags, nested tables, inline styles

Output: Clean, structured Markdown ready for LLMs

## ITEM 1. Business

Apple Inc. designs, manufactures, and markets smartphones, personal computers,
tablets, wearables, and accessories worldwide...

### Products

| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone           | $200,583          |
| Mac              | $29,357           |
| iPad             | $28,300           |
...
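
The quickstart passes `user_agent` because SEC EDGAR asks automated clients to identify themselves in the User-Agent request header. If you would rather download the HTML yourself and hand the resulting string to `convert_to_markdown`, a standard-library sketch might look like the following. The helpers `build_edgar_request` and `fetch_filing_html` are hypothetical names for illustration and are not part of sec2md.

```python
import urllib.request


def build_edgar_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that identifies the caller to EDGAR."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_filing_html(url: str, user_agent: str, timeout: float = 30.0) -> str:
    """Download the filing and return its HTML as text (performs network I/O)."""
    request = build_edgar_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch_filing_html does.
request = build_edgar_request(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    "Your Name <you@example.com>",
)
print(request.get_header("User-agent"))
```

Keeping the request construction separate from the download makes the header logic easy to check without hitting sec.gov.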

Core Features

1️⃣ Section Extraction

Extract specific sections from 10-K/10-Q filings with type-safe enums:

import sec2md
from sec2md import Item10K

# html holds the raw filing HTML as a string, e.g. downloaded from EDGAR
pages = sec2md.convert_to_markdown(html, return_pages=True)
sections = sec2md.extract_sections(pages, filing_type="10-K")

# Get Risk Factors section
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)
print(risk.markdown())  # Just the risk factors text
print(risk.page_range)   # e.g. (12, 28) - page citations
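
For intuition about how a page range like the one above can be derived, here is a rough sketch that scans per-page Markdown for `ITEM` headings and records the pages each section spans. It only illustrates the idea and is not how sec2md detects sections: `ITEM_HEADING`, `SectionSpan`, and `find_item_spans` are hypothetical, and the regex assumes headings formatted like `## ITEM 1A. Risk Factors`.

```python
import re
from dataclasses import dataclass

# Matches headings such as "## ITEM 1A. Risk Factors" at the start of a line.
ITEM_HEADING = re.compile(
    r"^#{1,6}\s*ITEM\s+(\d{1,2}[A-C]?)\.", re.IGNORECASE | re.MULTILINE
)


@dataclass
class SectionSpan:
    item: str        # e.g. "1A"
    first_page: int
    last_page: int


def find_item_spans(pages: list[tuple[int, str]]) -> list[SectionSpan]:
    """pages is a list of (page_number, markdown) pairs in document order."""
    spans: list[SectionSpan] = []
    for number, text in pages:
        matches = list(ITEM_HEADING.finditer(text))
        # The open section continues onto this page unless a new heading
        # starts at the very top of the page.
        starts_with_heading = bool(matches) and not text[: matches[0].start()].strip()
        if spans and not starts_with_heading:
            spans[-1].last_page = number
        for match in matches:
            spans.append(SectionSpan(match.group(1).upper(), number, number))
    return spans


demo_pages = [
    (1, "## ITEM 1. Business\n\nApple Inc. designs..."),
    (2, "More business text.\n\n## ITEM 1A. Risk Factors\n\nRisk text."),
    (3, "More risk text."),
    (4, "## ITEM 1B. Unresolved Staff Comments\n\nNone."),
]
spans = find_item_spans(demo_pages)
for span in spans:
    print(span.item, (span.first_page, span.last_page))
```

Real filings also repeat item names in the table of contents and in cross-references, so a production detector needs more than a single regex.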

2️⃣ Page-Aware Chunking

Intelligent chunking that preserves page numbers for citations:

chunks = sec2md.chunk_pages(pages, chunk_size=512)

for chunk in chunks:
    print(f"Page {chunk.page}: {chunk.content[:100]}...")
    # Use for embeddings, citations, or retrieval
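
The idea behind page-aware chunking can be sketched in a few lines: split each page on its own so that every chunk maps to exactly one page number. The sketch below is an assumption-laden illustration, not sec2md's algorithm. `PageChunk` and `chunk_pages_by_words` are hypothetical names, and it counts whitespace-separated words as a stand-in for whatever unit `chunk_size` uses.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    page: int      # page the chunk came from, kept for citations
    content: str


def chunk_pages_by_words(pages, max_words=512):
    """Split (page_number, text) pairs into chunks that never cross a page.

    Paragraphs (separated by blank lines) are packed greedily. A paragraph
    longer than max_words becomes its own oversized chunk rather than being cut.
    """
    chunks = []
    for number, text in pages:
        current, count = [], 0
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            words = len(paragraph.split())
            # Flush the running chunk before it would exceed the budget.
            if current and count + words > max_words:
                chunks.append(PageChunk(number, "\n\n".join(current)))
                current, count = [], 0
            current.append(paragraph)
            count += words
        if current:
            chunks.append(PageChunk(number, "\n\n".join(current)))
    return chunks


demo_pages = [
    (12, "one two three\n\nfour five\n\nsix seven eight nine"),
    (13, "ten eleven"),
]
demo_chunks = chunk_pages_by_words(demo_pages, max_words=5)
for chunk in demo_chunks:
    print(f"Page {chunk.page}: {chunk.content!r}")
```

Splitting on paragraph boundaries keeps a Markdown table in one piece as long as it contains no blank lines.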

3️⃣ RAG-Optimized Headers

Boost retrieval quality by adding metadata to chunk embeddings:

header = """# Apple Inc. (AAPL)
Form 10-K | FY 2024 | Risk Factors"""

chunks = sec2md.chunk_section(risk, header=header)

# chunk.embedding_text includes header for better embeddings
# chunk.content contains only the actual filing text
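
The split between `embedding_text` and `content` described in the comments above can be pictured as a small data structure: the header is prepended only for the text that gets embedded. This is a sketch of the pattern under that assumption, with a hypothetical `HeaderedChunk` class, and it does not reflect sec2md's actual chunk type.

```python
from dataclasses import dataclass


@dataclass
class HeaderedChunk:
    content: str       # filing text only, shown to the user or the LLM
    header: str = ""   # document-level context: company, form, section

    @property
    def embedding_text(self) -> str:
        """Text sent to the embedding model: header first, then content."""
        if not self.header:
            return self.content
        return f"{self.header}\n\n{self.content}"


demo_header = "# Apple Inc. (AAPL)\nForm 10-K | FY 2024 | Risk Factors"
demo_chunk = HeaderedChunk(
    content="The Company's business can be affected by macroeconomic conditions.",
    header=demo_header,
)
print(demo_chunk.embedding_text)
```

Embedding `embedding_text` while storing and displaying `content` keeps the extra context out of the citations shown to users.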

4️⃣ EdgarTools Integration

Works seamlessly with edgartools:

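import sec2md
from edgar import Company, set_identity

# edgartools needs an identity for EDGAR requests
set_identity("Your Name you@example.com")

company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()

md = sec2md.convert_to_markdown(filing.html())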

Why Choose sec2md?

Just Parse It

Most libraries force you to choose between speed and accuracy. sec2md gives you both:

  • 🚀 Fast - Processes 200-page filings in seconds
  • 🎯 Accurate - Purpose-built for SEC document structure
  • 🔧 Simple - One function call, zero configuration

Built for Agentic RAG

Don't rebuild what we've already solved:

  • Page tracking - Cite sources with exact page numbers
  • Section detection - Extract just what you need (Risk Factors, MD&A)
  • Smart chunking - Respects table boundaries, preserves context
  • Metadata headers - Boost embedding quality 2-3x with contextual headers

Documentation

📚 Full documentation: sec2md.readthedocs.io


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT © 2025
