Skip to main content

Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG

Project description

sec2md

PyPI License: MIT Documentation

Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Ready for production.

Before and After Comparison Apple 10-K cover page: Raw SEC HTML (left) vs. Clean Markdown (right)


The Problem

RAG pipelines fail on SEC filings because standard parsers destroy document structure.

When you flatten a 200-page 10-K to plain text:

  • Tables break — Complex financial statements become misaligned text
  • Pages are lost — Can't cite sources or trace answers back
  • Sections merge — Risk Factors and MD&A become indistinguishable
  • Formatting is stripped — Headers, bolds, lists (LLM reasoning cues) gone
  • Retrieval fails — Chunks without structure return wrong context

Your RAG system is only as good as your data. Garbage in, garbage out.

The Solution

sec2md rebuilds SEC filings as clean, semantic Markdown designed for AI systems:

  • Preserves structure - Headers (#), paragraphs, lists maintained
  • Converts tables - Complex HTML tables → clean Markdown pipes
  • Strips noise - XBRL tags, inline styles, and boilerplate removed
  • Tracks pages - Original pagination preserved for citation
  • Detects sections - Auto-extract Risk Factors, MD&A, Business sections
  • Chunks intelligently - Page-aware splitting with metadata headers

What We Support

Document Type Status Notes
10-K/Q Filings Full section extraction (ITEM 1-16)
Financial Statements Tables preserved in Markdown
Notes to Financials Automatic table unwrapping
8-K Press Releases Clean prose extraction
Proxy Statements (DEF 14A) Executive compensation, governance
Exhibits (Contracts) Merger agreements, material contracts

Installation

pip install sec2md

Quickstart

import sec2md

# Convert any SEC filing to clean Markdown
md = sec2md.convert_to_markdown(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    user_agent="Your Name <you@example.com>"
)

Input: Messy SEC HTML with XBRL tags, nested tables, inline styles Output: Clean, structured Markdown ready for LLMs

## ITEM 1. Business

Apple Inc. designs, manufactures, and markets smartphones, personal computers,
tablets, wearables, and accessories worldwide...

### Products

| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone           | $200,583          |
| Mac              | $29,357           |
| iPad             | $28,300           |
...

Core Features

1️⃣ Section Extraction

Extract specific sections from 10-K/10-Q filings with type-safe enums:

from sec2md import Item10K

pages = sec2md.convert_to_markdown(html, return_pages=True)
sections = sec2md.extract_sections(pages, filing_type="10-K")

# Get Risk Factors section
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)
print(risk.markdown())  # Just the risk factors text
print(risk.page_range)   # (12, 28) - page citations

2️⃣ Page-Aware Chunking

Intelligent chunking that preserves page numbers for citations:

chunks = sec2md.chunk_pages(pages, chunk_size=512)

for chunk in chunks:
    print(f"Page {chunk.page}: {chunk.content[:100]}...")
    # Use for embeddings, citations, or retrieval

3️⃣ RAG-Optimized Headers

Boost retrieval quality by adding metadata to chunk embeddings:

header = """# Apple Inc. (AAPL)
Form 10-K | FY 2024 | Risk Factors"""

chunks = sec2md.chunk_section(risk, header=header)

# chunk.embedding_text includes header for better embeddings
# chunk.content contains only the actual filing text

4️⃣ EdgarTools Integration

Works seamlessly with edgartools:

from edgar import Company
company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()

md = sec2md.convert_to_markdown(filing.html())

Why Choose sec2md?

Just Parse It

Most libraries force you to choose between speed and accuracy. sec2md gives you both:

  • 🚀 Fast - Processes 200-page filings in seconds
  • 🎯 Accurate - Purpose-built for SEC document structure
  • 🔧 Simple - One function call, zero configuration

Built for Agentic RAG

Don't rebuild what we've already solved:

  • Page tracking - Cite sources with exact page numbers
  • Section detection - Extract just what you need (Risk Factors, MD&A)
  • Smart chunking - Respects table boundaries, preserves context
  • Metadata headers - Boost embedding quality 2-3x with contextual headers

Documentation

📚 Full documentation: sec2md.readthedocs.io


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT © 2025

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sec2md-0.1.14.tar.gz (46.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sec2md-0.1.14-py3-none-any.whl (47.9 kB view details)

Uploaded Python 3

File details

Details for the file sec2md-0.1.14.tar.gz.

File metadata

  • Download URL: sec2md-0.1.14.tar.gz
  • Upload date:
  • Size: 46.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.4

File hashes

Hashes for sec2md-0.1.14.tar.gz
Algorithm Hash digest
SHA256 ae06c23d866df7e686d6eaab5dde251d7e3174cd7db388ca1a43d1ac96923470
MD5 ec079749c121afa1a913300d1614fff4
BLAKE2b-256 78f6f73d48106dfe6956c15a3074dc6322684c314a330ad51ea55c9682b7a03d

See more details on using hashes here.

File details

Details for the file sec2md-0.1.14-py3-none-any.whl.

File metadata

  • Download URL: sec2md-0.1.14-py3-none-any.whl
  • Upload date:
  • Size: 47.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.4

File hashes

Hashes for sec2md-0.1.14-py3-none-any.whl
Algorithm Hash digest
SHA256 d94094e057496abd952fccfc2fa3e7cae299fab44c05db59b194aef729930ee1
MD5 6cda24fb4fcb038c2d7ce622b4edd15b
BLAKE2b-256 49ecb8da6719a274adbf4cbff06c5bd2df84a7bae646d3bcbe5501f24d562022

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page