Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG

sec2md

Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Ready for production.

Figure: Before and after comparison of the Apple 10-K cover page, raw SEC HTML (left) vs. clean Markdown (right).


The Problem

RAG pipelines fail on SEC filings because standard parsers destroy document structure.

When you flatten a 200-page 10-K to plain text:

  • Tables break - Complex financial statements become misaligned text
  • Pages are lost - You can't cite sources or trace answers back to a page
  • Sections merge - Risk Factors and MD&A become indistinguishable
  • Formatting is stripped - Headers, bold text, and lists (structural cues for LLM reasoning) are gone
  • Retrieval fails - Chunks without structure return the wrong context

Your RAG system is only as good as your data. Garbage in, garbage out.
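
To make the failure mode concrete, here is a minimal sketch of a naive HTML-to-text step using only the Python standard library. The HTML fragment is a hypothetical stand-in for a filing table, and `TextOnly` is an illustrative name, not part of sec2md.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and discards every tag, like a naive HTML-to-text step."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep non-empty text; all tag information is thrown away.
        if data.strip():
            self.parts.append(data.strip())


# Hypothetical fragment, loosely shaped like a revenue table in a filing.
html = (
    "<h2>Products</h2>"
    "<table>"
    "<tr><th>Category</th><th>Revenue</th></tr>"
    "<tr><td>iPhone</td><td>200,583</td></tr>"
    "<tr><td>Mac</td><td>29,357</td></tr>"
    "</table>"
)

parser = TextOnly()
parser.feed(html)
flat = " ".join(parser.parts)
print(flat)
# Products Category Revenue iPhone 200,583 Mac 29,357
```

The heading, the column labels, and the cell values end up in one undifferentiated run of words, so a downstream chunker can no longer tell which number belongs to which row.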

The Solution

sec2md rebuilds SEC filings as clean, semantic Markdown designed for AI systems:

  • Preserves structure - Headers (#), paragraphs, lists maintained
  • Converts tables - Complex HTML tables → clean Markdown pipes
  • Strips noise - XBRL tags, inline styles, and boilerplate removed
  • Tracks pages - Original pagination preserved for citation
  • Detects sections - Auto-extract Risk Factors, MD&A, Business sections
  • Chunks intelligently - Page-aware splitting with metadata headers
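
As a rough illustration of the table conversion idea, the sketch below turns a simple HTML table into Markdown pipes with the standard library. It is not sec2md's implementation: `TableToMarkdown` and `table_to_markdown` are hypothetical names, and the sketch assumes a flat table whose first row is the header, with no `colspan`, `rowspan`, or nested tables.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collects the cells of a simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_markdown(html: str) -> str:
    """Render the first row as the header, the rest as body rows."""
    parser = TableToMarkdown()
    parser.feed(html)
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


sample = (
    "<table>"
    "<tr><th>Product Category</th><th>Revenue (millions)</th></tr>"
    "<tr><td>iPhone</td><td>$200,583</td></tr>"
    "<tr><td>Mac</td><td>$29,357</td></tr>"
    "</table>"
)
markdown_table = table_to_markdown(sample)
print(markdown_table)
```

Real filing tables are messier (spacer columns, split currency symbols, merged cells), which is the part a purpose-built converter has to handle.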

What We Support

| Document Type | Status | Notes |
|---|---|---|
| 10-K/Q Filings | | Full section extraction (ITEM 1-16) |
| Financial Statements | | Tables preserved in Markdown |
| Notes to Financials | | Automatic table unwrapping |
| 8-K Press Releases | | Clean prose extraction |
| Proxy Statements (DEF 14A) | | Executive compensation, governance |
| Exhibits (Contracts) | | Merger agreements, material contracts |

Installation

pip install sec2md

Quickstart

import sec2md

# Convert any SEC filing to clean Markdown
md = sec2md.convert_to_markdown(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    user_agent="Your Name <you@example.com>"
)

Input: Messy SEC HTML with XBRL tags, nested tables, inline styles

Output: Clean, structured Markdown ready for LLMs

## ITEM 1. Business

Apple Inc. designs, manufactures, and markets smartphones, personal computers,
tablets, wearables, and accessories worldwide...

### Products

| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone           | $200,583          |
| Mac              | $29,357           |
| iPad             | $28,300           |
...
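
The Quickstart passes a `user_agent` because SEC EDGAR expects automated requests to identify the requester. If you prefer to download the HTML yourself and hand the string to `convert_to_markdown`, a minimal standard-library sketch might look like the following. The helper names `build_edgar_request` and `fetch_filing_html` are hypothetical and not part of sec2md.

```python
import urllib.request


def build_edgar_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request that declares who is calling, without sending it."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_filing_html(url: str, user_agent: str, timeout: float = 30.0) -> str:
    """Download a filing and return its HTML as text (performs a network call)."""
    request = build_edgar_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request is side-effect free, so it is easy to inspect.
req = build_edgar_request(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    "Your Name <you@example.com>",
)
print(req.get_header("User-agent"))
```

Separating request construction from the download keeps the identifying header in one place and lets you add retries or rate limiting around `fetch_filing_html` later.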

Core Features

1️⃣ Section Extraction

Extract specific sections from 10-K/10-Q filings with type-safe enums:

import sec2md
from sec2md import Item10K

# html: raw HTML of a 10-K filing (for example, downloaded from EDGAR)
pages = sec2md.convert_to_markdown(html, return_pages=True)
sections = sec2md.extract_sections(pages, filing_type="10-K")

# Get Risk Factors section
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)
print(risk.markdown())  # Just the risk factors text
print(risk.page_range)   # (12, 28) - page citations
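
For intuition about how section detection with page ranges can work, here is a simplified sketch that scans page-level Markdown for `ITEM n.` headings. It is an assumption-laden illustration, not sec2md's algorithm: `ITEM_HEADING`, `SectionSpan`, and `find_section_spans` are hypothetical names, pages are modeled as `(page_number, text)` tuples, and the sketch ignores complications such as table-of-contents entries that repeat the item titles.

```python
import re
from dataclasses import dataclass

# Matches lines such as "## ITEM 1A. Risk Factors" or "Item 7. Management's ...".
ITEM_HEADING = re.compile(
    r"^#{0,6}[ \t]*ITEM[ \t]+(\d{1,2}[A-C]?)\.",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class SectionSpan:
    item: str          # for example "1A"
    page_range: tuple  # (first_page, last_page), inclusive


def find_section_spans(pages):
    """pages: list of (page_number, markdown_text) in document order."""
    starts = []  # (item, page_number) in order of appearance
    for number, text in pages:
        for match in ITEM_HEADING.finditer(text):
            starts.append((match.group(1).upper(), number))
    if not starts:
        return []
    last_page = pages[-1][0]
    spans = []
    for i, (item, first) in enumerate(starts):
        # A section runs through the page on which the next item heading
        # appears, since the two sections may share that page.
        end = starts[i + 1][1] if i + 1 < len(starts) else last_page
        spans.append(SectionSpan(item, (first, end)))
    return spans


example_pages = [
    (1, "## ITEM 1. Business\n\nApple Inc. designs, manufactures, and markets..."),
    (2, "More business description."),
    (3, "## ITEM 1A. Risk Factors\n\nThe Company's operations are subject to..."),
    (4, "More risk factors."),
]
spans = find_section_spans(example_pages)
for span in spans:
    print(span.item, span.page_range)
```

Keeping the page range next to each section is what makes page-level citations possible after retrieval.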

2️⃣ Page-Aware Chunking

Intelligent chunking that preserves page numbers for citations:

chunks = sec2md.chunk_pages(pages, chunk_size=512)

for chunk in chunks:
    print(f"Page {chunk.page}: {chunk.content[:100]}...")
    # Use for embeddings, citations, or retrieval
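
The idea behind page-aware chunking can be sketched in a few lines: pack paragraph-level blocks into chunks up to a size budget, but never let a chunk cross a page boundary, so every chunk maps to exactly one page. This is an illustrative approximation rather than sec2md's implementation. `Chunk` and `chunk_pages_sketch` are hypothetical names, pages are modeled as `(page_number, text)` tuples, and size is counted in whitespace-separated words instead of model tokens.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int
    content: str


def chunk_pages_sketch(pages, chunk_size=512):
    """Split each page into chunks of at most chunk_size words where possible."""
    chunks = []
    for number, text in pages:
        current, count = [], 0
        # Blank lines separate blocks (paragraphs, lists, whole tables).
        for block in text.split("\n\n"):
            block = block.strip()
            if not block:
                continue
            words = len(block.split())
            # Flush before the budget is exceeded; a block is never split,
            # so an oversized block (such as a table) stays in one chunk.
            if current and count + words > chunk_size:
                chunks.append(Chunk(number, "\n\n".join(current)))
                current, count = [], 0
            current.append(block)
            count += words
        # Flush at the page boundary so no chunk spans two pages.
        if current:
            chunks.append(Chunk(number, "\n\n".join(current)))
    return chunks


demo_pages = [(1, "a b c\n\nd e f\n\ng h"), (2, "i j")]
demo_chunks = chunk_pages_sketch(demo_pages, chunk_size=5)
for c in demo_chunks:
    print(c.page, repr(c.content))
```

Flushing at each page boundary trades slightly smaller chunks for an unambiguous page number on every chunk.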

3️⃣ RAG-Optimized Headers

Boost retrieval quality by adding metadata to chunk embeddings:

header = """# Apple Inc. (AAPL)
Form 10-K | FY 2024 | Risk Factors"""

chunks = sec2md.chunk_section(risk, header=header)

# chunk.embedding_text includes header for better embeddings
# chunk.content contains only the actual filing text
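
The split between `embedding_text` and `content` can be pictured with a small data class: the header is prepended only to the text sent to the embedding model, while the stored content stays untouched for display and citation. The class below is a hypothetical sketch of that pattern, not sec2md's actual chunk type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeaderedChunk:
    content: str                  # the filing text, unchanged
    header: Optional[str] = None  # optional metadata header

    @property
    def embedding_text(self) -> str:
        """Text to embed: the header (if any) followed by the content."""
        if not self.header:
            return self.content
        return f"{self.header}\n\n{self.content}"


demo = HeaderedChunk(
    content="The Company's business can be affected by macroeconomic conditions.",
    header="# Apple Inc. (AAPL)\nForm 10-K | FY 2024 | Risk Factors",
)
print(demo.embedding_text)
```

Embedding `embedding_text` but returning `content` to the LLM avoids repeating the same header in every retrieved passage.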

4️⃣ EdgarTools Integration

Works seamlessly with edgartools:

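import sec2md
from edgar import Company, set_identity

set_identity("Your Name you@example.com")  # edgartools needs a contact identity for SEC requests
company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()

md = sec2md.convert_to_markdown(filing.html())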

Why Choose sec2md?

Just Parse It

Most libraries force you to choose between speed and accuracy. sec2md gives you both:

  • 🚀 Fast - Processes 200-page filings in seconds
  • 🎯 Accurate - Purpose-built for SEC document structure
  • 🔧 Simple - One function call, zero configuration

Built for Agentic RAG

Don't rebuild what we've already solved:

  • Page tracking - Cite sources with exact page numbers
  • Section detection - Extract just what you need (Risk Factors, MD&A)
  • Smart chunking - Respects table boundaries, preserves context
  • Metadata headers - Boost embedding quality 2-3x with contextual headers

Documentation

📚 Full documentation: sec2md.readthedocs.io


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT © 2025
