Skip to main content

Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG

Project description

sec2md

PyPI License: MIT Documentation

Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Ready for production.

Before and After Comparison Apple 10-K cover page: Raw SEC HTML (left) vs. Clean Markdown (right)


The Problem

RAG pipelines fail on SEC filings because standard parsers destroy document structure.

When you flatten a 200-page 10-K to plain text:

  • Tables break — Complex financial statements become misaligned text
  • Pages are lost — Can't cite sources or trace answers back
  • Sections merge — Risk Factors and MD&A become indistinguishable
  • Formatting is stripped — Headers, bolds, lists (LLM reasoning cues) gone
  • Retrieval fails — Chunks without structure return wrong context

Your RAG system is only as good as your data. Garbage in, garbage out.

The Solution

sec2md rebuilds SEC filings as clean, semantic Markdown designed for AI systems:

  • Preserves structure - Headers (#), paragraphs, lists maintained
  • Converts tables - Complex HTML tables → clean Markdown pipes
  • Strips noise - XBRL tags, inline styles, and boilerplate removed
  • Tracks pages - Original pagination preserved for citation
  • Detects sections - Auto-extract Risk Factors, MD&A, Business sections
  • Chunks intelligently - Page-aware splitting with metadata headers

What We Support

Document Type Status Notes
10-K/Q Filings Full section extraction (ITEM 1-16)
Financial Statements Tables preserved in Markdown
Notes to Financials Automatic table unwrapping
8-K Press Releases Clean prose extraction
Proxy Statements (DEF 14A) Executive compensation, governance
Exhibits (Contracts) Merger agreements, material contracts

Installation

pip install sec2md

Quickstart

import sec2md

# Convert any SEC filing to clean Markdown
md = sec2md.convert_to_markdown(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    user_agent="Your Name <you@example.com>"
)

Input: Messy SEC HTML with XBRL tags, nested tables, inline styles Output: Clean, structured Markdown ready for LLMs

## ITEM 1. Business

Apple Inc. designs, manufactures, and markets smartphones, personal computers,
tablets, wearables, and accessories worldwide...

### Products

| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone           | $200,583          |
| Mac              | $29,357           |
| iPad             | $28,300           |
...

Core Features

1️⃣ Section Extraction

Extract specific sections from 10-K/10-Q filings with type-safe enums:

from sec2md import Item10K

pages = sec2md.convert_to_markdown(html, return_pages=True)
sections = sec2md.extract_sections(pages, filing_type="10-K")

# Get Risk Factors section
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)
print(risk.markdown())  # Just the risk factors text
print(risk.page_range)   # (12, 28) - page citations

2️⃣ Page-Aware Chunking

Intelligent chunking that preserves page numbers for citations:

chunks = sec2md.chunk_pages(pages, chunk_size=512)

for chunk in chunks:
    print(f"Page {chunk.page}: {chunk.content[:100]}...")
    # Use for embeddings, citations, or retrieval

3️⃣ RAG-Optimized Headers

Boost retrieval quality by adding metadata to chunk embeddings:

header = """# Apple Inc. (AAPL)
Form 10-K | FY 2024 | Risk Factors"""

chunks = sec2md.chunk_section(risk, header=header)

# chunk.embedding_text includes header for better embeddings
# chunk.content contains only the actual filing text

4️⃣ EdgarTools Integration

Works seamlessly with edgartools:

from edgar import Company
company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()

md = sec2md.convert_to_markdown(filing.html())

Why Choose sec2md?

Just Parse It

Most libraries force you to choose between speed and accuracy. sec2md gives you both:

  • 🚀 Fast - Processes 200-page filings in seconds
  • 🎯 Accurate - Purpose-built for SEC document structure
  • 🔧 Simple - One function call, zero configuration

Built for Agentic RAG

Don't rebuild what we've already solved:

  • Page tracking - Cite sources with exact page numbers
  • Section detection - Extract just what you need (Risk Factors, MD&A)
  • Smart chunking - Respects table boundaries, preserves context
  • Metadata headers - Boost embedding quality 2-3x with contextual headers

Documentation

📚 Full documentation: sec2md.readthedocs.io


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT © 2025

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sec2md-0.1.8.tar.gz (48.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sec2md-0.1.8-py3-none-any.whl (49.9 kB view details)

Uploaded Python 3

File details

Details for the file sec2md-0.1.8.tar.gz.

File metadata

  • Download URL: sec2md-0.1.8.tar.gz
  • Upload date:
  • Size: 48.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.4

File hashes

Hashes for sec2md-0.1.8.tar.gz
Algorithm Hash digest
SHA256 48a865d54f9d7de821b2c4b46eabdbeb4447d3e20a1f2ad68a914cf2a1a91313
MD5 1cf45a88732706a341aac47ad1e09f1e
BLAKE2b-256 477719405973152d6920b16bef3cabe2a2bedb2896db49bb2c7e3cc8a8ce8105

See more details on using hashes here.

File details

Details for the file sec2md-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: sec2md-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 49.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.4

File hashes

Hashes for sec2md-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 906a45b3261cc6181b027876d6d3fdd016462e986c3d296b76c5b7f172c8e1e0
MD5 2e9c22d28906e903e4f73bbc45a7d9b7
BLAKE2b-256 2295750c8ea18afafabd0d6855cb44925dff6c04c6c00716378c97c8c4393565

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page