Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG
sec2md
Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Ready for production.
Apple 10-K cover page: Raw SEC HTML (left) vs. Clean Markdown (right)
The Problem
RAG pipelines fail on SEC filings because standard parsers destroy document structure.
When you flatten a 200-page 10-K to plain text:
- ❌ Tables break — Complex financial statements become misaligned text
- ❌ Pages are lost — Can't cite sources or trace answers back
- ❌ Sections merge — Risk Factors and MD&A become indistinguishable
- ❌ Formatting is stripped — Headers, bolds, lists (LLM reasoning cues) gone
- ❌ Retrieval fails — Chunks without structure return wrong context
Your RAG system is only as good as your data. Garbage in, garbage out.
The Solution
sec2md rebuilds SEC filings as clean, semantic Markdown designed for AI systems:
- ✅ Preserves structure - Headers (`#`), paragraphs, lists maintained
- ✅ Converts tables - Complex HTML tables → clean Markdown pipes
- ✅ Strips noise - XBRL tags, inline styles, and boilerplate removed
- ✅ Tracks pages - Original pagination preserved for citation
- ✅ Detects sections - Auto-extract Risk Factors, MD&A, Business sections
- ✅ Chunks intelligently - Page-aware splitting with metadata headers
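To make the "HTML tables → Markdown pipes" idea concrete, here is a minimal standard-library sketch. It is not sec2md's implementation: the `TableToMarkdown` class and `html_table_to_markdown` function are hypothetical names, and the sketch assumes a single simple table with no `rowspan`/`colspan`, which real filings often violate.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the cells of one simple HTML table (no rowspan/colspan)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell fits on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html: str) -> str:
    """Render the parsed rows as a pipe table; the first row is the header."""
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    width = max(len(row) for row in parser.rows)
    padded = [row + [""] * (width - len(row)) for row in parser.rows]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


sample = """
<table>
  <tr><th>Product Category</th><th>Revenue (millions)</th></tr>
  <tr><td><span style="font-weight:bold">iPhone</span></td><td>$200,583</td></tr>
  <tr><td>Mac</td><td>$29,357</td></tr>
</table>
"""
markdown_table = html_table_to_markdown(sample)
print(markdown_table)
```

Inline styling such as the `<span>` wrapper disappears because only cell text is collected, which is the "strips noise" behavior in miniature.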
What We Support
| Document Type | Status | Notes |
|---|---|---|
| 10-K/Q Filings | ✅ | Full section extraction (ITEM 1-16) |
| Financial Statements | ✅ | Tables preserved in Markdown |
| Notes to Financials | ✅ | Automatic table unwrapping |
| 8-K Press Releases | ✅ | Clean prose extraction |
| Proxy Statements (DEF 14A) | ✅ | Executive compensation, governance |
| Exhibits (Contracts) | ✅ | Merger agreements, material contracts |
Installation
pip install sec2md
Quickstart
import sec2md
# Convert any SEC filing to clean Markdown
md = sec2md.convert_to_markdown(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    user_agent="Your Name <you@example.com>"
)
Input: Messy SEC HTML with XBRL tags, nested tables, inline styles
Output: Clean, structured Markdown ready for LLMs
## ITEM 1. Business
Apple Inc. designs, manufactures, and markets smartphones, personal computers,
tablets, wearables, and accessories worldwide...
### Products
| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone | $200,583 |
| Mac | $29,357 |
| iPad | $28,300 |
...
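The `user_agent` argument in the Quickstart matters because SEC asks automated EDGAR clients to identify themselves with a descriptive User-Agent. If you prefer to download the HTML yourself and pass the string to `convert_to_markdown`, a request might be prepared as in the sketch below. `build_edgar_request` is a hypothetical helper, not part of sec2md, and the request is built but deliberately not sent.

```python
from urllib.request import Request


def build_edgar_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request that identifies the caller."""
    if not user_agent.strip():
        raise ValueError("user_agent must not be empty")
    return Request(url, headers={"User-Agent": user_agent})


request = build_edgar_request(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    user_agent="Your Name <you@example.com>",
)
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would perform the download; the returned bytes can then be decoded and handed to `sec2md.convert_to_markdown`.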
Core Features
1️⃣ Section Extraction
Extract specific sections from 10-K/10-Q filings with type-safe enums:
import sec2md
from sec2md import Item10K
# html holds the raw filing HTML as a string
pages = sec2md.convert_to_markdown(html, return_pages=True)
sections = sec2md.extract_sections(pages, filing_type="10-K")
# Get Risk Factors section
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)
print(risk.markdown()) # Just the risk factors text
print(risk.page_range) # (12, 28) - page citations
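For intuition about what section extraction involves, the sketch below splits page-tagged Markdown at `ITEM` headings and records the page range of each section. It is an illustration under stated assumptions, not sec2md's algorithm: pages are modeled as plain `(page_number, markdown)` tuples, and `SectionSketch` and `split_into_items` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Matches headings such as "## ITEM 1A. Risk Factors" at the start of a line.
ITEM_HEADING = re.compile(
    r"^#{1,6}\s*ITEM\s+(\d+[A-C]?)\.", re.IGNORECASE | re.MULTILINE
)


@dataclass
class SectionSketch:
    item: str          # e.g. "1A"
    text: str          # section Markdown, heading included
    page_range: tuple  # (first_page, last_page)


def split_into_items(pages):
    """Group (page_number, markdown) tuples into sections keyed by ITEM number."""
    sections = []
    current = None  # section being built: item, parts, first, last
    for number, markdown in pages:
        position = 0
        # The trailing None flushes the text after the last heading on the page.
        for match in list(ITEM_HEADING.finditer(markdown)) + [None]:
            end = match.start() if match else len(markdown)
            fragment = markdown[position:end].strip()
            if fragment and current is not None:
                current["parts"].append(fragment)
                current["last"] = number
            if match:
                if current is not None:
                    sections.append(current)
                current = {
                    "item": match.group(1).upper(),
                    "parts": [],
                    "first": number,
                    "last": number,
                }
                position = match.start()
    if current is not None:
        sections.append(current)
    return [
        SectionSketch(s["item"], "\n\n".join(s["parts"]), (s["first"], s["last"]))
        for s in sections
    ]


sample_pages = [
    (1, "## ITEM 1. Business\nApple designs products."),
    (2, "More about the business.\n\n## ITEM 1A. Risk Factors\nDemand may fall."),
    (3, "Competition is intense."),
]
for section in split_into_items(sample_pages):
    print(section.item, section.page_range)
```

Text that appears before the first `ITEM` heading (such as a cover page) is dropped in this sketch; a production parser would also need to skip table-of-contents entries that repeat the same headings.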
2️⃣ Page-Aware Chunking
Intelligent chunking that preserves page numbers for citations:
chunks = sec2md.chunk_pages(pages, chunk_size=512)
for chunk in chunks:
    print(f"Page {chunk.page}: {chunk.content[:100]}...")
# Use for embeddings, citations, or retrieval
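The idea behind page-aware chunking can be sketched in a few lines: split each page on paragraph boundaries and never let a chunk cross a page, so every chunk carries exactly one page number. This is an assumption-laden illustration rather than sec2md's implementation. `ChunkSketch` and `chunk_pages_sketch` are hypothetical names, pages are plain `(page_number, markdown)` tuples, and `chunk_size` is measured in characters here purely for simplicity (the unit sec2md uses is not stated above).

```python
from dataclasses import dataclass


@dataclass
class ChunkSketch:
    page: int
    content: str


def chunk_pages_sketch(pages, chunk_size=512):
    """Split (page_number, markdown) tuples into chunks that never span pages."""
    chunks = []
    for number, markdown in pages:
        buffer = ""
        # Blank-line separated paragraphs are the smallest unit we split on.
        for paragraph in (p.strip() for p in markdown.split("\n\n")):
            if not paragraph:
                continue
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if buffer and len(candidate) > chunk_size:
                # Adding this paragraph would overflow: emit what we have.
                chunks.append(ChunkSketch(number, buffer))
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            chunks.append(ChunkSketch(number, buffer))
    return chunks


sample_pages = [
    (12, "Demand may fall.\n\nSupply chains may be disrupted."),
    (13, "Competition is intense."),
]
sketch_chunks = chunk_pages_sketch(sample_pages, chunk_size=30)
for chunk in sketch_chunks:
    print(f"Page {chunk.page}: {chunk.content}")
```

A paragraph longer than `chunk_size` is kept whole in this sketch, which is one simple way to avoid cutting a table in half.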
3️⃣ RAG-Optimized Headers
Boost retrieval quality by adding metadata to chunk embeddings:
header = """# Apple Inc. (AAPL)
Form 10-K | FY 2024 | Risk Factors"""
chunks = sec2md.chunk_section(risk, header=header)
# chunk.embedding_text includes header for better embeddings
# chunk.content contains only the actual filing text
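The split between `embedding_text` and `content` described in the comments above can be modeled with a small dataclass. This is only a sketch of the concept: `HeaderedChunk` is a hypothetical class, and how sec2md actually joins the header and the text is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HeaderedChunk:
    content: str  # filing text only, shown to the user or the LLM
    header: str   # contextual metadata, used only when embedding

    @property
    def embedding_text(self) -> str:
        """Header plus content, separated by a blank line (assumed format)."""
        return f"{self.header}\n\n{self.content}" if self.header else self.content


header = "# Apple Inc. (AAPL)\nForm 10-K | FY 2024 | Risk Factors"
headered = HeaderedChunk(content="Demand may fall.", header=header)

print(headered.embedding_text)  # what you would send to the embedding model
print(headered.content)         # what you would store and cite
```

Keeping the two strings separate means the header can steer the embedding toward the right company, form, and section without being repeated in the text returned to the reader.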
4️⃣ EdgarTools Integration
Works seamlessly with edgartools:
import sec2md
from edgar import Company
company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()
md = sec2md.convert_to_markdown(filing.html())
Why Choose sec2md?
Just Parse It
Most libraries force you to choose between speed and accuracy. sec2md gives you both:
- 🚀 Fast - Processes 200-page filings in seconds
- 🎯 Accurate - Purpose-built for SEC document structure
- 🔧 Simple - One function call, zero configuration
Built for Agentic RAG
Don't rebuild what we've already solved:
- ✅ Page tracking - Cite sources with exact page numbers
- ✅ Section detection - Extract just what you need (Risk Factors, MD&A)
- ✅ Smart chunking - Respects table boundaries, preserves context
- ✅ Metadata headers - Boost embedding quality 2-3x with contextual headers
Documentation
📚 Full documentation: sec2md.readthedocs.io
- Quickstart Guide - Get up and running in 3 minutes
- Convert Filings - Handle 10-Ks, exhibits, press releases
- Extract Sections - Pull specific ITEM sections
- Chunking for RAG - Page-aware chunking with contextual headers
- EdgarTools Integration - Automate filing downloads
- API Reference - Complete API docs
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
License
MIT © 2025