
Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG


sec2md


Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Traceable to the source.

[Figure: Before-and-after comparison of an Apple 10-K. Raw SEC HTML (left) vs. sec2md output (right).]


The Problem

SEC filings are the worst documents you'll ever feed to an LLM — 200 pages of nested HTML, XBRL tags, invisible elements, and tables-within-tables.

When you throw this at a standard parser:

  • Tables break — Financial statements become garbled text. Your model hallucinates numbers.
  • Pages vanish — Can't cite sources. Can't trace answers back. Compliance says no.
  • Sections blur — Risk Factors and MD&A become one wall of text. Retrieval pulls the wrong context.
  • Structure is lost — Headers, emphasis, lists — the cues LLMs use to reason — gone.

And even the converters that handle the HTML well still throw away provenance. You get clean text with no way to trace it back to where it came from in the original filing. For production RAG on regulated documents, that's a dealbreaker.

The Solution

import sec2md

# filing_html: the raw HTML of a single SEC filing, as a string
parser = sec2md.Parser(filing_html)
pages = parser.get_pages()

# 60 pages | 293 citable elements | 46,238 tokens
# Tables intact. Pages tracked. Sections detected. Every element traceable.

sec2md rebuilds SEC filings as clean, semantic Markdown — preserving the structure, tables, and pagination that make retrieval possible. But unlike generic converters, it also preserves the full citation chain from every piece of output back to the source HTML.


Traceability

Every paragraph, table, and heading gets a stable element ID that maps directly to a DOM node in the original filing HTML. From chunk to element to source — the chain is unbroken.

chunks = sec2md.chunk_pages(pages, chunk_size=512)

chunk = chunks[5]
print(chunk.element_ids)
# ['sec2md-p12-p0-a1b2c3d4', 'sec2md-p12-t0-e5f6a7b8', ...]
print(chunk.page_range)          # (12, 13)
print(chunk.display_page_range)  # (45, 46) — as printed in the filing

# Open the original filing in your browser — scrolls to the source, highlights in yellow
chunk.visualize(parser.html())

[Figure: chunk.visualize() opens the original filing HTML, scrolls to the chunk's source elements, and highlights them.]

Every Chunk carries page numbers (both sequential and the original display page from the filing footer), element IDs for citation, and a direct link back to the source HTML. Every Element can be visualized the same way:

element = chunk.elements[0]
element.visualize(parser.html())  # Highlights just this element
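
To turn those element IDs into something a downstream answer can cite, a minimal sketch follows. It assumes the ID layout is `sec2md-p<page>-<kind><index>-<8 hex chars>`, which is only inferred from the two example IDs above and is not a documented format; what the single-letter kind codes mean is likewise a guess. `parse_element_id` and `format_citation` are hypothetical helpers, not part of sec2md.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: layout inferred from the README's example IDs, e.g.
# "sec2md-p12-t0-e5f6a7b8" -> page 12, kind "t", index 0, digest "e5f6a7b8".
ELEMENT_ID = re.compile(
    r"^sec2md-p(?P<page>\d+)-(?P<kind>[a-z])(?P<index>\d+)-(?P<digest>[0-9a-f]{8})$"
)


@dataclass(frozen=True)
class ParsedId:
    page: int
    kind: str
    index: int
    digest: str


def parse_element_id(element_id: str) -> ParsedId:
    """Split an element ID into its assumed components."""
    match = ELEMENT_ID.match(element_id)
    if match is None:
        raise ValueError(f"unrecognized element ID: {element_id!r}")
    return ParsedId(
        page=int(match["page"]),
        kind=match["kind"],
        index=int(match["index"]),
        digest=match["digest"],
    )


def format_citation(element_ids, display_page_range):
    """Build a human-readable citation from printed page numbers and element IDs."""
    first, last = display_page_range
    pages = f"p. {first}" if first == last else f"pp. {first}-{last}"
    return f"{pages} [{', '.join(element_ids)}]"


ids = ["sec2md-p12-p0-a1b2c3d4", "sec2md-p12-t0-e5f6a7b8"]
parsed = [parse_element_id(i) for i in ids]
citation = format_citation(ids, (45, 46))
print(parsed[1])
print(citation)
```

With a real chunk, the same helper could be fed `chunk.element_ids` and `chunk.display_page_range`, so the citation shows the page number a reader would see printed in the filing.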

What You Can Do

Extract sections

Don't process 200 pages when you only need Risk Factors:

import sec2md
from sec2md import Item10K

sections = sec2md.extract_sections(pages, filing_type="10-K")
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)

print(risk.page_range)  # (7, 19)
print(risk.tokens)       # 11,474

Works across 10-K, 10-Q, 8-K, and 20-F.

| Filing Type | Section Extraction | Notes |
|---|---|---|
| 10-K | 18 items (ITEM 1-16) | Full PART/ITEM detection |
| 10-Q | 11 items (Parts I & II) | Including financial statements |
| 8-K | 41 items (1.01-9.01) | With exhibit parsing from 9.01 |
| 20-F | Items 1-19, 16A-16I | Foreign private issuers |
| DEF 14A, Exhibits | -- | Parsed as clean Markdown |
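
To give a feel for what ITEM detection involves, here is a deliberately naive sketch that scans page text for 10-K style item headings and records the pages each item spans. It is not sec2md's implementation: `ITEM_HEADING` and `find_item_spans` are hypothetical names, pages are assumed to be plain `(page_number, text)` pairs, and the sample text is placeholder content.

```python
import re

# Matches heading lines such as "ITEM 1A. RISK FACTORS" or "## Item 7. ...".
ITEM_HEADING = re.compile(r"^\s*#*\s*item\s+(\d{1,2}[a-c]?)\b", re.IGNORECASE)


def find_item_spans(pages):
    """Return {item: (first_page, last_page)} for a list of (page_number, text) pairs.

    An item is assumed to run until the page where the next item heading
    appears (inclusive, since two items can share a page).
    """
    starts = []  # (item, page_number) in document order
    for number, text in pages:
        for line in text.splitlines():
            match = ITEM_HEADING.match(line)
            if match:
                starts.append((match.group(1).upper(), number))

    spans = {}
    last_page = pages[-1][0] if pages else None
    for i, (item, first) in enumerate(starts):
        end = starts[i + 1][1] if i + 1 < len(starts) else last_page
        spans[item] = (first, end)
    return spans


sample_pages = [
    (1, "# ITEM 1. BUSINESS\nPlaceholder business text."),
    (2, "More placeholder business text."),
    (3, "# ITEM 1A. RISK FACTORS\nPlaceholder risk text."),
    (4, "More placeholder risk text."),
    (5, "# ITEM 1B. UNRESOLVED STAFF COMMENTS\nNone."),
]
spans = find_item_spans(sample_pages)
print(spans)
```

A pattern this simple also matches table-of-contents entries and cross-references that happen to start a line with "Item", and a repeated heading overwrites the earlier span. Real filings need more care than this sketch provides.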

Handle complex tables

SEC tables are notoriously complex — rowspans, colspans, merged cells, currency symbols in separate columns. Some filings don't even use <table> tags, building tables from absolutely-positioned CSS divs instead. sec2md handles both, and large tables are automatically split across chunks with headers preserved.
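
The idea of splitting a large table while preserving its header can be sketched as follows. This is an illustration only: it splits a Markdown table by row count rather than by token budget, `split_markdown_table` is a hypothetical helper, and it says nothing about how sec2md actually performs the split.

```python
def split_markdown_table(table: str, max_rows: int) -> list[str]:
    """Split a Markdown table into pieces of at most max_rows body rows,
    repeating the header row and separator row at the top of every piece."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")

    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, body = lines[:2], lines[2:]  # header row + |---| separator row
    if not body:
        return ["\n".join(header)]

    pieces = []
    for start in range(0, len(body), max_rows):
        pieces.append("\n".join(header + body[start:start + max_rows]))
    return pieces


# Placeholder table; the values carry no meaning.
table = """
| Line item | Year 1 | Year 2 |
|---|---|---|
| Row A | 1 | 2 |
| Row B | 3 | 4 |
| Row C | 5 | 6 |
| Row D | 7 | 8 |
| Row E | 9 | 10 |
"""

pieces = split_markdown_table(table, max_rows=2)
for piece in pieces:
    print(piece)
    print()
```

Repeating the header in each piece means a retrieved chunk still tells the model which column each number belongs to, even when the chunk starts mid-table.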

Works with edgartools

Pair with edgartools for end-to-end filing pipelines:

import sec2md
from edgar import Company

company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()
pages = sec2md.parse_filing(filing.html())
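
A pipeline like this usually ends by writing chunk records to an index. The sketch below shows one possible shape for that last step, serializing chunks to JSON Lines so the citation metadata travels with the text. `ChunkStub` is a stand-in that lets the example run without sec2md. Its `element_ids`, `page_range`, and `display_page_range` fields mirror the attributes shown earlier, while `text`, `filing_id`, and `to_jsonl` are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ChunkStub:
    """Stand-in for a chunk; `text` is a hypothetical field name."""
    text: str
    element_ids: list
    page_range: tuple
    display_page_range: tuple


def to_jsonl(chunks, filing_id: str) -> str:
    """Serialize chunks to JSON Lines, one record per chunk."""
    lines = []
    for position, chunk in enumerate(chunks):
        record = {
            "id": f"{filing_id}:{position}",
            "text": chunk.text,
            "element_ids": list(chunk.element_ids),
            "page_range": list(chunk.page_range),
            "display_page_range": list(chunk.display_page_range),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


chunks = [
    ChunkStub("Placeholder paragraph.", ["sec2md-p12-p0-a1b2c3d4"], (12, 12), (45, 45)),
    ChunkStub("Placeholder table.", ["sec2md-p12-t0-e5f6a7b8"], (12, 13), (45, 46)),
]
jsonl = to_jsonl(chunks, filing_id="example-10k")
print(jsonl)
```

Storing the element IDs and both page ranges next to the text means a retrieved record can be cited without re-parsing the filing.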

Installation

pip install sec2md

Getting Started

Try the Getting Started notebook — parse a real 10-K, extract sections, chunk for RAG, and visualize traceability in under a minute.

Documentation

Full documentation: sec2md.readthedocs.io


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT © 2025

