
Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG


sec2md


Transform messy SEC filings into clean, structured Markdown. Built for AI. Optimized for retrieval. Traceable to the source.

Before and after comparison. Apple 10-K: raw SEC HTML (left) vs. sec2md output (right).


The Problem

SEC filings are the worst documents you'll ever feed to an LLM — 200 pages of nested HTML, XBRL tags, invisible elements, and tables-within-tables. Standard parsers break tables into garbled text, collapse sections into a single wall of prose, and lose the formatting cues that LLMs need to reason over structured content.

But even the converters that handle the HTML well still throw away provenance. You get clean text with no way to trace an answer back to where it came from in the original filing. For production RAG on regulated documents, that's a dealbreaker.

The Solution

sec2md rebuilds SEC filings as clean, semantic Markdown — preserving structure, tables, and pagination. Unlike generic converters, it also preserves the full citation chain from every piece of output back to the source HTML, and extracts iXBRL tags so you can filter by the accounting taxonomy itself.


Usage

1. Convert a Filing to Markdown

import sec2md

pages = sec2md.parse_filing(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm",
    user_agent="Your Name <you@example.com>"
)

pages[0]
# Page(number=1, tokens=412, elements=8, preview='**FORM 10-K** ...')
#   .content    → Clean markdown text
#   .elements   → [Element(id='sec2md-p1-s0-...', kind='section', ...), ...]
#   .tokens     → 412

# 60 pages | 293 citable elements | 46,238 tokens
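
The summary line above can be reproduced from the per-page attributes. The sketch below is a minimal illustration that uses a hypothetical `PageStub` stand-in (not a sec2md class) carrying only the `tokens` and `elements` attributes shown in the example, so it runs without network access:

```python
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Hypothetical stand-in that mirrors only the attributes shown above."""
    number: int
    tokens: int
    elements: list = field(default_factory=list)


def summarize(pages):
    """Build a one-line summary: page count, element count, token total."""
    n_elements = sum(len(p.elements or []) for p in pages)
    n_tokens = sum(p.tokens for p in pages)
    return f"{len(pages)} pages | {n_elements} citable elements | {n_tokens:,} tokens"


demo = [PageStub(1, 412, ["e"] * 8), PageStub(2, 350, ["e"] * 5)]
print(summarize(demo))
```

With real output from `parse_filing`, the same function should work as long as pages expose `.tokens` and `.elements` as shown.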

2. Extract Sections

A 10-K is modular — Business, Risk Factors, MD&A, Financial Statements. sec2md detects PART and ITEM boundaries automatically, so you can pull exactly the section you need instead of processing 200 pages:

import sec2md
from sec2md import Item10K

sections = sec2md.extract_sections(pages, filing_type="10-K")
risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)

risk
# Section(item='ITEM 1A', title='Risk Factors', pages=7-19, tokens=11474)
#   .markdown()   → Full section as markdown string
#   .page_range   → (7, 19)
#   .pages        → [Page(...), Page(...), ...]
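
To give a feel for what ITEM boundary detection involves, here is a simplified sketch that scans page text for ITEM headings with a regular expression. It is not sec2md's actual algorithm; the `ITEM_RE` pattern, the `find_item_boundaries` helper, and the `(page_number, text)` input shape are all assumptions made for illustration. A real detector also has to deal with tables of contents and in-prose mentions such as "see Item 7", which this version would misread as headings.

```python
import re

# Matches heading lines such as "Item 1. Business" or "**ITEM 1A. Risk Factors**".
ITEM_RE = re.compile(r"^\W*item\s+(\d{1,2}[A-C]?)\b[.:\s]*(.*)$", re.IGNORECASE)


def find_item_boundaries(pages):
    """pages: list of (page_number, markdown_text) tuples.

    Returns one dict per detected heading with its item label, title,
    and the first and last page it may occupy.
    """
    hits = []
    for number, text in pages:
        for line in text.splitlines():
            m = ITEM_RE.match(line.strip())
            if m:
                hits.append({
                    "item": f"ITEM {m.group(1).upper()}",
                    "title": m.group(2).strip(" *"),
                    "start_page": number,
                })
    # A section may end on the page where the next one starts, so that page is included.
    for cur, nxt in zip(hits, hits[1:]):
        cur["end_page"] = nxt["start_page"]
    if hits:
        hits[-1]["end_page"] = pages[-1][0]
    return hits


sample_pages = [
    (1, "**FORM 10-K**"),
    (3, "Item 1. Business\nThe Company designs and sells devices."),
    (7, "**ITEM 1A. Risk Factors**\nThe following risks apply."),
    (19, "Further discussion of risks."),
    (20, "Item 1B. Unresolved Staff Comments\nNone."),
]
sections_found = find_item_boundaries(sample_pages)
```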

3. Chunk for RAG

Page-aware, token-budgeted chunks — each one carrying page numbers, element IDs, XBRL tags, and display pages from the filing footer:

chunks = sec2md.chunk_pages(pages, chunk_size=512)

chunks[5]
# Chunk[5](pages=12-13, display_pages=45-46, blocks=4, tokens=487)
#   .content         → Clean markdown text
#   .page_range      → (12, 13)
#   .element_ids     → ['sec2md-p12-t3-a1b2c3d4', 'sec2md-p12-p4-e5f6g7h8', ...]
#   .tags            → ['us-gaap:Assets', 'us-gaap:Liabilities', ...]
#   .has_table       → True
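
The idea behind page-aware, token-budgeted chunking can be sketched as greedy packing: add blocks to the current chunk until the next one would exceed the budget, and record which pages the chunk spans. This is an illustrative simplification, not sec2md's implementation; `pack_blocks` and the `(page_number, token_count, text)` tuple shape are hypothetical.

```python
def pack_blocks(blocks, chunk_size=512):
    """Greedily pack (page_number, token_count, text) blocks into chunks.

    A chunk is closed as soon as adding the next block would exceed
    chunk_size. A single block larger than the budget gets its own chunk.
    """
    groups, current, used = [], [], 0
    for block in blocks:
        tokens = block[1]
        if current and used + tokens > chunk_size:
            groups.append(current)
            current, used = [], 0
        current.append(block)
        used += tokens
    if current:
        groups.append(current)
    return [
        {
            "page_range": (group[0][0], group[-1][0]),
            "tokens": sum(tokens for _, tokens, _ in group),
            "content": "\n\n".join(text for _, _, text in group),
        }
        for group in groups
    ]


demo_blocks = [
    (12, 200, "Paragraph A"),
    (12, 250, "Paragraph B"),
    (13, 50, "Paragraph C"),
    (13, 300, "Paragraph D"),
]
packed = pack_blocks(demo_blocks, chunk_size=512)
```

Here the first chunk spans pages 12 to 13 because the short block on page 13 still fits within the budget.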

You can also chunk individual sections or XBRL TextBlocks. Large tables are automatically split across chunks with headers preserved.
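
Splitting a large table while preserving its header can be pictured as follows. This sketch budgets by row count rather than tokens to keep it short, and `split_markdown_table` is a hypothetical helper, not part of the sec2md API:

```python
def split_markdown_table(table_md, max_rows=2):
    """Split a Markdown table into parts of at most max_rows body rows.

    Every part repeats the header row and the separator row, so each
    piece remains a valid, self-describing table.
    """
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    if not rows:
        return ["\n".join(lines)]
    return [
        "\n".join([header, separator, *rows[i:i + max_rows]])
        for i in range(0, len(rows), max_rows)
    ]


table = """
| Product Category | Revenue (millions) |
|------------------|--------------------|
| iPhone           | $200,583           |
| Mac              | $29,357            |
| iPad             | $28,300            |
"""
parts = split_markdown_table(table, max_rows=2)
```

Each part can then be embedded on its own without losing the column names.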


Supported Filings

sec2md works with any SEC filing served as HTML. For filings with standardized structure, it also extracts individual sections automatically:

| Filing Type | Section Extraction |
|-------------|--------------------|
| 10-K | 18 items (ITEM 1–16), full PART/ITEM detection |
| 10-Q | 11 items (Parts I & II) |
| 8-K | 41 items (1.01–9.01), exhibit parsing |
| 20-F | Items 1–19, 16A–16I |
| SC 13D | 7 items (Items 1–7) |
| SC 13G | 10 items (Items 1–10) |

All other filing types — S-1, S-3, S-4, F-1, 424B, 6-K, DEF 14A, DEFA14A, 40-F, N-CSR, SC TO-T, and any HTML exhibit or attachment — are parsed as clean Markdown with full traceability.

Complex Table Handling

SEC tables are notoriously complex — rowspans, colspans, merged cells, currency symbols in separate columns. Some filings don't even use <table> tags, building tables from absolutely-positioned CSS divs instead.

sec2md handles both:

| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone           | $200,583          |
| Mac              | $29,357           |
| iPad             | $28,300           |
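
As a rough illustration of why rowspans and colspans are awkward, the sketch below expands spanned cells into a rectangular grid before rendering Markdown. It is a generic technique, not sec2md's code; `expand_spans`, `to_markdown`, and the `(text, rowspan, colspan)` cell shape are assumptions for this example, and it does not cover CSS-positioned layouts or currency-column merging.

```python
def expand_spans(rows):
    """rows: list of rows, each a list of (text, rowspan, colspan) cells.

    Returns a rectangular grid of strings. The text is placed in the
    top-left slot of each span; the remaining covered slots are empty.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # slot already claimed by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text if (dr, dc) == (0, 0) else ""
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def to_markdown(grid):
    """Render a rectangular grid as a pipe table, using the first row as header."""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join(" --- " for _ in header) + "|")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


spanned = [
    [("Segment", 2, 1), ("Net sales", 1, 2)],
    [("2023", 1, 1), ("2022", 1, 1)],
    [("Americas", 1, 1), ("162,560", 1, 1), ("169,658", 1, 1)],
]
grid = expand_spans(spanned)
markdown_table = to_markdown(grid)
```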

Multimodal: Image Extraction

Charts, performance graphs, and segment breakdowns are extracted as first-class elements — same page tracking, same element IDs, same citation chain as every paragraph and table:

chunks = sec2md.chunk_pages(pages)

image_chunks = [c for c in chunks if c.has_image]
image_chunks[0]
# Chunk[12](pages=5, blocks=2, tokens=156)
#   .images      → [Element(id='sec2md-p5-i0-...', kind='image', ...)]
#   .has_image   → True

# Self-contained HTML — no broken image links
pages = sec2md.parse_filing(url, user_agent="...", embed_images=True)

Feed image chunks to a vision model, text chunks to a text model. Every image stays traceable back to the source filing.
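
A routing step along those lines might look like the sketch below. It relies only on the `has_image` flag shown above; `route_chunks` is a hypothetical helper, and the chunks here are `SimpleNamespace` stand-ins so the example runs on its own.

```python
from types import SimpleNamespace


def route_chunks(chunks):
    """Partition chunks into (vision_batch, text_batch) by their has_image flag."""
    vision, text = [], []
    for chunk in chunks:
        (vision if getattr(chunk, "has_image", False) else text).append(chunk)
    return vision, text


# Stand-in chunks; real ones would come from sec2md.chunk_pages(pages).
demo_chunks = [
    SimpleNamespace(content="Risk factors text", has_image=False),
    SimpleNamespace(content="Stock performance graph", has_image=True),
    SimpleNamespace(content="Segment discussion", has_image=False),
]
vision_batch, text_batch = route_chunks(demo_chunks)
```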

Traceability

Every paragraph, table, and heading gets a stable element ID that maps directly to a DOM node in the original filing HTML. From chunk to element to source — the chain is unbroken.

The parser injects these IDs directly into the HTML via parser.html() — so every element in your Markdown output has a corresponding tagged node in the source. You can store that annotated HTML yourself, and given any chunk's element_ids, locate and highlight the exact source nodes in the original filing.

parser = sec2md.Parser(filing_html)
pages = parser.get_pages()
chunks = sec2md.chunk_pages(pages)

# The annotated HTML has element IDs injected into the DOM
annotated_html = parser.html()

# See exactly where a chunk comes from in the original filing
chunk = chunks[5]
chunk.visualize(annotated_html)

# Or drill down to a single element
chunk.elements[0].visualize(annotated_html)

element.visualize() opens the original filing HTML, scrolls to the source element, and highlights it.

When your LLM says "revenue was $394B" and compliance asks "show me", you can point to the exact location in the filing. Not the chunk. Not the Markdown. The source.
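
If you store the annotated HTML yourself, looking up a chunk's source nodes can be done with the standard library alone. The sketch below assumes the injected identifiers appear as ordinary `id` attributes, which the description above suggests but does not state outright; `IdLocator` and `locate` are hypothetical names, and the sample HTML and IDs are invented for the example.

```python
from html.parser import HTMLParser


class IdLocator(HTMLParser):
    """Record the tag name and source position of elements with wanted ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted = set(wanted_ids)
        self.found = {}  # element id -> (tag, line, column)

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted:
            line, column = self.getpos()
            self.found[element_id] = (tag, line, column)


def locate(annotated_html, element_ids):
    """Return {element_id: (tag, line, column)} for every id present in the HTML."""
    locator = IdLocator(element_ids)
    locator.feed(annotated_html)
    locator.close()
    return locator.found


sample_html = (
    "<html>\n"
    "<body>\n"
    '<p id="sec2md-p1-p0-aaaa">Total net sales increased.</p>\n'
    '<table id="sec2md-p1-t1-bbbb"><tr><td>1</td></tr></table>\n'
    "</body></html>"
)
hits = locate(sample_html, ["sec2md-p1-t1-bbbb", "sec2md-p1-p0-aaaa", "missing-id"])
```

IDs that are absent from the HTML are simply left out of the result, which makes stale references easy to detect.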

iXBRL Tag Extraction

iXBRL filings embed structured financial facts directly in the HTML. sec2md extracts the XBRL concept names and attaches them to elements and chunks — giving you a metadata filter for retrieval. Instead of relying on semantic search alone, you can scope your query to only chunks tagged with the exact XBRL concepts you care about.

pages = sec2md.parse_filing(url, user_agent="...")
chunks = sec2md.chunk_pages(pages)

# Store chunk.tags as metadata in your vector DB, then filter at query time:
# "What was Apple's revenue?" + metadata filter: tags contains 'us-gaap:Revenue*'

# Or filter in code — find the balance sheet
[e for p in pages for e in (p.elements or []) if e.tags and 'us-gaap:Assets' in e.tags]

# All revenue-tagged chunks
[c for c in chunks if any('Revenue' in t for t in c.tags)]

On a real Apple 10-K: 76 of 293 elements carry XBRL tags across 330 distinct concepts. The Income Statement table alone carries 15 tags, the Balance Sheet 32, Cash Flows 29. Cover page elements get dei:* tags, and notes get their TextBlock concept names.
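
The wildcard filter mentioned earlier (`us-gaap:Revenue*`) can be approximated in plain Python with `fnmatch`. This is a sketch of the filtering idea only: `filter_by_tags` is a hypothetical helper, the records are plain dicts standing in for stored chunk metadata, and a vector database would use its own filter syntax.

```python
from fnmatch import fnmatchcase


def filter_by_tags(records, pattern):
    """Keep records that have at least one tag matching a shell-style pattern."""
    return [
        record for record in records
        if any(fnmatchcase(tag, pattern) for tag in record["tags"])
    ]


# Stand-in metadata records; in practice these would be built from chunk.tags.
records = [
    {"id": 0, "tags": ["us-gaap:Assets", "us-gaap:Liabilities"]},
    {"id": 1, "tags": ["us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["dei:EntityRegistrantName"]},
]
revenue_records = filter_by_tags(records, "us-gaap:Revenue*")
cover_page_records = filter_by_tags(records, "dei:*")
```

Matching is case-sensitive here, which suits XBRL concept names since they are case-sensitive identifiers.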


Installation

pip install sec2md

Getting Started

Try the Getting Started notebook — parse a real 10-K, extract sections, chunk for RAG, and visualize traceability in under a minute.

Works with edgartools

import sec2md
from edgar import Company

company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()
pages = sec2md.parse_filing(filing.html())

Documentation

Full documentation: sec2md.readthedocs.io


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT © 2025
