Extract structured sections from Markdown by header — bracket-style access with robust handling of code blocks, tables, math, and YAML front matter.

markdown-extractor

Turn a Markdown document into a navigable tree of sections keyed by header, then drop into the body of any section to get its blocks, plain text, JSON, or HTML — with optional XPath filtering.

  • Header tree — bracket access, recursive search, ASCII tree.
  • Block tree — paragraphs, ordered/unordered lists with nesting, code fences, blockquotes — parsed lazily per section.
  • Renderers: to_list(), to_dict(), to_json(), to_html(), to_text(), all on both MDExtractor and any Section.
  • Robust parsing — headers inside code blocks, tables, math blocks, and YAML front matter are correctly ignored.
  • Zero runtime dependencies for everything except XPath (opt-in extra).
  • Pure Python, >= 3.8. Lazy slicing — large documents stay cheap.

Installation

pip install markdown-extractor

For XPath support on to_html():

pip install "markdown-extractor[xpath]"

From a local checkout:

pip install -e .

Quick start

from markdown_extractor import MDExtractor

md = """
# Section 1
Some content here.

- **Lightweight** — small footprint.
- *Flexible* — extensible by design.
- `Tested` — full coverage.

## Subsection 1.1
More details.

## FAQ
- **Which versions are supported?**

    Versions 1.0 and up.
- **Where do I report bugs?**

    Open an issue on GitHub.
"""

e = MDExtractor(md)

e["Section 1"]                       # bracket access by title
e["Section 1"]["Subsection 1.1"]     # nested
e.get("Section 1", "Subsection 1.1") # multi-step (soft — see below)
e.list()                             # ['Section 1']
e[""]                                # synthetic root: the whole document

Every example below uses this same md. To load a document straight from disk instead:

e = MDExtractor.from_file("docs/guide.md")

Navigating the header tree

Strict access — [] / get_section()

e["Section 1"]                                 # KeyError if missing
e.get_section("Section 1", "Subsection 1.1")   # KeyError if any title is missing
"Section 1" in e                               # True

Soft access — get()

get() walks the same path but returns a null section sentinel instead of raising when something is missing. The sentinel is falsy and its renderer methods all return empty values, so chains stay safe:

e.get("Section 1", "Subsection 1.1").to_list()   # works
e.get("Nope", "Anything").to_list()              # []  — no exception
e.get("Nope").to_html()                          # ""
e.get("Nope").to_dict()                          # {'title':'', 'level':0, ...}

if not e.get("Optional Section"):
    print("section absent")

Use []/get_section() when you want missing keys to fail loudly; use get() when you want to flow through to an empty result.
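
The soft-access behaviour described above follows the null-object pattern. The sketch below is not the library's code: Node, NullNode, and NULL are hypothetical names, and the choice to let the sentinel absorb further lookups is an assumption made for illustration. It only shows how a falsy sentinel with empty renderers keeps a chain from raising.

```python
class Node:
    """Hypothetical stand-in for a section: a title, body lines, child nodes."""

    def __init__(self, title, lines=(), children=()):
        self.title = title
        self.lines = list(lines)
        self.children = list(children)

    def __getitem__(self, title):
        # Strict access: fail loudly when the title is missing.
        for child in self.children:
            if child.title == title:
                return child
        raise KeyError(title)

    def get(self, *path):
        # Soft access: walk the same path, but fall through to the sentinel.
        node = self
        for title in path:
            try:
                node = node[title]
            except KeyError:
                return NULL
        return node

    def to_list(self):
        return list(self.lines)


class NullNode(Node):
    """Falsy sentinel whose renderers return empty values."""

    def __init__(self):
        super().__init__(title="")

    def __bool__(self):
        return False

    def __getitem__(self, title):
        # Assumption for this sketch: the sentinel absorbs further lookups.
        return self


NULL = NullNode()

root = Node("", children=[Node("Section 1", ["Some content here."])])
print(root.get("Section 1").to_list())          # ['Some content here.']
print(root.get("Nope", "Anything").to_list())   # []
print(bool(root.get("Nope")))                   # False
```

Because the sentinel is a subclass of the real node type, callers can treat both uniformly and only check truthiness when they care whether the section existed.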

Discovery

e.list()                              # immediate child titles
e["Section 1"].list()                 # children of "Section 1"
e.find("Subsection 1.1")              # every section with that title
e.walk()                              # depth-first iterator over every header
e.tree()                              # ASCII tree of the whole document

Reading a section's body

Every Section parses its own body lazily into a tree of blocks (paragraphs / lists / list items / code / blockquote). This view is what powers to_list, to_dict, to_html, and to_text.

s = e["Section 1"]

s.blocks       # the parsed Block tree (lazy, cached)
s.text         # raw prose, no header line, no subsections
s.body         # raw prose + nested subsections
s.content      # the full slice including the header line

Reaching a single block — block() and text_plain

When you've isolated a single piece of text inside the block tree, Block.text_plain strips the inline Markdown markers so you don't have to round-trip through to_text():

s.blocks[1].children[1].text          # '*Flexible* — extensible by design.'
s.blocks[1].children[1].text_plain    # 'Flexible — extensible by design.'

block.text is kept raw on purpose, so that to_dict() / to_json() preserve the source text without loss. text_plain is the on-demand plain-text view.

For chains that might break (a missing list item, an empty section), use Section.block(*indices) and Block.get(*indices). These return a null Block on out-of-range indices instead of raising IndexError — so the chain stays safe end-to-end:

s.block(1, 1).text_plain            # 'Flexible — extensible by design.'
s.block(99, 99).text_plain          # ''  — no exception

# Equivalent two-step form (block() walks .blocks; .get() walks .children):
s.block(1).get(1).text_plain

bool(s.block(99))                   # False — null sentinel

section.blocks[i] / block.children[j] keep raising IndexError on out-of-range — strict access stays strict. Use block() / .get() only when you want soft fall-through.

.filtered(**kwargs) — narrow any collection in place

Collection accessors return BlockList / SectionList, both list subclasses that add a .filtered(**kwargs) method for chainable, attribute-based narrowing. The kwargs form keeps the API usable from Jinja2 / Django templates (which can't define lambdas). Existing list operations (indexing, iteration, len(), isinstance(x, list)) keep working unchanged.

s = e["Section 1"]

# Equality (most common):
s.blocks.filtered(kind="paragraph")
e["Section 1"].children.filtered(level=2)

# Multiple kwargs AND together:
s.blocks.filtered(kind="code", info="python")

# Operator suffixes:
e.headers().filtered(level__gte=2)
e.headers().filtered(level__in=[2, 3])
e.headers().filtered(title__startswith="Sub")

# Chains keep the typed return — both results are SectionList:
e.headers() \
    .filtered(level=2) \
    .filtered(title__startswith="A")

# No kwargs → a shallow copy of the same subclass.
e.headers().filtered()

Supported operator suffixes:

| Suffix | Meaning |
|---|---|
| (none) | == |
| __ne | != |
| __lt, __lte, __gt, __gte | comparison |
| __in | membership in iterable |
| __contains | substring (b in a) or container membership |
| __startswith, __endswith | string prefix / suffix |

Items missing the requested attribute are treated as non-matches — no exception. Slices preserve the subclass too, so section.children[1:].filtered(level=2) works.

The method is available on:

| Returns BlockList | Returns SectionList |
|---|---|
| Section.blocks | Section.children |
| Block.children | Section.find(title) |
| Block.walk() | Section.walk() |
| | MDExtractor.find(title) |
| | MDExtractor.walk() |
| | MDExtractor.headers() |

.mapped(path) — extract / flatten across a collection

BlockList, SectionList, Section, and Block all expose a .mapped(path) method — dotted-path attribute traversal that flattens list-valued attributes and keeps the typed return so you can keep chaining .filtered(...). Calling it on a single Section / Block behaves like a one-element collection, so the same path rules apply.

s = e["Section 1"]

# Pull every body block from every direct subsection (flat BlockList):
s.children.mapped("blocks")

# One step down: every list_item inside every top-level
# block of Section 1, fully flattened:
s.blocks.mapped("children")             # → 3 list_items (the bullets)

# Two-level dotted path — every inline token under those bullets:
s.blocks.mapped("children.inlines")
# → [bold 'Lightweight', text ' — …', em 'Flexible', text ' — …',
#    code 'Tested',     text ' — …']

# Equivalent — chained calls produce the same result as the dotted path:
s.blocks.mapped("children.inlines") == s.blocks.mapped("children").mapped("inlines")

# Chain with .filtered() — still a BlockList:
s.blocks.mapped("children").filtered(kind="list_item")

# Scalar attributes return a plain list:
s.children.mapped("title")              # ['Subsection 1.1', 'FAQ']
e.headers().mapped("title")             # ['Section 1', 'Subsection 1.1', 'FAQ']

# Works on a single Section / Block too — treated as a one-element collection:
s.mapped("title")                       # ['Section 1']
s.mapped("children.blocks")             # BlockList — every block under every child
s.blocks[1].mapped("children.inlines")  # same dotted-path rules from a single Block

Behaviour notes:

  • List-valued attributes (e.g. .blocks, .inlines, .children) are flattened into the result.
  • Scalar attributes (e.g. .title, .kind, .text) are appended; the return is a plain list (no .filtered() chaining).
  • The concrete subclass is preserved when every step yields the same FilteredList subclass, so the typed-return chain … .mapped("blocks").filtered(kind="paragraph") keeps working.
  • Items missing the attribute are skipped silently — same convention as .filtered().
  • coll.mapped("") returns a shallow copy of the same subclass.
  • On a single Section / Block, record.mapped(path) is equivalent to wrapping it in a one-element list and mapping — so a scalar path returns a list of one (e.g. section.mapped("title") == [section.title]).
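
The flattening rules in the notes above can be sketched as a small standalone function. This is a simplified illustration, not the library's implementation: the free function mapped() and the SimpleNamespace records are hypothetical, and the sketch always returns a plain list rather than preserving a typed subclass.

```python
from types import SimpleNamespace as NS


def mapped(items, path):
    """Follow a dotted attribute path across items, flattening list values."""
    current = list(items)
    for name in filter(None, path.split(".")):
        step = []
        for item in current:
            if not hasattr(item, name):
                continue              # items missing the attribute are skipped
            value = getattr(item, name)
            if isinstance(value, list):
                step.extend(value)    # list-valued attributes are flattened
            else:
                step.append(value)    # scalar attributes are appended
        current = step
    return current


bullets = [NS(kind="list_item", text="a"), NS(kind="list_item", text="b")]
blocks = [
    NS(kind="paragraph", text="intro", children=[]),
    NS(kind="list", text="", children=bullets),
]
section = NS(title="Section 1", blocks=blocks)

print(mapped([section], "blocks.children") == bullets)   # True
print(mapped([section], "blocks.kind"))                  # ['paragraph', 'list']
print(mapped([section], "title"))                        # ['Section 1']
```

An empty path performs no steps, so the function returns a shallow copy of its input, which matches the convention described for coll.mapped("").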

to_list() — flatten body to strings

One entry per top-level block. Lists expand to one entry per top-level item; text is preserved raw (use text_plain per-item if you want markers stripped):

s.to_list()
# ['Some content here.',
#  '**Lightweight** — small footprint.',
#  '*Flexible* — extensible by design.',
#  '`Tested` — full coverage.']

to_dict() / to_json() — full structured output

Header subsections live under children; the body block tree lives under blocks. Indented continuation paragraphs under a bullet (FAQ-style) are attached as that bullet's children:

e["Section 1"]["FAQ"].to_dict()
# {
#   "title": "FAQ",
#   "level": 2,
#   "text": "...",
#   "blocks": [
#     {"kind": "list", "text": "", "children": [
#       {"kind": "list_item",
#        "text": "**Which versions are supported?**",
#        "children": [
#          {"kind": "paragraph", "text": "Versions 1.0 and up."}
#        ]},
#       {"kind": "list_item",
#        "text": "**Where do I report bugs?**",
#        "children": [
#          {"kind": "paragraph", "text": "Open an issue on GitHub."}
#        ]}
#     ]}
#   ],
#   "children": []
# }

e["Section 1"]["FAQ"].to_json(indent=2)

to_text() — Markdown stripped

Inline markers (**bold**, *em*, `code`, [link](url), ![alt](url)) are reduced to their visible text. Bullets become "- " lines, ordered items become "1. " lines, nested children indent four spaces, and fenced code is kept verbatim:

print(s.to_text())
# Some content here.
#
# - Lightweight — small footprint.
# - Flexible — extensible by design.
# - Tested — full coverage.
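
For readers curious how inline markers can be reduced to their visible text, here is a minimal regex-based sketch. It is an assumption-laden simplification rather than the library's parser: strip_inline and the _INLINE table are hypothetical names, and edge cases such as nested markers or asterisks inside code spans are not handled.

```python
import re

# Order matters: images before links, bold before emphasis.
_INLINE = [
    (re.compile(r"!\[([^\]]*)\]\([^)]*\)"), r"\1"),   # ![alt](url) -> alt
    (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),    # [text](url) -> text
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),            # **bold** -> bold
    (re.compile(r"\*(.+?)\*"), r"\1"),                # *em* -> em
    (re.compile(r"`([^`]+)`"), r"\1"),                # `code` -> code
]


def strip_inline(text):
    """Reduce common inline Markdown markers to their visible text."""
    for pattern, repl in _INLINE:
        text = pattern.sub(repl, text)
    return text


print(strip_inline("**Lightweight** and *flexible*, see [docs](https://example.com)."))
# Lightweight and flexible, see docs.
```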

to_html() — render to HTML, optionally filter with XPath

s.to_html()
# <p>Some content here.</p>
# <ul>
# <li><strong>Lightweight</strong> — small footprint.</li>
# <li><em>Flexible</em> — extensible by design.</li>
# <li><code>Tested</code> — full coverage.</li>
# </ul>

s.to_html(xpath=".//ul/li")
# ['<li><strong>Lightweight</strong> — small footprint.</li>',
#  '<li><em>Flexible</em> — extensible by design.</li>',
#  '<li><code>Tested</code> — full coverage.</li>']

s.to_html(xpath=".//strong")
# ['<strong>Lightweight</strong>']

Getting just the text value

When you want the data inside the matched elements rather than the markup, you have two options:

1. as_text=True — flatten each match to its text content, including text nested inside inline tags (<strong>, <em>, <code>, …):

s.to_html(xpath=".//ul/li", as_text=True)
# ['Lightweight — small footprint.',
#  'Flexible — extensible by design.',
#  'Tested — full coverage.']

s.to_html(xpath=".//ul/li[1]", as_text=True)
# ['Lightweight — small footprint.']

2. /text() in the XPath itself — works without the as_text flag, but only collects direct text nodes. Text wrapped in inline tags is skipped:

s.to_html(xpath=".//ul/li/text()")
# [' — small footprint.',
#  ' — extensible by design.',
#  ' — full coverage.']
# Note: 'Lightweight' / 'Flexible' / 'Tested' are missing — they sit
# inside <strong>/<em>/<code>, which /text() doesn't enter.

Use as_text=True when items contain inline formatting; use /text() when you specifically want only the loose text and not the wrapped content.
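
The difference between the two options can be reproduced with the standard library alone. The sketch below uses xml.etree.ElementTree instead of lxml, and full_text / direct_text are hypothetical helper names. Note one simplification: direct_text joins the direct text nodes into a single string, whereas an XPath /text() query returns them as separate results.

```python
import xml.etree.ElementTree as ET


def full_text(element):
    """All text under the element, including text inside inline tags."""
    return "".join(element.itertext())


def direct_text(element):
    """Direct text nodes only: the element's own text plus each child's tail."""
    return (element.text or "") + "".join(child.tail or "" for child in element)


items = ET.fromstring(
    "<ul>"
    "<li><strong>Lightweight</strong>: small footprint.</li>"
    "<li>Plain item with <code>code</code> inside.</li>"
    "</ul>"
).findall("./li")

print([full_text(li) for li in items])
# ['Lightweight: small footprint.', 'Plain item with code inside.']
print([direct_text(li) for li in items])
# [': small footprint.', 'Plain item with  inside.']
```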

XPath uses lxml and is opt-in via the [xpath] extra:

pip install "markdown-extractor[xpath]"

Without lxml, plain to_html() still works — only to_html(xpath=...) raises ModuleNotFoundError with the install hint.
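
An opt-in dependency of this kind is commonly guarded with a lazy import that re-raises with an install hint. The sketch below shows the general pattern only; optional_import and its parameters are hypothetical and are not claimed to be how the library does it.

```python
import importlib


def optional_import(module_name, extra):
    """Import a module on demand, adding an install hint when it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name} is required for this feature; "
            f'install it with: pip install "markdown-extractor[{extra}]"'
        ) from exc


# The import only happens when the feature is actually used:
json_module = optional_import("json", "xpath")   # stdlib module, always present
print(json_module.dumps({"ok": True}))           # {"ok": true}
```

Deferring the import to the call site is what lets the rest of the package stay free of runtime dependencies.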


Robust extraction

The header parser walks the document with full block-context awareness, so a stray # is never mistaken for a header.

| Block | Example | Behaviour |
|---|---|---|
| Fenced code | ``` ... ``` (or ~~~) | Headers inside are ignored |
| Math block | $$ ... $$ | Headers inside are ignored |
| Tables | \| col \| col \| rows | Cell contents are ignored |
| YAML front matter | --- at line 1, closes on ---/... | Whole block is ignored |

````markdown
---
title: My Doc
# not a real header
---

# Real Header

```python
# also not a header
```

$$
# definitely not a header
$$
````

With that document as md:

```python
MDExtractor(md).list()  # ['Real Header']
```

ATX headers (# through ######) and Setext underlines (=== / ---) are both recognised. Skip-level jumps (h1 → h3 → h2) are handled gracefully. Any leading whitespace is allowed before a header.
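
To make the idea of block-context awareness concrete, here is a minimal line scanner that tracks front matter, code fences, and $$ blocks before testing for an ATX header. It is a sketch under assumptions, not the library's parser: scan_headers and the two regexes are hypothetical, and Setext headers and tables are left out for brevity.

```python
import re

ATX = re.compile(r"^\s*(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")


def scan_headers(markdown):
    """Return (level, title) pairs, skipping front matter, fences, and $$ blocks."""
    lines = markdown.splitlines()
    headers = []
    fence = None        # marker that opened the current code fence, if any
    in_math = False
    start = 0

    # YAML front matter: '---' on the first line, closed by '---' or '...'.
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() in ("---", "..."):
                start = i + 1
                break

    for line in lines[start:]:
        stripped = line.strip()
        if fence is not None:
            # Inside a fence: only a closing fence of the same character ends it.
            if stripped.startswith(fence) and set(stripped) == {fence[0]}:
                fence = None
            continue
        if in_math:
            if stripped == "$$":
                in_math = False
            continue
        opened = FENCE.match(line)
        if opened:
            fence = opened.group(1)
            continue
        if stripped == "$$":
            in_math = True
            continue
        header = ATX.match(line)
        if header:
            headers.append((len(header.group(1)), header.group(2)))
    return headers


MARK = "`" * 3   # built indirectly so the sample can live inside a fenced block
sample = "\n".join([
    "---", "title: My Doc", "# not a real header", "---", "",
    "# Real Header", "",
    MARK + "python", "# also not a header", MARK, "",
    "$$", "# definitely not a header", "$$",
])
print(scan_headers(sample))   # [(1, 'Real Header')]
```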


Slices for any granularity

Each Section exposes three text views:

| Property | Includes header line? | Includes child sections? |
|---|---|---|
| content | yes | yes |
| body | no | yes |
| text | no | no (own prose only) |
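
The three views can be thought of as three slices over the same list of source lines. The sketch below assumes each section records a few line indices; views() and its parameter names are hypothetical and only illustrate the relationship between the properties.

```python
def views(lines, start, end, first_child):
    """Three views of one section, given line indices into the document.

    start:       index of the section's header line
    end:         index one past the section's last line
    first_child: index of its first child header, or end if it has none
    """
    return {
        "content": "\n".join(lines[start:end]),           # header + everything
        "body": "\n".join(lines[start + 1:end]),          # everything but header
        "text": "\n".join(lines[start + 1:first_child]),  # own prose only
    }


doc = ["# Section 1", "Some content here.", "## Subsection 1.1", "More details."]
v = views(doc, start=0, end=4, first_child=2)
print(v["text"])   # Some content here.
```

Storing indices and slicing only on request is one way a document can stay cheap until a view is actually read.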

API reference

MDExtractor

| Member | Description |
|---|---|
| MDExtractor(markdown) | Parse a string. |
| MDExtractor.from_file(path, encoding="utf-8") | Read & parse a file. |
| e[""] / e.root | Synthetic root section (whole document). |
| e["Title"] / e[i] | Top-level child by title or index. |
| "Title" in e | Membership test. |
| iter(e) / len(e) | Iterate top-level children / count them. |
| .list() | Top-level header titles. |
| .get_section(*path) | Strict multi-step descent (raises). |
| .get(*path) | Soft multi-step descent (null sentinel on miss). |
| .find(title) | All sections (any depth) with that title. |
| .walk() / .headers() | Depth-first iterator / list of every header. |
| .tree() | ASCII tree of the document's header structure. |
| .to_list() | Body flattened to strings (proxies to root). |
| .to_dict() / .to_json(**kw) | Serialise the tree (with body blocks). |
| .to_text() | Body rendered as plain text. |
| .to_html(xpath=None, as_text=False) | Body rendered as HTML, optionally XPath-filtered (as_text=True returns text values). |
| .block(*indices) | Soft index into root's body block tree (null Block on miss). |
| .content | Original Markdown source. |

Section

| Member | Description |
|---|---|
| .title / .level | Header text and depth (1–6, or 0 for root). |
| .parent / .children | Tree links. |
| .path | Title chain from top-level ancestor down to this node. |
| .content / .body / .text | Raw text views (see table above). |
| .blocks | Lazy-parsed body block tree. |
| section["Title"] / section[i] | Child by title or index (strict). |
| section.get(*path) | Soft multi-step descent (null sentinel on miss). |
| section.get_section(*path) | Strict multi-step descent. |
| "Title" in section | Membership test. |
| iter(section) / len(section) | Iterate / count direct children. |
| bool(section) | False only for the null sentinel returned by get(). |
| .list() | Direct child titles. |
| .find(title) | Recursive search. |
| .walk() | Depth-first iterator over self + descendants. |
| .to_list() | Body flattened to one string per top-level block / item. |
| .to_dict() | Nested dict: blocks (body) and children (header subsections). |
| .to_json(**kw) | json.dumps of to_dict(). |
| .to_text() | Body rendered as plain text (Markdown markers stripped). |
| .to_html(xpath=None, as_text=False) | Body rendered as HTML, optionally XPath-filtered (as_text=True returns text values). |
| .block(*indices) | Soft index walk into the body block tree (null Block on miss). |
| .tree() | ASCII tree of this subsection. |
| str(section) | Same as .content. |

Block

A node in the body block tree.

| Member | Description |
|---|---|
| .kind | One of paragraph, list, ordered_list, list_item, code, blockquote. |
| .text | The block's own text: raw, with inline Markdown markers preserved. |
| .text_plain | .text with inline markers stripped (**bold** → bold, [t](u) → t, …). |
| .children | Nested blocks (list items, sub-lists, indented paragraphs). |
| .info | Code-fence language, e.g. "python". |
| .get(*indices) | Soft index walk into .children (null Block on miss). Chainable. |
| bool(block) | False only for the null sentinel returned by get() / Section.block(). |
| .walk() | Yield this block and every descendant. |
| .to_dict() | JSON-friendly nested dict. |

Development

pip install -e ".[dev]"
pytest

License

See LICENSE.
