Extract structured sections from Markdown by header — bracket-style access with robust handling of code blocks, tables, math, and YAML front matter.

Maintainers

fasilwdr

markdown-extractor

Turn a Markdown document into a navigable tree of sections keyed by header, then drop into the body of any section to get its blocks, plain text, JSON, or HTML — with optional XPath filtering.

Header tree — bracket access, recursive search, ASCII tree.
Block tree — paragraphs, ordered/unordered lists with nesting, code fences, blockquotes — parsed lazily per section.
Renderers — to_list(), to_dict(), to_json(), to_html(), to_text(), all on both MDExtractor and any Section.
Robust parsing — headers inside code blocks, tables, math blocks, and YAML front matter are correctly ignored.
Zero runtime dependencies for everything except XPath (opt-in extra).
Pure Python, >= 3.8. Lazy slicing — large documents stay cheap.

Installation

pip install markdown-extractor

For XPath support on to_html():

pip install "markdown-extractor[xpath]"   # quoted so shells such as zsh do not expand the brackets

From a local checkout:

pip install -e .

Quick start

from markdown_extractor import MDExtractor

md = """
# Section 1
Some content here.

- **Lightweight** — small footprint.
- *Flexible* — extensible by design.
- `Tested` — full coverage.

## Subsection 1.1
More details.

## FAQ
- **Which versions are supported?**

    Versions 1.0 and up.
- **Where do I report bugs?**

    Open an issue on GitHub.
"""

e = MDExtractor(md)

e["Section 1"]                       # bracket access by title
e["Section 1"]["Subsection 1.1"]     # nested
e.get("Section 1", "Subsection 1.1") # multi-step (soft — see below)
e.list()                             # ['Section 1']
e[""]                                # synthetic root: the whole document

Every example below uses this same md. To parse a file straight from disk instead, use from_file():

e = MDExtractor.from_file("docs/guide.md")

Navigating the header tree

Strict access — `[]` / `get_section()`

e["Section 1"]                          # KeyError if missing
e.get_section("Section 1", "Subsection 1.1")   # KeyError if any title is missing
"Section 1" in e

Soft access — `get()`

get() walks the same path but returns a null section sentinel instead of raising when something is missing. The sentinel is falsy and its renderer methods all return empty values, so chains stay safe:

e.get("Section 1", "Subsection 1.1").to_list()   # works
e.get("Nope", "Anything").to_list()              # []  — no exception
e.get("Nope").to_html()                          # ""
e.get("Nope").to_dict()                          # {'title':'', 'level':0, ...}

if not e.get("Optional Section"):
    print("section absent")

Use []/get_section() when you want missing keys to fail loudly; use get() when you want to flow through to an empty result.
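The soft-access behaviour described above follows the null-object pattern. The sketch below is not the library's code: `Node`, `NullNode` and `NULL` are hypothetical names, used only to show how a falsy sentinel with empty renderers lets a chain run to the end without raising.

```python
class Node:
    """Hypothetical tree node with strict and soft child lookup."""

    def __init__(self, title, children=(), lines=()):
        self.title = title
        self.children = list(children)
        self.lines = list(lines)

    def __getitem__(self, title):
        # Strict access: fail loudly when the title is missing.
        for child in self.children:
            if child.title == title:
                return child
        raise KeyError(title)

    def get(self, *path):
        # Soft access: walk the path, fall back to the sentinel on a miss.
        node = self
        for title in path:
            try:
                node = node[title]
            except KeyError:
                return NULL
        return node

    def to_list(self):
        return list(self.lines)


class NullNode(Node):
    """Falsy stand-in returned when a path does not resolve."""

    def __init__(self):
        super().__init__(title="")

    def __bool__(self):
        return False

    def get(self, *path):
        # Any further descent from the sentinel stays on the sentinel.
        return self


NULL = NullNode()

sub = Node("Subsection 1.1", lines=["More details."])
root = Node("", [Node("Section 1", [sub])])

found = root.get("Section 1", "Subsection 1.1").to_list()
missing = root.get("Nope", "Anything").to_list()
```

Because the sentinel shares the node interface, calling code needs no special case for a missing section; a plain truth test is enough when the distinction matters.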

Discovery

e.list()                              # immediate child titles
e["Section 1"].list()                 # children of "Section 1"
e.find("Subsection 1.1")              # every section with that title
e.walk()                              # depth-first iterator over every header
e.tree()                              # ASCII tree of the whole document
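To make the difference between a depth-first walk and a tree rendering concrete, here is a small stand-alone sketch. The tuple layout, `walk` and `render_tree` are hypothetical, and the indentation style is an assumption: this README does not show the exact output format of e.tree().

```python
from typing import Iterator, Tuple

# Hypothetical stand-in for a header tree: (title, [children]).
Tree = Tuple[str, list]


def walk(node: Tree, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, title) pairs in depth-first, document order."""
    title, children = node
    yield depth, title
    for child in children:
        yield from walk(child, depth + 1)


def render_tree(node: Tree) -> str:
    """Draw one line per header, indented two spaces per level."""
    return "\n".join("  " * depth + title for depth, title in walk(node))


doc = ("(root)", [("Section 1", [("Subsection 1.1", []), ("FAQ", [])])])

titles = [title for _, title in walk(doc)]
outline = render_tree(doc)
```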

Reading a section's body

Every Section parses its own body lazily into a tree of blocks (paragraphs / lists / list items / code / blockquote). This view is what powers to_list, to_dict, to_html, and to_text.

s = e["Section 1"]

s.blocks       # the parsed Block tree (lazy, cached)
s.text         # raw prose, no header line, no subsections
s.body         # raw prose + nested subsections
s.content      # the full slice including the header line
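One common way to get "lazy, cached" behaviour in Python 3.8+ is `functools.cached_property`. The sketch below assumes that approach; the README does not say how the library implements it, and `LazySection` with its naive blank-line splitter is purely illustrative.

```python
from functools import cached_property


class LazySection:
    """Hypothetical section that parses its body only when first asked."""

    parse_calls = 0  # class-wide counter, present only to make the laziness visible

    def __init__(self, raw: str):
        self.raw = raw

    @cached_property
    def blocks(self):
        # Runs once per instance; the result is then stored on the instance.
        LazySection.parse_calls += 1
        return [chunk.strip() for chunk in self.raw.split("\n\n") if chunk.strip()]


section = LazySection("Some content here.\n\n- item one\n- item two\n")
calls_before = LazySection.parse_calls   # construction alone parses nothing
first = section.blocks                   # triggers the parse
second = section.blocks                  # served from the cache
calls_after = LazySection.parse_calls
```

Sections that are never inspected never pay the parsing cost, which is what keeps large documents cheap.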

Reaching a single block — `block()` and `text_plain`

When you've isolated a single piece of text inside the block tree, Block.text_plain strips the inline Markdown markers so you don't have to round-trip through to_text():

s.blocks[1].children[1].text          # '*Flexible* — extensible by design.'
s.blocks[1].children[1].text_plain    # 'Flexible — extensible by design.'

block.text is kept raw on purpose, so that to_dict() / to_json() preserve the source text exactly. text_plain is the on-demand plain-text view.

For chains that might break (a missing list item, an empty section), use Section.block(*indices) and Block.get(*indices). These return a null Block on out-of-range indices instead of raising IndexError — so the chain stays safe end-to-end:

s.block(1, 1).text_plain            # 'Flexible — extensible by design.'
s.block(99, 99).text_plain          # ''  — no exception

# Equivalent two-step form (block() walks .blocks; .get() walks .children):
s.block(1).get(1).text_plain

bool(s.block(99))                   # False — null sentinel

section.blocks[i] / block.children[j] still raise IndexError on out-of-range indices, so strict access stays strict. Use block() / .get() only when you want soft fall-through.

`to_list()` — flatten body to strings

One entry per top-level block. Lists expand to one entry per top-level item; text is preserved raw (use text_plain per-item if you want markers stripped):

s.to_list()
# ['Some content here.',
#  '**Lightweight** — small footprint.',
#  '*Flexible* — extensible by design.',
#  '`Tested` — full coverage.']

`to_dict()` / `to_json()` — full structured output

Header subsections live under children; the body block tree lives under blocks. Indented continuation paragraphs under a bullet (FAQ-style) are attached as that bullet's children:

e["Section 1"]["FAQ"].to_dict()
# {
#   "title": "FAQ",
#   "level": 2,
#   "text": "...",
#   "blocks": [
#     {"kind": "list", "text": "", "children": [
#       {"kind": "list_item",
#        "text": "**Which versions are supported?**",
#        "children": [
#          {"kind": "paragraph", "text": "Versions 1.0 and up."}
#        ]},
#       {"kind": "list_item",
#        "text": "**Where do I report bugs?**",
#        "children": [
#          {"kind": "paragraph", "text": "Open an issue on GitHub."}
#        ]}
#     ]}
#   ],
#   "children": []
# }

e["Section 1"]["FAQ"].to_json(indent=2)
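Since the output is plain dicts and lists, it can be post-processed without the library. The sketch below copies the dict shape shown above (the elided "text" value is kept as "...") and pairs each FAQ question with its answer; `faq_pairs` is a hypothetical helper, not part of the package.

```python
import json

# Shape copied from the to_dict() output shown above.
faq = {
    "title": "FAQ",
    "level": 2,
    "text": "...",
    "blocks": [
        {"kind": "list", "text": "", "children": [
            {"kind": "list_item", "text": "**Which versions are supported?**",
             "children": [{"kind": "paragraph", "text": "Versions 1.0 and up."}]},
            {"kind": "list_item", "text": "**Where do I report bugs?**",
             "children": [{"kind": "paragraph", "text": "Open an issue on GitHub."}]},
        ]}
    ],
    "children": [],
}


def faq_pairs(section: dict) -> dict:
    """Map each list item's text to the text of its first child block."""
    pairs = {}
    for block in section.get("blocks", []):
        if block.get("kind") != "list":
            continue
        for item in block.get("children", []):
            answers = item.get("children", [])
            question = item["text"].strip("*")  # drop the surrounding bold markers
            pairs[question] = answers[0]["text"] if answers else ""
    return pairs


pairs = faq_pairs(faq)
round_tripped = json.loads(json.dumps(faq))  # the structure is JSON-safe
```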

`to_text()` — Markdown stripped

Inline markers (**bold**, *em*, `code`, [link](url), ![alt](url)) are reduced to their visible text. Bullets become "- " lines, ordered items become "1. " lines, nested children are indented by four spaces, and fenced code is kept verbatim:

print(s.to_text())
# Some content here.
#
# - Lightweight — small footprint.
# - Flexible — extensible by design.
# - Tested — full coverage.
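For intuition, inline-marker stripping of this kind can be approximated with a few ordered regular expressions. This is a naive sketch under stated assumptions, not the library's implementation: `strip_inline` and its rules are hypothetical, and edge cases such as asterisks inside code spans are not handled.

```python
import re

# Order matters: images before links, bold before emphasis.
_INLINE_RULES = [
    (re.compile(r"!\[([^\]]*)\]\([^)]*\)"), r"\1"),  # ![alt](url) -> alt
    (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),   # [text](url) -> text
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),           # **bold** -> bold
    (re.compile(r"\*(.+?)\*"), r"\1"),               # *em* -> em
    (re.compile(r"`([^`]*)`"), r"\1"),               # `code` -> code
]


def strip_inline(text: str) -> str:
    """Reduce common inline Markdown markers to their visible text."""
    for pattern, replacement in _INLINE_RULES:
        text = pattern.sub(replacement, text)
    return text


plain = strip_inline(
    "**Lightweight** and *flexible*: see [the docs](https://example.com), "
    "![logo](logo.png), `code`."
)
```

The image rule runs before the link rule because an image is a link with a leading "!"; in the opposite order the "!" would be left behind.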

`to_html()` — render to HTML, optionally filter with XPath

s.to_html()
# <p>Some content here.</p>
# <ul>
# <li><strong>Lightweight</strong> — small footprint.</li>
# <li><em>Flexible</em> — extensible by design.</li>
# <li><code>Tested</code> — full coverage.</li>
# </ul>

s.to_html(xpath=".//ul/li")
# ['<li><strong>Lightweight</strong> — small footprint.</li>',
#  '<li><em>Flexible</em> — extensible by design.</li>',
#  '<li><code>Tested</code> — full coverage.</li>']

s.to_html(xpath=".//strong")
# ['<strong>Lightweight</strong>']

Getting just the text value

When you want the data inside the matched elements rather than the markup, you have two options:

1. as_text=True — flatten each match to its text content, including text nested inside inline tags (<strong>, <em>, <code>, …):

s.to_html(xpath=".//ul/li", as_text=True)
# ['Lightweight — small footprint.',
#  'Flexible — extensible by design.',
#  'Tested — full coverage.']

s.to_html(xpath=".//ul/li[1]", as_text=True)
# ['Lightweight — small footprint.']

2. /text() in the XPath itself — works without the as_text flag, but only collects direct text nodes. Text wrapped in inline tags is skipped:

s.to_html(xpath=".//ul/li/text()")
# [' — small footprint.',
#  ' — extensible by design.',
#  ' — full coverage.']
# Note: 'Lightweight' / 'Flexible' / 'Tested' are missing — they sit
# inside <strong>/<em>/<code>, which /text() doesn't enter.

Use as_text=True when items contain inline formatting; use /text() when you specifically want only the loose text and not the wrapped content.
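The difference between the two options comes from how text nodes sit in an element tree. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (the two share the .text / .tail / itertext() model); `direct_text` and `full_text` are hypothetical helpers, and whether as_text=True is implemented this way is an assumption.

```python
import xml.etree.ElementTree as ET

html = (
    "<ul>"
    "<li><strong>Lightweight</strong> and small.</li>"
    "<li><em>Flexible</em> by design.</li>"
    "</ul>"
)
root = ET.fromstring(html)


def direct_text(el):
    """Text nodes that are direct children of el (what /text() selects)."""
    parts = [el.text] + [child.tail for child in el]
    return [part for part in parts if part]


def full_text(el):
    """All descendant text joined together (the element's string value)."""
    return "".join(el.itertext())


items = root.findall("./li")
loose = [text for li in items for text in direct_text(li)]
flattened = [full_text(li) for li in items]
```

Here `loose` holds only the text that follows the inline tags, while `flattened` holds each item's complete visible text.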

XPath uses lxml and is opt-in via the [xpath] extra:

pip install "markdown-extractor[xpath]"

Without lxml, plain to_html() still works — only to_html(xpath=...) raises ModuleNotFoundError with the install hint.

Robust extraction

The header parser walks the document with full block-context awareness, so a stray # is never mistaken for a header.

Block	Example	Behaviour
Fenced code	``` … ``` (or `~~~`)	Headers inside are ignored
Math block	`$$` … `$$`	Headers inside are ignored
Tables	`| col | col |` rows	Cell contents are ignored
YAML front matter	`---` at line 1, closes on `---`/`...`	Whole block is ignored

---
title: My Doc
# not a real header
---

# Real Header

```python
# also not a header

# definitely not a header
```

With that document as md:

MDExtractor(md).list()  # ['Real Header']

ATX headers (# … ######) and Setext underlines (=== / ---) are both recognised. Skip-level jumps (h1 → h3 → h2) are handled gracefully. Any leading whitespace is allowed before a header.
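A context-aware header scan of this sort can be sketched as a small state machine. The version below is a simplified illustration, not the library's parser: `scan_headers` is a hypothetical name, it handles only ATX headers, and it skips YAML front matter, fenced code and `$$` math blocks. Setext underlines, tables and fence-length matching are left out.

```python
import re
from typing import List

ATX = re.compile(r"^\s*(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")


def scan_headers(markdown: str) -> List[str]:
    """Return ATX header titles, ignoring '#' lines inside protected blocks."""
    lines = markdown.splitlines()

    # YAML front matter: '---' on the very first line, closed by '---' or '...'.
    start = 0
    if lines and lines[0].strip() == "---":
        start = 1
        while start < len(lines) and lines[start].strip() not in ("---", "..."):
            start += 1
        start += 1  # step past the closing marker

    titles = []
    fence = ""       # fence character of the open code block, "" when outside
    in_math = False  # True between a pair of $$ lines
    for line in lines[start:]:
        stripped = line.strip()
        opener = FENCE.match(line)
        if fence:
            # Inside a fence: only a bare fence of the same character closes it.
            if opener and stripped[0] == fence and set(stripped) == {fence}:
                fence = ""
            continue
        if in_math:
            if stripped == "$$":
                in_math = False
            continue
        if opener:
            fence = opener.group(1)[0]
            continue
        if stripped == "$$":
            in_math = True
            continue
        header = ATX.match(line)
        if header:
            titles.append(header.group(2))
    return titles


sample = "\n".join([
    "---", "title: My Doc", "# not a real header", "---", "",
    "# Real Header", "",
    "```python", "# also not a header", "```", "",
    "$$", "# math, not a header", "$$", "",
    "## Child",
])
headers = scan_headers(sample)
```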

Slices for any granularity

Each Section exposes three text views:

Property	Includes header line?	Includes child sections?
`content`	yes	yes
`body`	no	yes
`text`	no	no — own prose only
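The relationship between the three views can be illustrated by slicing a section's source by hand. This sketch assumes the slice starts at the header line and that the first nested ATX header ends the section's own prose; `three_views` is hypothetical, it ignores fenced code, and the library's exact whitespace handling is not specified here.

```python
import re

HEADER = re.compile(r"^#{1,6}\s")


def three_views(content: str):
    """Split one section's source into (content, body, text)."""
    _header, _, body = content.partition("\n")  # drop the header line
    own = []
    for line in body.splitlines(keepends=True):
        if HEADER.match(line):
            break  # first child header: own prose ends here
        own.append(line)
    return content, body, "".join(own)


source = "# Section 1\nSome content here.\n\n## Subsection 1.1\nMore details.\n"
content, body, text = three_views(source)
```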

API reference

`MDExtractor`

Member	Description
`MDExtractor(markdown)`	Parse a string.
`MDExtractor.from_file(path, encoding="utf-8")`	Read & parse a file.
`e[""]` / `e.root`	Synthetic root section (whole document).
`e["Title"]` / `e[i]`	Top-level child by title or index.
`"Title" in e`	Membership test.
`iter(e)` / `len(e)`	Iterate top-level children / count them.
`.list()`	Top-level header titles.
`.get_section(*path)`	Strict multi-step descent (raises).
`.get(*path)`	Soft multi-step descent (null sentinel on miss).
`.find(title)`	All sections (any depth) with that title.
`.walk()` / `.headers()`	Depth-first iterator / list of every header.
`.tree()`	ASCII tree of the document's header structure.
`.to_list()`	Body flattened to strings (proxies to root).
`.to_dict()` / `.to_json(**kw)`	Serialise the tree (with body blocks).
`.to_text()`	Body rendered as plain text.
`.to_html(xpath=None, as_text=False)`	Body rendered as HTML, optionally XPath-filtered (`as_text=True` returns text values).
`.block(*indices)`	Soft index into root's body block tree (null Block on miss).
`.content`	Original Markdown source.

`Section`

Member	Description
`.title` / `.level`	Header text and depth (1–6, or 0 for root).
`.parent` / `.children`	Tree links.
`.path`	Title chain from top-level ancestor down to this node.
`.content` / `.body` / `.text`	Raw text views (see table above).
`.blocks`	Lazy-parsed body block tree.
`section["Title"]` / `section[i]`	Child by title or index (strict).
`section.get(*path)`	Soft multi-step descent (null sentinel on miss).
`section.get_section(*path)`	Strict multi-step descent.
`"Title" in section`	Membership test.
`iter(section)` / `len(section)`	Iterate / count direct children.
`bool(section)`	`False` only for the null sentinel returned by `get()`.
`.list()`	Direct child titles.
`.find(title)`	Recursive search.
`.walk()`	Depth-first iterator over self + descendants.
`.to_list()`	Body flattened to one string per top-level block / item.
`.to_dict()`	Nested dict — `blocks` (body) and `children` (header subsections).
`.to_json(**kw)`	`json.dumps` of `to_dict()`.
`.to_text()`	Body rendered as plain text (Markdown markers stripped).
`.to_html(xpath=None, as_text=False)`	Body rendered as HTML, optionally XPath-filtered (`as_text=True` returns text values).
`.block(*indices)`	Soft index walk into the body block tree (null Block on miss).
`.tree()`	ASCII tree of this subsection.
`str(section)`	Same as `.content`.

`Block`

A node in the body block tree.

Member	Description
`.kind`	One of `paragraph`, `list`, `ordered_list`, `list_item`, `code`, `blockquote`.
`.text`	The block's own text — raw, with inline Markdown markers preserved.
`.text_plain`	`.text` with inline markers stripped (`**bold**` → `bold`, `[t](u)` → `t`, …).
`.children`	Nested blocks (list items, sub-lists, indented paragraphs).
`.info`	Code-fence language, e.g. `"python"`.
`.get(*indices)`	Soft index walk into `.children` (null Block on miss). Chainable.
`bool(block)`	`False` only for the null sentinel returned by `get()` / `Section.block()`.
`.walk()`	Yield this block and every descendant.
`.to_dict()`	JSON-friendly nested dict.

Development

pip install -e ".[dev]"
pytest

License

See LICENSE.

Release history

Version	Released
0.2.0	May 13, 2026
0.1.1 (this version)	May 10, 2026
0.1.0	May 10, 2026
