Markdown Table Extractor

Robust structured data extraction from markdown text, built with literate programming using marimo notebooks.

✨ Features

✅ Proper separator detection - Handles :--:, ---:, :--- alignment syntax
✅ Caption detection - Finds "Table 3. Results" patterns above tables
✅ Continuation merging - Automatically merges "Table 3 (Continued)" tables
✅ HTML cleanup - Removes <br> and other artifacts from headers
✅ Sub-header handling - Merges multi-level headers properly
✅ LLM extraction - AI-powered fallback for complex edge cases
✅ Type-safe - Full type hints with py.typed marker
✅ Literate programming - Each module is a marimo notebook
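
As a rough illustration of what alignment-aware separator detection involves, here is a minimal sketch. It is not the package's code: the function name `separator_alignments` and its return values are hypothetical, and it assumes cells are delimited by unescaped pipes.

```python
import re
from typing import List, Optional

# One separator cell: optional leading colon, one or more dashes, optional trailing colon.
_SEP_CELL = re.compile(r"^(:?)-+(:?)$")


def separator_alignments(line: str) -> Optional[List[str]]:
    """Return per-column alignments if `line` is a separator row, else None.

    Hypothetical helper for illustration; not part of the package's API.
    """
    stripped = line.strip()
    if "-" not in stripped:
        return None
    # Drop the optional outer pipes, then split into cells.
    cells = [cell.strip() for cell in stripped.strip("|").split("|")]
    alignments = []
    for cell in cells:
        match = _SEP_CELL.match(cell)
        if match is None:
            return None  # any non-separator cell disqualifies the whole row
        left, right = match.group(1), match.group(2)
        if left and right:
            alignments.append("center")
        elif right:
            alignments.append("right")
        elif left:
            alignments.append("left")
        else:
            alignments.append("default")
    return alignments


print(separator_alignments("| :--: | ---: | :--- | --- |"))
# ['center', 'right', 'left', 'default']
print(separator_alignments("| Alice | 30 |"))
# None
```

Note that a bare `---` line (a horizontal rule) also passes this check as a one-column separator, so a real parser would need the surrounding header row for context.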

📚 Literate Programming with Marimo

Unlike traditional packages, every module is a marimo notebook:

src/markdown_table_extractor/core/
├── models.py      # 📓 Data classes notebook
├── parser.py      # 📓 Markdown parsing notebook  
├── cleaner.py     # 📓 Data cleaning notebook
├── merger.py      # 📓 Table merging notebook
└── extractor.py   # 📓 Main extraction notebook

Each file is simultaneously:

- A Python module: import normally with `from markdown_table_extractor import extract_tables`
- A notebook: edit interactively with `marimo edit src/.../parser.py`
- Documentation: read the code alongside explanations
- A script: run standalone with `python src/.../parser.py`

This follows the literate programming paradigm pioneered by Donald Knuth, similar to nbdev but with marimo's pure-Python notebooks.

Installation

# Using uv (recommended)
uv add markdown-table-extractor

# With LLM support (OpenAI/Anthropic)
uv add "markdown-table-extractor[llm]"

# Development with marimo
uv add "markdown-table-extractor[notebook]"

Quick Start

Simple Usage

from markdown_table_extractor import extract_tables

markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

tables = extract_tables(markdown)
print(tables[0])
#     Name  Age
# 0  Alice   30
# 1    Bob   25

Full API with Metadata

from markdown_table_extractor import extract_markdown_tables, TableMergeStrategy

# markdown_text: any string containing markdown tables
result = extract_markdown_tables(
    markdown_text,
    merge_strategy=TableMergeStrategy.IDENTICAL_HEADERS,
    detect_captions=True,
    skip_sub_headers=True
)

for table in result:
    print(f"Caption: {table.caption}")
    print(f"Shape: {table.dataframe.shape}")
    print(table.dataframe)
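
To make the idea behind `TableMergeStrategy.IDENTICAL_HEADERS` concrete, here is a minimal pure-Python sketch of merging a continuation table into the table before it. It is an assumption about the general technique, not the package's implementation: `SimpleTable` and `merge_identical_headers` are hypothetical names, and plain lists stand in for DataFrames.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleTable:
    """Hypothetical stand-in for an extracted table (not the package's class)."""
    headers: List[str]
    rows: List[List[str]] = field(default_factory=list)
    caption: Optional[str] = None


def merge_identical_headers(tables: List[SimpleTable]) -> List[SimpleTable]:
    """Append a table's rows to the previous table when the headers are identical."""
    merged: List[SimpleTable] = []
    for table in tables:
        if merged and merged[-1].headers == table.headers:
            # Continuation: extend the earlier table and keep its caption.
            merged[-1].rows.extend(table.rows)
        else:
            # Copy the lists so the input tables are not mutated.
            merged.append(
                SimpleTable(list(table.headers), [list(r) for r in table.rows], table.caption)
            )
    return merged


tables = [
    SimpleTable(["Name", "Age"], [["Alice", "30"]], "Table 3. Results"),
    SimpleTable(["Name", "Age"], [["Bob", "25"]], "Table 3 (Continued)"),
    SimpleTable(["City"], [["Oslo"]], "Table 4. Cities"),
]
result = merge_identical_headers(tables)
print(len(result), result[0].rows)
# 2 [['Alice', '30'], ['Bob', '25']]
```

This sketch merges on headers alone. The real package may also consult captions such as "Table 3 (Continued)" when deciding what to merge.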

Exploring the Notebooks

# Open any module as an interactive notebook
marimo edit src/markdown_table_extractor/core/parser.py

# Run the interactive demo
marimo edit src/markdown_table_extractor/core/extractor.py

API Reference

Core Functions

| Function | Description |
| --- | --- |
| `extract_tables(text)` | Simple API returning `list[DataFrame]` |
| `extract_markdown_tables(text, ...)` | Full API returning `ExtractionResult` |

Models

| Class | Description |
| --- | --- |
| `ExtractedTable` | Single table with metadata (caption, line numbers) |
| `ExtractionResult` | Collection of tables with errors and merge count |
| `TableMergeStrategy` | Enum: `NONE`, `IDENTICAL_HEADERS`, `COMPATIBLE_COLUMNS` |
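
The shape of these models can be pictured with the sketch below. Only the class names, the enum members, and the `caption` and `dataframe` attributes come from this README. The other field names (`start_line`, `end_line`, `tables`, `errors`, `merge_count`) and the enum values are hypothetical guesses for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Iterator, List, Optional


class TableMergeStrategy(Enum):
    NONE = "none"                              # enum values are assumed
    IDENTICAL_HEADERS = "identical_headers"
    COMPATIBLE_COLUMNS = "compatible_columns"


@dataclass
class ExtractedTable:
    dataframe: Any                      # a pandas DataFrame in the real package
    caption: Optional[str] = None
    start_line: Optional[int] = None    # hypothetical field name
    end_line: Optional[int] = None      # hypothetical field name


@dataclass
class ExtractionResult:
    tables: List[ExtractedTable] = field(default_factory=list)  # hypothetical
    errors: List[str] = field(default_factory=list)             # hypothetical
    merge_count: int = 0                                        # hypothetical

    def __iter__(self) -> Iterator[ExtractedTable]:
        # Lets callers write `for table in result:` as in the Full API example.
        return iter(self.tables)

    def __len__(self) -> int:
        return len(self.tables)


result = ExtractionResult(
    tables=[ExtractedTable(dataframe=[["Alice", 30]], caption="Table 1. People")]
)
for table in result:
    print(table.caption)
```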

Utilities (from parser.py notebook)

| Function | Description |
| --- | --- |
| `is_separator_row(line)` | Check if line is a table separator |
| `is_table_row(line)` | Check if line is any table row |
| `parse_table_row(line)` | Extract cell values from a row |
| `detect_caption(lines, start)` | Find table caption |
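
For readers who want a feel for what helpers like these do, here is a simplified sketch that reuses the names from the table above. The bodies are assumptions, not the code in `parser.py`: rows are assumed to start with a pipe, escaped pipes are not handled, and the `lookback` parameter is a hypothetical addition.

```python
import re
from typing import List, Optional

_SEPARATOR_CELL = re.compile(r"^:?-+:?$")
_CAPTION = re.compile(r"^\s*Table\s+\d+", re.IGNORECASE)


def is_table_row(line: str) -> bool:
    """Treat a line as a table row if it starts with a pipe (outer pipes assumed)."""
    return line.strip().startswith("|")


def parse_table_row(line: str) -> List[str]:
    """Split a row on pipes and trim each cell (escaped pipes are not handled)."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator_row(line: str) -> bool:
    """True when every cell looks like ---, :---, ---: or :--:."""
    return is_table_row(line) and all(
        _SEPARATOR_CELL.match(cell) for cell in parse_table_row(line)
    )


def detect_caption(lines: List[str], start: int, lookback: int = 3) -> Optional[str]:
    """Search up to `lookback` lines above index `start` for a 'Table N' caption."""
    for index in range(start - 1, max(start - 1 - lookback, -1), -1):
        candidate = lines[index].strip()
        if _CAPTION.match(candidate):
            return candidate
        if candidate:
            break  # reached ordinary text before finding a caption
    return None


lines = ["Table 3. Results", "", "| Name | Age |", "| --- | ---: |", "| Alice | 30 |"]
print(detect_caption(lines, 2))    # Table 3. Results
print(is_separator_row(lines[3]))  # True
print(parse_table_row(lines[4]))   # ['Alice', '30']
```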

Utilities (from cleaner.py notebook)

| Function | Description |
| --- | --- |
| `clean_column_name(name)` | Remove HTML artifacts from header |
| `headers_match(h1, h2)` | Check if headers are compatible |
| `normalize_headers(headers)` | Canonical form for comparison |
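
The cleaning helpers can be sketched along the same lines. The function names match the table above, but the bodies are assumptions: this version strips any HTML tag, collapses whitespace, and compares headers case-insensitively, which may be stricter or looser than what `cleaner.py` actually does.

```python
import re
from typing import List

_HTML_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_column_name(name: str) -> str:
    """Replace HTML tags such as <br> with spaces, then collapse whitespace."""
    without_tags = _HTML_TAG.sub(" ", name)
    return _WHITESPACE.sub(" ", without_tags).strip()


def normalize_headers(headers: List[str]) -> List[str]:
    """Canonical form for comparison: cleaned and lower-cased (an assumed rule)."""
    return [clean_column_name(header).lower() for header in headers]


def headers_match(h1: List[str], h2: List[str]) -> bool:
    """Treat two header rows as compatible when their canonical forms are equal."""
    return normalize_headers(h1) == normalize_headers(h2)


print(clean_column_name("Mean<br>Score"))
# Mean Score
print(headers_match(["Mean<br>Score", "N"], ["mean score", "n "]))
# True
```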

Development

# Clone and setup
git clone https://github.com/username/markdown-table-extractor
cd markdown-table-extractor
uv sync --all-extras

# Run tests
uv run pytest

# Run linter
uv run ruff check src tests

# Type checking
uv run mypy src

# Edit any module interactively
uv run marimo edit src/markdown_table_extractor/core/parser.py

Project Structure

markdown-table-extractor/
├── pyproject.toml              # Package configuration
├── README.md
├── src/
│   └── markdown_table_extractor/
│       ├── __init__.py         # Public API
│       ├── py.typed            # Type hints marker
│       └── core/
│           ├── __init__.py     # Re-exports from notebooks
│           ├── models.py       # 📓 Data classes notebook
│           ├── parser.py       # 📓 Parsing utilities notebook
│           ├── cleaner.py      # 📓 Cleaning utilities notebook
│           ├── merger.py       # 📓 Merging logic notebook
│           └── extractor.py    # 📓 Main extraction notebook
└── tests/
    └── test_extractor.py

Why Marimo Notebooks?

| Traditional | Marimo Literate |
| --- | --- |
| Code in .py, docs separate | Code + docs in same file |
| Read code, guess intent | Read explanation alongside code |
| Tests in separate files | Interactive tests in notebook |
| Static documentation | Interactive documentation |
| Jupyter: JSON format | Pure Python, Git-friendly |

Why Not nbdev?

| nbdev | marimo |
| --- | --- |
| Jupyter notebooks (JSON) | Pure Python files |
| Requires export step | Direct import |
| `#\|export` directives | |
| Complex build process | Just Python |

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Edit the notebooks, add tests, and submit PRs.

Release history

- 0.1.2 (Nov 27, 2025)
- 0.1.1 (Nov 27, 2025)
- 0.1.0 (Nov 27, 2025), this version

Download files

Source Distribution

markdown_table_extractor-0.1.0.tar.gz (46.1 kB), uploaded Nov 27, 2025

Built Distribution

markdown_table_extractor-0.1.0-py3-none-any.whl (55.7 kB), uploaded Nov 27, 2025, Python 3

File details

Details for the file markdown_table_extractor-0.1.0.tar.gz.

File metadata

Download URL: markdown_table_extractor-0.1.0.tar.gz
Upload date: Nov 27, 2025
Size: 46.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.0

File hashes

Hashes for markdown_table_extractor-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `26ba47633f0a1eb3ddca212b5043993ea984c7d5f078e8380f52bb2c99375f26` |
| MD5 | `e9dcb553bef1a035bb9f9c8bde82e4dd` |
| BLAKE2b-256 | `a1d3991fa00d7abd21950b82b7fd73580cd69688c3121030291a682b8d790fca` |

File details

Details for the file markdown_table_extractor-0.1.0-py3-none-any.whl.

File metadata

Download URL: markdown_table_extractor-0.1.0-py3-none-any.whl
Upload date: Nov 27, 2025
Size: 55.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.0

File hashes

Hashes for markdown_table_extractor-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `11607e8239f46c4fef59a963ca49becec9ce5830c705a8f1158cfa2ce678a69a` |
| MD5 | `ee49e222c96c26f1981493b7e5fe77f4` |
| BLAKE2b-256 | `289d113b2ee1717198a55f92e7a748c15fc227475c1028a65ac32785838a57e5` |
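
One way to use these digests is to hash a downloaded file locally and compare the result with the published SHA256 value. The sketch below uses only the standard library. The helper names `sha256_of` and `verify_download` are hypothetical, and the file is assumed to already be on disk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Compare the local digest with the published one, ignoring hex case."""
    return sha256_of(path) == expected_sha256.lower()
```

For the wheel above, this would be called as `verify_download(Path("markdown_table_extractor-0.1.0-py3-none-any.whl"), "11607e8239f46c4fef59a963ca49becec9ce5830c705a8f1158cfa2ce678a69a")`, with the path adjusted to wherever the file was saved.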
