Markdown Table Extractor

Python 3.11+ | uv | marimo | License: MIT

Robust structured data extraction from markdown text, built with literate programming using marimo notebooks.

✨ Features

  • Proper separator detection - Handles :--:, ---:, :--- alignment syntax
  • Caption detection - Finds "Table 3. Results" patterns above tables
  • Continuation merging - Automatically merges "Table 3 (Continued)" tables
  • HTML cleanup - Removes <br> and other artifacts from headers
  • Sub-header handling - Merges multi-level headers properly
  • LLM extraction - AI-powered fallback for complex edge cases
  • Type-safe - Full type hints with py.typed marker
  • Literate programming - Each module is a marimo notebook
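To make the separator-detection feature concrete, here is a minimal sketch of how alignment-aware separator rows could be recognized with the standard library. The names `looks_like_separator` and `column_alignment` are hypothetical and are not part of this package's API; the package's own logic may differ.

```python
import re

# One separator cell: optional leading colon, one or more dashes, optional trailing colon.
_SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def looks_like_separator(line: str) -> bool:
    """Return True if every cell in the row is a dash run with optional colons."""
    stripped = line.strip()
    if "|" not in stripped:
        return False
    # Drop the optional outer pipes, then split into cells.
    cells = [cell.strip() for cell in stripped.strip("|").split("|")]
    return bool(cells) and all(_SEPARATOR_CELL.match(cell) for cell in cells)


def column_alignment(cell: str) -> str:
    """Map a valid separator cell to its alignment: left, right, center, or default."""
    cell = cell.strip()
    if cell.startswith(":") and cell.endswith(":"):
        return "center"
    if cell.endswith(":"):
        return "right"
    if cell.startswith(":"):
        return "left"
    return "default"
```

With this sketch, `looks_like_separator("| :--: | ---: | :--- |")` is true, while an ordinary row such as `| Name | Age |` is rejected.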

📚 Literate Programming with Marimo

Unlike in a traditional package, every module here is a marimo notebook:

src/markdown_table_extractor/core/
├── models.py      # 📓 Data classes notebook
├── parser.py      # 📓 Markdown parsing notebook  
├── cleaner.py     # 📓 Data cleaning notebook
├── merger.py      # 📓 Table merging notebook
└── extractor.py   # 📓 Main extraction notebook

Each file is simultaneously:

  • A Python module - Import normally: from markdown_table_extractor import extract_tables
  • A notebook - Edit interactively: marimo edit src/.../parser.py
  • Documentation - Read the code alongside explanations
  • A script - Run standalone: python src/.../parser.py

This follows the literate programming paradigm pioneered by Donald Knuth, similar to nbdev but with marimo's pure-Python notebooks.

Installation

# Using uv (recommended)
uv add markdown-table-extractor

# With LLM support (OpenAI/Anthropic)
uv add "markdown-table-extractor[llm]"

# Development with marimo
uv add "markdown-table-extractor[notebook]"

Quick Start

Simple Usage

from markdown_table_extractor import extract_tables

markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

tables = extract_tables(markdown)
print(tables[0])
#     Name  Age
# 0  Alice   30
# 1    Bob   25

Full API with Metadata

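from markdown_table_extractor import extract_markdown_tables, TableMergeStrategy

# Sample input: a captioned table (any markdown string works here)
markdown_text = """
Table 1. Ages

| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

result = extract_markdown_tables(
    markdown_text,
    merge_strategy=TableMergeStrategy.IDENTICAL_HEADERS,
    detect_captions=True,
    skip_sub_headers=True,
)

# Each extracted table carries its caption and a DataFrame
for table in result:
    print(f"Caption: {table.caption}")
    print(f"Shape: {table.dataframe.shape}")
    print(table.dataframe)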

Exploring the Notebooks

# Open any module as an interactive notebook
marimo edit src/markdown_table_extractor/core/parser.py

# Run the interactive demo
marimo edit src/markdown_table_extractor/core/extractor.py

API Reference

Core Functions

| Function | Description |
| --- | --- |
| extract_tables(text) | Simple API returning list[DataFrame] |
| extract_markdown_tables(text, ...) | Full API returning ExtractionResult |

Models

| Class | Description |
| --- | --- |
| ExtractedTable | Single table with metadata (caption, line numbers) |
| ExtractionResult | Collection of tables with errors and merge count |
| TableMergeStrategy | Enum: NONE, IDENTICAL_HEADERS, COMPATIBLE_COLUMNS |
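As an illustration of what a merge by identical headers could look like, the sketch below appends a table's rows to the preceding table when both share the same header row. It assumes that IDENTICAL_HEADERS means "merge adjacent tables whose headers are equal", which this README does not spell out. `SimpleTable` and `merge_identical_headers` are hypothetical stand-ins, not the package's `ExtractedTable` or its merger.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleTable:
    """Hypothetical stand-in for an extracted table (not the package's ExtractedTable)."""

    headers: list[str]
    rows: list[list[str]] = field(default_factory=list)


def merge_identical_headers(tables: list[SimpleTable]) -> list[SimpleTable]:
    """Append each table's rows to the previous table when the headers are equal."""
    merged: list[SimpleTable] = []
    for table in tables:
        if merged and merged[-1].headers == table.headers:
            # Same header row as the previous table: treat it as a continuation.
            merged[-1].rows.extend(table.rows)
        else:
            # Copy so the caller's tables are not mutated.
            merged.append(
                SimpleTable(list(table.headers), [list(row) for row in table.rows])
            )
    return merged
```

A looser strategy such as COMPATIBLE_COLUMNS would presumably relax the equality check, but its exact rules are not described here.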

Utilities (from parser.py notebook)

| Function | Description |
| --- | --- |
| is_separator_row(line) | Check if line is a table separator |
| is_table_row(line) | Check if line is any table row |
| parse_table_row(line) | Extract cell values from a row |
| detect_caption(lines, start) | Find table caption |

Utilities (from cleaner.py notebook)

| Function | Description |
| --- | --- |
| clean_column_name(name) | Remove HTML artifacts from header |
| headers_match(h1, h2) | Check if headers are compatible |
| normalize_headers(headers) | Canonical form for comparison |
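The cleaning step can be pictured with a short sketch like the one below, which strips `<br>` and other tags from a header and compares headers in a case-insensitive canonical form. The helper names (`clean_header`, `canonical_headers`, `same_headers`) and the choice of case folding are assumptions for illustration, not the package's implementation of the functions listed above.

```python
import re

_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
_ANY_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_header(name: str) -> str:
    """Replace <br> with a space, drop other tags, and collapse whitespace."""
    text = _BR_TAG.sub(" ", name)
    text = _ANY_TAG.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def canonical_headers(headers: list[str]) -> tuple[str, ...]:
    """Return cleaned, case-folded headers suitable for comparison."""
    return tuple(clean_header(header).casefold() for header in headers)


def same_headers(first: list[str], second: list[str]) -> bool:
    """Compare two header rows by their canonical form."""
    return canonical_headers(first) == canonical_headers(second)
```

Under these assumptions, `Mean<br/>Score` and `mean score` count as the same header, which is the kind of match continuation merging relies on.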

Development

# Clone and setup
git clone https://github.com/username/markdown-table-extractor
cd markdown-table-extractor
uv sync --all-extras

# Run tests
uv run pytest

# Run linter
uv run ruff check src tests

# Type checking
uv run mypy src

# Edit any module interactively
uv run marimo edit src/markdown_table_extractor/core/parser.py

Project Structure

markdown-table-extractor/
├── pyproject.toml              # Package configuration
├── README.md
├── src/
│   └── markdown_table_extractor/
│       ├── __init__.py         # Public API
│       ├── py.typed            # Type hints marker
│       └── core/
│           ├── __init__.py     # Re-exports from notebooks
│           ├── models.py       # 📓 Data classes notebook
│           ├── parser.py       # 📓 Parsing utilities notebook
│           ├── cleaner.py      # 📓 Cleaning utilities notebook
│           ├── merger.py       # 📓 Merging logic notebook
│           └── extractor.py    # 📓 Main extraction notebook
└── tests/
    └── test_extractor.py

Why Marimo Notebooks?

| Traditional | Marimo Literate |
| --- | --- |
| Code in .py, docs separate | Code + docs in same file |
| Read code, guess intent | Read explanation alongside code |
| Tests in separate files | Interactive tests in notebook |
| Static documentation | Interactive documentation |
| Jupyter: JSON format | Pure Python, Git-friendly |

Why Not nbdev?

| nbdev | marimo |
| --- | --- |
| Jupyter notebooks (JSON) | Pure Python files |
| Requires export step | Direct import |
| `# export` directives | |
| Complex build process | Just Python |

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Edit the notebooks, add tests, and submit PRs.
