Markdown Table Extractor

Python 3.11+ | uv | marimo | License: MIT

Robust structured data extraction from markdown text, built with literate programming using marimo notebooks.

✨ Features

  • Proper separator detection - Handles :--:, ---:, :--- alignment syntax
  • Caption detection - Finds "Table 3. Results" patterns above tables
  • Continuation merging - Automatically merges "Table 3 (Continued)" tables
  • HTML cleanup - Removes <br> and other artifacts from headers
  • Sub-header handling - Merges multi-level headers properly
  • LLM extraction - AI-powered fallback for complex edge cases
  • Type-safe - Full type hints with py.typed marker
  • Literate programming - Each module is a marimo notebook

📚 Literate Programming with Marimo

Unlike traditional packages, every module is a marimo notebook:

src/markdown_table_extractor/core/
├── models.py      # 📓 Data classes notebook
├── parser.py      # 📓 Markdown parsing notebook  
├── cleaner.py     # 📓 Data cleaning notebook
├── merger.py      # 📓 Table merging notebook
└── extractor.py   # 📓 Main extraction notebook

Each file is simultaneously:

  • A Python module - Import normally: from markdown_table_extractor import extract_tables
  • A notebook - Edit interactively: marimo edit src/.../parser.py
  • Documentation - Read the code alongside explanations
  • A script - Run standalone: python src/.../parser.py

This follows the literate programming paradigm pioneered by Donald Knuth, similar to nbdev but with marimo's pure-Python notebooks.

Installation

# Using uv (recommended)
uv add markdown-table-extractor

# With LLM support (OpenAI/Anthropic)
uv add "markdown-table-extractor[llm]"

# Development with marimo
uv add "markdown-table-extractor[notebook]"

Quick Start

Simple Usage

from markdown_table_extractor import extract_tables

markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

tables = extract_tables(markdown)
print(tables[0])
#    Name  Age
# 0  Alice   30
# 1    Bob   25

Full API with Metadata

from markdown_table_extractor import extract_markdown_tables, TableMergeStrategy

# markdown_text is any markdown document held in a string
markdown_text = """
| Name | Age |
| --- | --- |
| Alice | 30 |
"""

result = extract_markdown_tables(
    markdown_text,
    merge_strategy=TableMergeStrategy.IDENTICAL_HEADERS,
    detect_captions=True,
    skip_sub_headers=True
)

for table in result:
    print(f"Caption: {table.caption}")
    print(f"Shape: {table.dataframe.shape}")
    print(table.dataframe)

Exploring the Notebooks

# Open any module as an interactive notebook
marimo edit src/markdown_table_extractor/core/parser.py

# Run the interactive demo
marimo edit src/markdown_table_extractor/core/extractor.py

API Reference

Core Functions

| Function | Description |
| --- | --- |
| extract_tables(text) | Simple API returning list[DataFrame] |
| extract_markdown_tables(text, ...) | Full API returning ExtractionResult |

Models

| Class | Description |
| --- | --- |
| ExtractedTable | Single table with metadata (caption, line numbers) |
| ExtractionResult | Collection of tables with errors and merge count |
| TableMergeStrategy | Enum: NONE, IDENTICAL_HEADERS, COMPATIBLE_COLUMNS |
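
As a rough illustration of what a merge strategy does, the sketch below merges adjacent tables whose headers are identical. MergeStrategy, SimpleTable and merge_tables are hypothetical stand-ins written for this example, not the package's classes. Restricting merges to adjacent tables is an assumption, and COMPATIBLE_COLUMNS is left unimplemented because its rules are not described here.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class MergeStrategy(Enum):
    """Hypothetical mirror of the three documented strategy names."""
    NONE = auto()
    IDENTICAL_HEADERS = auto()
    COMPATIBLE_COLUMNS = auto()


@dataclass
class SimpleTable:
    """Hypothetical stand-in for an extracted table (no pandas needed)."""
    headers: list[str]
    rows: list[list[str]] = field(default_factory=list)


def merge_tables(tables: list[SimpleTable], strategy: MergeStrategy) -> list[SimpleTable]:
    """Merge adjacent tables with equal headers; inputs are not modified."""
    if strategy is MergeStrategy.NONE:
        return list(tables)
    if strategy is not MergeStrategy.IDENTICAL_HEADERS:
        # The compatibility rules are not specified, so this sketch stops here.
        raise NotImplementedError(f"{strategy.name} is not covered by this sketch")
    merged: list[SimpleTable] = []
    for table in tables:
        if merged and merged[-1].headers == table.headers:
            previous = merged[-1]
            merged[-1] = SimpleTable(previous.headers, previous.rows + table.rows)
        else:
            merged.append(SimpleTable(list(table.headers), list(table.rows)))
    return merged


first = SimpleTable(["Name", "Age"], [["Alice", "30"]])
second = SimpleTable(["Name", "Age"], [["Bob", "25"]])
other = SimpleTable(["City"], [["Oslo"]])
result = merge_tables([first, second, other], MergeStrategy.IDENTICAL_HEADERS)
print(len(result), result[0].rows)
```

Counting how many times the merge branch runs would give a figure comparable to the merge count that ExtractionResult is described as carrying.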

Utilities (from parser.py notebook)

| Function | Description |
| --- | --- |
| is_separator_row(line) | Check if line is a table separator |
| is_table_row(line) | Check if line is any table row |
| parse_table_row(line) | Extract cell values from a row |
| detect_caption(lines, start) | Find table caption |
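
The row-level checks listed above can be approximated with a few lines of standard-library Python. The functions below are independent sketches with hypothetical names (split_cells, looks_like_table_row, looks_like_separator_row), not the code in parser.py. They assume rows start with a pipe and do not handle escaped pipes inside cells.

```python
import re

# A separator cell is one or more dashes with an optional colon on either
# side, which covers the ---, :---, ---: and :--: alignment forms.
SEPARATOR_CELL_RE = re.compile(r"^:?-+:?$")


def split_cells(line: str) -> list[str]:
    """Split a pipe-delimited row into trimmed cell strings."""
    stripped = line.strip()
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return [cell.strip() for cell in stripped.split("|")]


def looks_like_table_row(line: str) -> bool:
    """Assumption: a table row starts with a pipe and contains at least two."""
    stripped = line.strip()
    return stripped.startswith("|") and stripped.count("|") >= 2


def looks_like_separator_row(line: str) -> bool:
    """True when every cell of a table row is a dash/colon separator cell."""
    if not looks_like_table_row(line):
        return False
    return all(SEPARATOR_CELL_RE.match(cell) for cell in split_cells(line))


print(split_cells("| Alice | 30 |"))                       # ['Alice', '30']
print(looks_like_separator_row("| :--: | ---: | :--- |"))  # True
print(looks_like_separator_row("| Name | Age |"))          # False
```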

Utilities (from cleaner.py notebook)

| Function | Description |
| --- | --- |
| clean_column_name(name) | Remove HTML artifacts from header |
| headers_match(h1, h2) | Check if headers are compatible |
| normalize_headers(headers) | Canonical form for comparison |
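
For a sense of what header cleaning and comparison involve, here is a small sketch using only the standard library. The helpers strip_html, canonical_headers and same_headers are hypothetical. Treating headers as equal after tag removal, whitespace collapsing and case folding is an assumption about what "compatible" could mean, not the rule used in cleaner.py.

```python
import re

# Naive tag matcher: good enough for <br>, <br/>, <sup>...</sup> in headers,
# but it would also eat text such as "a < b and c > d".
HTML_TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def strip_html(name: str) -> str:
    """Replace HTML tags with spaces, then collapse runs of whitespace."""
    without_tags = HTML_TAG_RE.sub(" ", name)
    return WHITESPACE_RE.sub(" ", without_tags).strip()


def canonical_headers(headers: list[str]) -> list[str]:
    """Cleaned, case-folded headers used only for comparison."""
    return [strip_html(header).casefold() for header in headers]


def same_headers(first: list[str], second: list[str]) -> bool:
    """Assumed compatibility rule: canonical forms are equal, in order."""
    return canonical_headers(first) == canonical_headers(second)


print(strip_html("Mean<br>Score"))                                   # Mean Score
print(same_headers(["Name", "Mean<br>Score"], ["name", "Mean Score"]))  # True
print(same_headers(["Name", "Age"], ["Name", "City"]))               # False
```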

Development

# Clone and setup
git clone https://github.com/username/markdown-table-extractor
cd markdown-table-extractor
uv sync --all-extras

# Run tests
uv run pytest

# Run linter
uv run ruff check src tests

# Type checking
uv run mypy src

# Edit any module interactively
uv run marimo edit src/markdown_table_extractor/core/parser.py

Project Structure

markdown-table-extractor/
├── pyproject.toml              # Package configuration
├── README.md
├── src/
│   └── markdown_table_extractor/
│       ├── __init__.py         # Public API
│       ├── py.typed            # Type hints marker
│       └── core/
│           ├── __init__.py     # Re-exports from notebooks
│           ├── models.py       # 📓 Data classes notebook
│           ├── parser.py       # 📓 Parsing utilities notebook
│           ├── cleaner.py      # 📓 Cleaning utilities notebook
│           ├── merger.py       # 📓 Merging logic notebook
│           └── extractor.py    # 📓 Main extraction notebook
└── tests/
    └── test_extractor.py

Why Marimo Notebooks?

| Traditional | Marimo Literate |
| --- | --- |
| Code in .py, docs separate | Code + docs in same file |
| Read code, guess intent | Read explanation alongside code |
| Tests in separate files | Interactive tests in notebook |
| Static documentation | Interactive documentation |
| Jupyter: JSON format | Pure Python, Git-friendly |

Why Not nbdev?

| nbdev | marimo |
| --- | --- |
| Jupyter notebooks (JSON) | Pure Python files |
| Requires export step | Direct import |
| `# export` directives | |
| Complex build process | Just Python |

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Edit the notebooks, add tests, and submit PRs.
