Markdown Table Extractor

Python 3.11+ | uv | marimo | License: MIT

Robust structured data extraction from markdown text, built with literate programming using marimo notebooks.

✨ Features

  • Proper separator detection - Handles :--:, ---:, :--- alignment syntax
  • Caption detection - Finds "Table 3. Results" patterns above tables
  • Continuation merging - Automatically merges "Table 3 (Continued)" tables
  • HTML cleanup - Removes <br> and other artifacts from headers
  • Sub-header handling - Merges multi-level headers properly
  • LLM extraction - AI-powered fallback for complex edge cases
  • Type-safe - Full type hints with py.typed marker
  • Literate programming - Each module is a marimo notebook
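To make the separator-detection feature concrete, here is a minimal sketch of how a separator row with alignment markers could be recognized. The names `looks_like_separator` and `separator_alignments` are hypothetical and chosen for illustration; this is not the package's own `is_separator_row`, whose rules may differ.

```python
import re

# A separator cell is one or more dashes with an optional colon on either side.
_SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def _cells(line: str) -> list[str]:
    """Split a row on pipes after dropping the outer pipes and whitespace."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def looks_like_separator(line: str) -> bool:
    """Return True if the line has a pipe and every cell is a dash run."""
    if "|" not in line:
        return False  # a bare '---' is more likely a horizontal rule
    return all(_SEPARATOR_CELL.match(cell) for cell in _cells(line))


def separator_alignments(line: str) -> list[str]:
    """Map each separator cell to 'left', 'right', 'center', or 'default'."""
    alignments = []
    for cell in _cells(line):
        left, right = cell.startswith(":"), cell.endswith(":")
        if left and right:
            alignments.append("center")
        elif right:
            alignments.append("right")
        elif left:
            alignments.append("left")
        else:
            alignments.append("default")
    return alignments


print(looks_like_separator("| :--: | ---: | :--- |"))
print(separator_alignments("| :--: | ---: | :--- | --- |"))
```

Requiring at least one pipe is a design choice in this sketch: it keeps a plain `---` line from being mistaken for a one-column table separator.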

📚 Literate Programming with Marimo

Unlike traditional packages, every module is a marimo notebook:

src/markdown_table_extractor/core/
├── models.py      # 📓 Data classes notebook
├── parser.py      # 📓 Markdown parsing notebook  
├── cleaner.py     # 📓 Data cleaning notebook
├── merger.py      # 📓 Table merging notebook
└── extractor.py   # 📓 Main extraction notebook

Each file is simultaneously:

  • A Python module - Import normally: from markdown_table_extractor import extract_tables
  • A notebook - Edit interactively: marimo edit src/.../parser.py
  • Documentation - Read the code alongside explanations
  • A script - Run standalone: python src/.../parser.py

This follows the literate programming paradigm pioneered by Donald Knuth, similar to nbdev but with marimo's pure-Python notebooks.

Installation

# Using uv (recommended)
uv add markdown-table-extractor

# With LLM support (OpenAI/Anthropic)
uv add "markdown-table-extractor[llm]"

# Development with marimo
uv add "markdown-table-extractor[notebook]"

Quick Start

Simple Usage

from markdown_table_extractor import extract_tables

markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

tables = extract_tables(markdown)
print(tables[0])
#    Name  Age
# 0  Alice   30
# 1    Bob   25

Full API with Metadata

from markdown_table_extractor import extract_markdown_tables, TableMergeStrategy

# markdown_text is any string containing markdown tables (for example, a file's contents)
result = extract_markdown_tables(
    markdown_text,
    merge_strategy=TableMergeStrategy.IDENTICAL_HEADERS,
    detect_captions=True,
    skip_sub_headers=True
)

for table in result:
    print(f"Caption: {table.caption}")
    print(f"Shape: {table.dataframe.shape}")
    print(table.dataframe)
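The caption and merge options above can be pictured with a small standalone sketch. It assumes captions look like "Table 3. Results" or "Table 3 (Continued)" and that a continuation is merged only when its headers are identical to the previous table's. `SketchTable`, `parse_caption`, and `merge_continuations` are hypothetical names for illustration, not part of the package, and the real matching rules may be broader.

```python
import copy
import re
from dataclasses import dataclass, field
from typing import Optional

# Illustrative caption pattern: "Table <n>", optional "(Continued)", optional title.
CAPTION_RE = re.compile(
    r"^Table\s+(\d+)\s*(\(Continued\))?[.:]?\s*(.*)$", re.IGNORECASE
)


@dataclass
class SketchTable:
    number: int
    headers: list[str]
    rows: list[list[str]] = field(default_factory=list)
    continued: bool = False


def parse_caption(line: str) -> Optional[tuple[int, bool, str]]:
    """Return (table number, is_continuation, title) or None if not a caption."""
    match = CAPTION_RE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), bool(match.group(2)), match.group(3).strip()


def merge_continuations(tables: list[SketchTable]) -> list[SketchTable]:
    """Append a continuation's rows to the previous table when headers are identical."""
    merged: list[SketchTable] = []
    for table in tables:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and table.continued
            and table.number == prev.number
            and table.headers == prev.headers
        ):
            prev.rows.extend(list(row) for row in table.rows)
        else:
            merged.append(copy.deepcopy(table))  # leave the inputs untouched
    return merged


print(parse_caption("Table 3. Results"))
print(parse_caption("Table 3 (Continued)"))
```

Copying each table before merging keeps the input list unchanged, which makes the merge step easier to test in isolation.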

Exploring the Notebooks

# Open any module as an interactive notebook
marimo edit src/markdown_table_extractor/core/parser.py

# Run the interactive demo
marimo edit src/markdown_table_extractor/core/extractor.py

API Reference

Core Functions

| Function | Description |
| --- | --- |
| extract_tables(text) | Simple API returning list[DataFrame] |
| extract_markdown_tables(text, ...) | Full API returning ExtractionResult |

Models

| Class | Description |
| --- | --- |
| ExtractedTable | Single table with metadata (caption, line numbers) |
| ExtractionResult | Collection of tables with errors and merge count |
| TableMergeStrategy | Enum: NONE, IDENTICAL_HEADERS, COMPATIBLE_COLUMNS |

Utilities (from parser.py notebook)

| Function | Description |
| --- | --- |
| is_separator_row(line) | Check if line is a table separator |
| is_table_row(line) | Check if line is any table row |
| parse_table_row(line) | Extract cell values from a row |
| detect_caption(lines, start) | Find table caption |
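As a rough picture of what row parsing involves, the sketch below splits a pipe-delimited row into trimmed cells and treats a backslash-escaped pipe as literal text. `split_row` is a hypothetical name used for illustration; it is not the package's `parse_table_row`, which may handle more cases.

```python
def split_row(line: str) -> list[str]:
    """Split a markdown table row into trimmed cells, keeping escaped pipes."""
    text = line.strip()
    # Drop the optional outer pipes, but not an escaped trailing pipe.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|") and not text.endswith("\\|"):
        text = text[:-1]

    cells: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] == "|":
            current.append("|")  # escaped pipe is part of the cell text
            i += 2
            continue
        if ch == "|":
            cells.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    cells.append("".join(current).strip())
    return cells


print(split_row("| Alice | 30 |"))
print(split_row(r"| a \| b | c |"))
```

Handling the escape explicitly matters because a naive `line.split("|")` would break any cell that contains a literal pipe.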

Utilities (from cleaner.py notebook)

| Function | Description |
| --- | --- |
| clean_column_name(name) | Remove HTML artifacts from header |
| headers_match(h1, h2) | Check if headers are compatible |
| normalize_headers(headers) | Canonical form for comparison |
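The cleaning step can be sketched as follows, assuming "HTML artifacts" means tags such as `<br>` and character entities such as `&nbsp;`, and that header comparison ignores case and spacing. `clean_header` and `headers_equal` are hypothetical names for illustration; the package's own `clean_column_name` and `headers_match` may apply different rules.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def clean_header(name: str) -> str:
    """Replace HTML tags with spaces, decode entities, and collapse whitespace."""
    without_tags = _TAG_RE.sub(" ", name)
    decoded = html.unescape(without_tags)
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join(decoded.split())


def headers_equal(first: list[str], second: list[str]) -> bool:
    """Compare two header lists after cleaning and case-folding each name."""
    def normalize(headers: list[str]) -> list[str]:
        return [clean_header(h).casefold() for h in headers]

    return normalize(first) == normalize(second)


print(clean_header("Mean<br>(SD)"))
print(headers_equal(["Name<br>", "AGE"], ["name", "Age"]))
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in a header is kept as text rather than stripped as a tag.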

Development

# Clone and setup
git clone https://github.com/username/markdown-table-extractor
cd markdown-table-extractor
uv sync --all-extras

# Run tests
uv run pytest

# Run linter
uv run ruff check src tests

# Type checking
uv run mypy src

# Edit any module interactively
uv run marimo edit src/markdown_table_extractor/core/parser.py

Project Structure

markdown-table-extractor/
├── pyproject.toml              # Package configuration
├── README.md
├── src/
│   └── markdown_table_extractor/
│       ├── __init__.py         # Public API
│       ├── py.typed            # Type hints marker
│       └── core/
│           ├── __init__.py     # Re-exports from notebooks
│           ├── models.py       # 📓 Data classes notebook
│           ├── parser.py       # 📓 Parsing utilities notebook
│           ├── cleaner.py      # 📓 Cleaning utilities notebook
│           ├── merger.py       # 📓 Merging logic notebook
│           └── extractor.py    # 📓 Main extraction notebook
└── tests/
    └── test_extractor.py

Why Marimo Notebooks?

| Traditional | Marimo Literate |
| --- | --- |
| Code in .py, docs separate | Code + docs in same file |
| Read code, guess intent | Read explanation alongside code |
| Tests in separate files | Interactive tests in notebook |
| Static documentation | Interactive documentation |
| Jupyter: JSON format | Pure Python, Git-friendly |

Why Not nbdev?

| nbdev | marimo |
| --- | --- |
| Jupyter notebooks (JSON) | Pure Python files |
| Requires export step | Direct import |
| `# export` directives | |
| Complex build process | Just Python |

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Edit the notebooks, add tests, and submit PRs.
