Markdown Table Extractor
Robust structured data extraction from markdown text, built with literate programming using marimo notebooks.
✨ Features
- ✅ Proper separator detection - Handles `:--:`, `---:`, `:---` alignment syntax
- ✅ Caption detection - Finds "Table 3. Results" patterns above tables
- ✅ Continuation merging - Automatically merges "Table 3 (Continued)" tables
- ✅ HTML cleanup - Removes `<br>` and other artifacts from headers
- ✅ Sub-header handling - Merges multi-level headers properly
- ✅ LLM extraction - AI-powered fallback for complex edge cases
- ✅ Type-safe - Full type hints with `py.typed` marker
- ✅ Literate programming - Each module is a marimo notebook
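To make the separator feature concrete, here is a minimal sketch of how alignment-aware separator rows could be recognized. The names `looks_like_separator` and `column_alignments` are hypothetical and are not part of the package's API; the package's own `is_separator_row` may work differently.

```python
import re

# Hypothetical pattern for one separator cell: "---", ":--", "--:" or ":-:".
_SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def looks_like_separator(line: str) -> bool:
    """Return True if every cell is a run of dashes with optional colons."""
    stripped = line.strip()
    if "-" not in stripped or "|" not in stripped:
        return False
    cells = [cell.strip() for cell in stripped.strip("|").split("|")]
    return all(_SEPARATOR_CELL.match(cell) for cell in cells)


def column_alignments(line: str) -> list[str]:
    """Map each separator cell to 'left', 'right', 'center' or 'default'."""
    alignments = []
    for cell in line.strip().strip("|").split("|"):
        cell = cell.strip()
        if cell.startswith(":") and cell.endswith(":"):
            alignments.append("center")
        elif cell.endswith(":"):
            alignments.append("right")
        elif cell.startswith(":"):
            alignments.append("left")
        else:
            alignments.append("default")
    return alignments
```

Checking every cell, rather than the line as a whole, is what keeps a data row such as `| a-b | c |` from being mistaken for a separator.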
📚 Literate Programming with Marimo
Unlike traditional packages, every module is a marimo notebook:
src/markdown_table_extractor/core/
├── models.py # 📓 Data classes notebook
├── parser.py # 📓 Markdown parsing notebook
├── cleaner.py # 📓 Data cleaning notebook
├── merger.py # 📓 Table merging notebook
└── extractor.py # 📓 Main extraction notebook
Each file is simultaneously:
- A Python module - Import normally: `from markdown_table_extractor import extract_tables`
- A notebook - Edit interactively: `marimo edit src/.../parser.py`
- Documentation - Read the code alongside explanations
- A script - Run standalone: `python src/.../parser.py`
This follows the literate programming paradigm pioneered by Donald Knuth, similar to nbdev but with marimo's pure-Python notebooks.
Installation
# Using uv (recommended)
uv add markdown-table-extractor
# With LLM support (OpenAI/Anthropic)
uv add "markdown-table-extractor[llm]"
# Development with marimo
uv add "markdown-table-extractor[notebook]"
Quick Start
Simple Usage
from markdown_table_extractor import extract_tables
markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""
tables = extract_tables(markdown)
print(tables[0])
# Name Age
# 0 Alice 30
# 1 Bob 25
Full API with Metadata
from markdown_table_extractor import extract_markdown_tables, TableMergeStrategy
result = extract_markdown_tables(
    markdown_text,
    merge_strategy=TableMergeStrategy.IDENTICAL_HEADERS,
    detect_captions=True,
    skip_sub_headers=True,
)
for table in result:
    print(f"Caption: {table.caption}")
    print(f"Shape: {table.dataframe.shape}")
    print(table.dataframe)
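As an illustration of what merging on identical headers could involve, the sketch below folds a "(Continued)" table into the table before it when both share the same header row. It assumes that this is roughly what `TableMergeStrategy.IDENTICAL_HEADERS` does. `SketchTable` and `merge_continuations` are hypothetical stand-ins that use plain lists instead of the DataFrames the package returns.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchTable:
    """Hypothetical stand-in for an extracted table (rows as plain lists)."""
    caption: Optional[str]
    headers: list
    rows: list = field(default_factory=list)


# Assumed marker for continuation captions such as "Table 3 (Continued)".
_CONTINUED = re.compile(r"\(\s*continued\s*\)", re.IGNORECASE)


def merge_continuations(tables: list) -> list:
    """Append a continuation's rows to the previous table if headers match."""
    merged = []
    for table in tables:
        previous = merged[-1] if merged else None
        is_continuation = bool(table.caption and _CONTINUED.search(table.caption))
        if previous is not None and is_continuation and previous.headers == table.headers:
            previous.rows.extend(table.rows)
        else:
            # Copy so the input tables are never mutated.
            merged.append(SketchTable(table.caption, list(table.headers), list(table.rows)))
    return merged
```

Requiring both the caption marker and matching headers is a conservative choice for this sketch: two unrelated tables that happen to share column names stay separate.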
Exploring the Notebooks
# Open any module as an interactive notebook
marimo edit src/markdown_table_extractor/core/parser.py
# Run the interactive demo
marimo edit src/markdown_table_extractor/core/extractor.py
API Reference
Core Functions
| Function | Description |
|---|---|
| `extract_tables(text)` | Simple API returning list[DataFrame] |
| `extract_markdown_tables(text, ...)` | Full API returning ExtractionResult |
Models
| Class | Description |
|---|---|
| `ExtractedTable` | Single table with metadata (caption, line numbers) |
| `ExtractionResult` | Collection of tables with errors and merge count |
| `TableMergeStrategy` | Enum: NONE, IDENTICAL_HEADERS, COMPATIBLE_COLUMNS |
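For orientation, the descriptions above could correspond to shapes like the following. This is a guess at the structure, not the package's actual definitions: every class name carries a `Sketch` prefix, and field names such as `start_line`, `errors`, and `merge_count` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Iterator, Optional


class SketchMergeStrategy(Enum):
    """Member names taken from the table above; the values are arbitrary."""
    NONE = auto()
    IDENTICAL_HEADERS = auto()
    COMPATIBLE_COLUMNS = auto()


@dataclass
class SketchExtractedTable:
    rows: list                        # stand-in for the pandas DataFrame
    caption: Optional[str] = None
    start_line: Optional[int] = None  # hypothetical field name
    end_line: Optional[int] = None    # hypothetical field name


@dataclass
class SketchExtractionResult:
    tables: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    merge_count: int = 0

    def __iter__(self) -> Iterator[SketchExtractedTable]:
        # Lets callers write `for table in result:`.
        return iter(self.tables)

    def __len__(self) -> int:
        return len(self.tables)
```

Making the result object iterable is one way a loop like `for table in result:` can work while errors and the merge count remain available on the same object.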
Utilities (from parser.py notebook)
| Function | Description |
|---|---|
| `is_separator_row(line)` | Check if line is a table separator |
| `is_table_row(line)` | Check if line is any table row |
| `parse_table_row(line)` | Extract cell values from a row |
| `detect_caption(lines, start)` | Find table caption |
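The sketch below shows one plausible way to split a row into cells and to look upward for a caption. `split_row`, `find_caption`, and the `max_lookback` parameter are hypothetical names chosen for this example; the signatures and behavior of `parse_table_row` and `detect_caption` in the package may differ.

```python
import re
from typing import Optional

# Assumed caption shape: "Table 3. Results", "Table 3: Results", "Table 3 (Continued)".
_CAPTION = re.compile(r"^\s*Table\s+\d+[.:]?\s+\S", re.IGNORECASE)


def split_row(line: str) -> list:
    """Split a row on unescaped pipes and trim each cell."""
    stripped = line.strip()
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|") and not stripped.endswith("\\|"):
        stripped = stripped[:-1]
    cells = re.split(r"(?<!\\)\|", stripped)
    return [cell.strip().replace("\\|", "|") for cell in cells]


def find_caption(lines: list, table_start: int, max_lookback: int = 3) -> Optional[str]:
    """Search upward from the table's first line for a caption line."""
    first = max(0, table_start - max_lookback)
    for index in range(table_start - 1, first - 1, -1):
        candidate = lines[index].strip()
        if not candidate:
            continue  # blank lines may separate a caption from its table
        if _CAPTION.match(candidate):
            return candidate
        return None  # the nearest non-blank line is ordinary text
    return None
```

Stopping at the first non-blank line keeps a caption that belongs to an earlier table from being attached to a later one.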
Utilities (from cleaner.py notebook)
| Function | Description |
|---|---|
| `clean_column_name(name)` | Remove HTML artifacts from header |
| `headers_match(h1, h2)` | Check if headers are compatible |
| `normalize_headers(headers)` | Canonical form for comparison |
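A rough sketch of header cleanup and comparison follows. It assumes that "compatible" means equal after stripping HTML tags, collapsing whitespace, and ignoring case, which may be stricter or looser than the package's own rules. `tidy_header`, `canonical_headers`, and `same_headers` are hypothetical names.

```python
import re

_TAG = re.compile(r"<[^>]+>")   # matches tags such as <br> or <br/>
_SPACES = re.compile(r"\s+")


def tidy_header(name: str) -> str:
    """Replace HTML tags with a space, then collapse runs of whitespace."""
    without_tags = _TAG.sub(" ", name)
    return _SPACES.sub(" ", without_tags).strip()


def canonical_headers(headers: list) -> tuple:
    """Return a case-insensitive, tag-free form suitable for comparison."""
    return tuple(tidy_header(header).casefold() for header in headers)


def same_headers(first: list, second: list) -> bool:
    """Compare two header rows by their canonical forms."""
    return canonical_headers(first) == canonical_headers(second)
```

Replacing a tag with a space instead of deleting it keeps `Mean<br>(ms)` readable as `Mean (ms)`.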
Development
# Clone and setup
git clone https://github.com/username/markdown-table-extractor
cd markdown-table-extractor
uv sync --all-extras
# Run tests
uv run pytest
# Run linter
uv run ruff check src tests
# Type checking
uv run mypy src
# Edit any module interactively
uv run marimo edit src/markdown_table_extractor/core/parser.py
Project Structure
markdown-table-extractor/
├── pyproject.toml # Package configuration
├── README.md
├── src/
│   └── markdown_table_extractor/
│       ├── __init__.py          # Public API
│       ├── py.typed             # Type hints marker
│       └── core/
│           ├── __init__.py      # Re-exports from notebooks
│           ├── models.py        # 📓 Data classes notebook
│           ├── parser.py        # 📓 Parsing utilities notebook
│           ├── cleaner.py       # 📓 Cleaning utilities notebook
│           ├── merger.py        # 📓 Merging logic notebook
│           └── extractor.py     # 📓 Main extraction notebook
└── tests/
    └── test_extractor.py
Why Marimo Notebooks?
| Traditional | Marimo Literate |
|---|---|
| Code in .py, docs separate | Code + docs in same file |
| Read code, guess intent | Read explanation alongside code |
| Tests in separate files | Interactive tests in notebook |
| Static documentation | Interactive documentation |
| Jupyter: JSON format | Pure Python, Git-friendly |
Why Not nbdev?
| nbdev | marimo |
|---|---|
| Jupyter notebooks (JSON) | Pure Python files |
| Requires export step | Direct import |
| `#\| export` directives | |
| Complex build process | Just Python |
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Edit the notebooks, add tests, and submit PRs.
File details
Details for the file markdown_table_extractor-0.1.2.tar.gz.
File metadata
- Download URL: markdown_table_extractor-0.1.2.tar.gz
- Upload date:
- Size: 46.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19db4f1aa564d0191f8d4448993924830788a479546b86e261fb8bb54e79ee6e |
| MD5 | 86b040a390ee681266d0998db115db7a |
| BLAKE2b-256 | 7801f15a169711fbe88227a6cd8b35fa97b81c6610e294498be978e412a3d484 |
File details
Details for the file markdown_table_extractor-0.1.2-py3-none-any.whl.
File metadata
- Download URL: markdown_table_extractor-0.1.2-py3-none-any.whl
- Upload date:
- Size: 55.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab45284af580c036e5b5b5624fd929dd653c5e09cd890cabe842b82bee0fea78 |
| MD5 | b4c739de3e4f034e630d70b5440e698e |
| BLAKE2b-256 | 5ae190a6cf6add3f92392dd18eb12cbe6650b3e07423e7d1d408e4454e15fe3e |