Skip to main content

A robust, zero-dependency Python library for parsing, validating, and manipulating Markdown tables, including conversion from Excel to Markdown.

Project description

Markdown Spreadsheet Parser

License PyPI PyPI Downloads Repository Build Status

A robust, zero-dependency Python library for parsing, validating, and manipulating Markdown tables.


md-spreadsheet-parser elevates Markdown tables from simple text to first-class data structures. It offers a precise, zero-dependency engine to parse, validate, and manipulate tables with the ease of a spreadsheet and the power of Python.

🚀 Need a quick solution? Check out the Cookbook for copy-pasteable recipes (Excel conversion, Pandas integration, Markdown table manipulation, and more).

Table of Contents

Features

  • Pure Python & Zero Dependencies: Lightweight and portable. Perfect for AWS Lambda Layers and constrained environments. Runs anywhere Python runs, including WebAssembly (Pyodide).
  • Type-Safe Validation: Convert loose Markdown tables into strongly-typed Python dataclasses with automatic type conversion, including customizable boolean logic (I18N) and custom type converters.
  • Markdown as a Database: Treat your Markdown files as Git-managed configuration or master data. Validate schema and types automatically, preventing human error in handwritten tables.
  • Round-Trip Support: Parse to objects, modify data, and generate Markdown back. Perfect for editors.
  • Robust Parsing: Gracefully handles malformed tables (missing/extra columns) and escaped characters.
  • Multi-Table Workbooks: Support for parsing multiple sheets and tables from a single file, including metadata.
  • JSON & Dict Support: Column-level JSON parsing and direct conversion to dict/TypedDict.
  • Pandas Integration: seamlessly create DataFrames from markdown tables.
  • JSON-Friendly: Easy export to dictionaries/JSON for integration with other tools.

Installation

pip install md-spreadsheet-parser

Usage

1. Basic Parsing

Single Table Parse a standard Markdown table into a structured object.

from md_spreadsheet_parser import parse_table

markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

result = parse_table(markdown)

print(result.headers)
# ['Name', 'Age']

print(result.rows)
# [['Alice', '30'], ['Bob', '25']]

Multiple Tables (Workbook) Parse a file containing multiple sheets (sections). By default, it looks for # Tables as the root marker and ## Sheet Name for sheets.

from md_spreadsheet_parser import parse_workbook, MultiTableParsingSchema

markdown = """
# Tables

## Users
| ID | Name |
| -- | ---- |
| 1  | Alice|

## Products
| ID | Item |
| -- | ---- |
| A  | Apple|
"""

# Use default schema
schema = MultiTableParsingSchema()
workbook = parse_workbook(markdown, schema)

for sheet in workbook.sheets:
    print(f"Sheet: {sheet.name}")
    for table in sheet.tables:
        print(table.rows)

Lookup API & Metadata Retrieve sheets and tables directly by name, and access parsed metadata like descriptions.

from md_spreadsheet_parser import parse_workbook

markdown = """
# Tables

## Sales Data

### Q1 Results
Financial performance for the first quarter.

| Year | Revenue |
| ---- | ------- |
| 2023 | 1000    |
"""

workbook = parse_workbook(markdown)

# Access by name
sheet = workbook.get_sheet("Sales Data")
if sheet:
    # Retrieve table by name (from ### Header)
    table = sheet.get_table("Q1 Results")
    
    print(table.description)
    # "Financial performance for the first quarter."
    
    print(table.rows)
    # [['2023', '1000']]

Simple Scan Interface If you want to extract all tables from a document regardless of its structure (ignoring sheets and headers), use scan_tables.

from md_spreadsheet_parser import scan_tables

markdown = """
| ID | Name |
| -- | ---- |
| 1  | Alice|

... text ...

| ID | Item |
| -- | ---- |
| A  | Apple|
"""

# Returns a flat list of all tables found
tables = scan_tables(markdown)
print(len(tables)) # 2

File Loading Helpers

For convenience, you can parse directly from a file path (str or Path) or file-like object using the _from_file variants:

from md_spreadsheet_parser import parse_workbook_from_file

# Clean and easy
workbook = parse_workbook_from_file("data.md")

Available helpers:

  • parse_table_from_file(path_or_file)
  • parse_workbook_from_file(path_or_file)
  • scan_tables_from_file(path_or_file)

2. Type-Safe Validation (Recommended)

The most powerful feature of this library is converting loose markdown tables into strongly-typed Python objects using dataclasses. This ensures your data is valid and easy to work with.

from dataclasses import dataclass
from md_spreadsheet_parser import parse_table, TableValidationError

@dataclass
class User:
    name: str
    age: int
    is_active: bool = True

markdown = """
| Name | Age | Is Active |
|---|---|---|
| Alice | 30 | yes |
| Bob | 25 | no |
"""

try:
    # Parse and validate in one step
    users = parse_table(markdown).to_models(User)
    
    for user in users:
        print(f"{user.name} is {user.age} years old.")
        # Alice is 30 years old.
        # Bob is 25 years old.

except TableValidationError as e:
    print(e)

Features:

  • Type Conversion: Automatically converts strings to int, float, bool using standard rules.
  • Boolean Handling (Default): Supports standard pairs out-of-the-box: true/false, yes/no, on/off, 1/0. (See Advanced Type Conversion for customization).
  • Optional Fields: Handles Optional[T] by converting empty strings to None.
  • Validation: Raises detailed errors if data doesn't match the schema.

Pydantic Integration

For more advanced validation (email format, ranges, regex), you can use Pydantic models instead of dataclasses. This feature is enabled automatically if pydantic is installed.

from pydantic import BaseModel, Field, EmailStr

class User(BaseModel):
    name: str = Field(alias="User Name")
    age: int = Field(gt=0)
    email: EmailStr

# Automatically detects Pydantic model and uses it for validation
users = parse_table(markdown).to_models(User)

The parser respects Pydantic's alias and Field constraints.

3. JSON & Dictionary Conversion

Sometimes you don't want to define a full Dataclass or Pydantic model, or you have columns containing JSON strings.

Simple Dictionary Output Convert tables directly to a list of dictionaries. Keys are derived from headers.

# Returns list[dict[str, Any]] (Values are raw strings)
rows = parse_table(markdown).to_models(dict)
print(rows[0])
# {'Name': 'Alice', 'Age': '30'}

TypedDict Support Use TypedDict for lightweight type safety. The parser uses the type annotations to convert values automatically.

from typing import TypedDict

class User(TypedDict):
    name: str
    age: int
    active: bool

rows = parse_table(markdown).to_models(User)
print(rows[0])
# {'name': 'Alice', 'age': 30, 'active': True}

Column-Level JSON Parsing If a field is typed as dict or list (in a Dataclass or Pydantic model), the parser automatically parses the cell value as JSON.

@dataclass
class Config:
    id: int
    metadata: dict  # Cell: '{"debug": true}' -> Parsed to dict
    tags: list      # Cell: '["a", "b"]'      -> Parsed to list

# Pydantic models also work without Json[] wrapper
class ConfigModel(BaseModel):
    metadata: dict

Limitations:

  • JSON Syntax: The cell content must be valid JSON (e.g. double quotes {"a": 1}). Malformed JSON raises a ValueError.
  • Simple Dict Parsing: to_models(dict) does not automatically parse inner JSON strings unless you use a custom schema. It only creates a shallow dictionary of strings.

4. Pandas Integration & Export

This library is designed to be a bridge between Markdown and Data Science tools like Pandas.

Convert to DataFrame (Easiest Way) The cleanest way to create a DataFrame is using to_models(dict). This returns a list of dictionaries that Pandas can ingest directly.

import pandas as pd
from md_spreadsheet_parser import parse_table

markdown = """
| Date       | Sales | Region |
|------------|-------|--------|
| 2023-01-01 | 100   | US     |
| 2023-01-02 | 150   | EU     |
"""

table = parse_table(markdown)

# 1. Convert to list of dicts
data = table.to_models(dict)

# 2. Create DataFrame
df = pd.DataFrame(data)

# 3. Post-Process: Convert types (Pandas usually infers strings initially)
df["Sales"] = pd.to_numeric(df["Sales"])
df["Date"] = pd.to_datetime(df["Date"])

print(df.dtypes)
# Date      datetime64[ns]
# Sales              int64
# Region            object

Convert from Type-Safe Objects If you want to validate data before creating a DataFrame (e.g., ensuring "Sales" is an integer during parsing), use a dataclass and then convert to Pandas.

from dataclasses import dataclass, asdict

@dataclass
class SalesRecord:
    date: str
    amount: int
    region: str

# 1. Parse and Validate (Raises TableValidationError if invalid)
records = parse_table(markdown).to_models(SalesRecord)

# 2. Convert to DataFrame using asdict()
df = pd.DataFrame([asdict(r) for r in records])

# The 'amount' column is already int64 because validation handled conversion
print(df["amount"].dtype) # int64

JSON Export All result objects (Workbook, Sheet, Table) have a .json property that returns a dictionary structure suitable for serialization.

import json

# Export entire workbook structure
print(json.dumps(workbook.json, indent=2))

5. Markdown Generation (Round-Trip)

You can modify parsed objects and convert them back to Markdown strings using to_markdown(). This enables a complete "Parse -> Modify -> Generate" workflow.

from md_spreadsheet_parser import parse_table, ParsingSchema

markdown = "| A | B |\n|---|---| \n| 1 | 2 |"
table = parse_table(markdown)

# Modify data
table.rows.append(["3", "4"])

# Generate Markdown
# You can customize the output format using a schema
schema = ParsingSchema(require_outer_pipes=True)
print(table.to_markdown(schema))
# | A | B |
# | --- | --- |
# | 1 | 2 |
# | 3 | 4 |

6. Advanced Features

Metadata Extraction Configuration By default, the parser captures table names (level 3 headers) and descriptions. You can customize this behavior with MultiTableParsingSchema.

from md_spreadsheet_parser import MultiTableParsingSchema

schema = MultiTableParsingSchema(
    table_header_level=3,     # Treat ### Header as table name
    capture_description=True  # Capture text between header and table
)
# Pass schema to parse_workbook...

7. Advanced Type Conversion

You can customize how string values are converted to Python objects by passing a ConversionSchema to to_models(). This is useful for internationalization (I18N) and handling custom types.

Internationalization (I18N): Custom Boolean Pairs

Configure which string pairs map to True/False (case-insensitive).

from md_spreadsheet_parser import parse_table, ConversionSchema

markdown = """
| User | Active? |
| --- | --- |
| Tanaka | はい |
| Suzuki | いいえ |
"""

# Configure "はい" -> True, "いいえ" -> False
schema = ConversionSchema(
    boolean_pairs=(("はい", "いいえ"),)
)

users = parse_table(markdown).to_models(User, conversion_schema=schema)
# Tanaka.active is True

Custom Type Converters

Register custom conversion functions for specific types. You can use ANY Python type as a key, including:

  • Built-ins: int, float, bool (to override default behavior)
  • Standard Library: Decimal, datetime, date, ZoneInfo, UUID
  • Custom Classes: Your own data classes or objects

Example using standard library types and a custom class:

from dataclasses import dataclass
from uuid import UUID
from zoneinfo import ZoneInfo
from md_spreadsheet_parser import ConversionSchema, parse_table

@dataclass
class Color:
    r: int
    g: int
    b: int

@dataclass
class Config:
    timezone: ZoneInfo
    session_id: UUID
    theme_color: Color

markdown = """
| Timezone | Session ID | Theme Color |
| --- | --- | --- |
| Asia/Tokyo | 12345678-1234-5678-1234-567812345678 | 255,0,0 |
"""

schema = ConversionSchema(
    custom_converters={
        # Standard Library Types
        ZoneInfo: lambda v: ZoneInfo(v),
        UUID: lambda v: UUID(v),
        # Custom Class
        Color: lambda v: Color(*map(int, v.split(",")))
    }
)

data = parse_table(markdown).to_models(Config, conversion_schema=schema)
# data[0].timezone is ZoneInfo("Asia/Tokyo")
# data[0].theme_color is Color(255, 0, 0)

Field-Specific Converters

For granular control, you can define converters for specific field names, which take precedence over type-based converters.

def parse_usd(val): ...
def parse_jpy(val): ...

schema = ConversionSchema(
    # Type-based defaults (Low priority)
    custom_converters={
        Decimal: parse_usd 
    },
    # Field-name overrides (High priority)
    field_converters={
        "price_jpy": parse_jpy,
        "created_at": lambda x: datetime.strptime(x, "%Y/%m/%d")
    }
)

# price_usd (no override) -> custom_converters (parse_usd)
# price_jpy (override)    -> field_converters (parse_jpy)
data = parse_table(markdown).to_models(Product, conversion_schema=schema)

Standard Converters Library

For common patterns (currencies, lists), you can use the built-in helper functions in md_spreadsheet_parser.converters instead of writing your own.

from md_spreadsheet_parser.converters import (
    to_decimal_clean,        # Handles "$1,000", "¥500" -> Decimal
    make_datetime_converter, # Factory for parse/TZ logic
    make_list_converter,     # "a,b,c" -> ["a", "b", "c"]
    make_bool_converter      # Custom strict boolean sets
)

schema = ConversionSchema(
    custom_converters={
        # Currency: removes $, ¥, €, £, comma, space
        Decimal: to_decimal_clean,
        # DateTime: ISO format default, attach Tokyo TZ if naive
        datetime: make_datetime_converter(tz=ZoneInfo("Asia/Tokyo")),
        # Lists: Split by comma, strip whitespace
        list: make_list_converter(separator=",")
    },
    field_converters={
        # Custom boolean for specific field
        "is_valid": make_bool_converter(true_values=["OK"], false_values=["NG"])
    }
)

8. Robustness (Handling Malformed Tables)

The parser is designed to handle imperfect markdown tables gracefully.

  • Missing Columns: Rows with fewer columns than the header are automatically padded with empty strings.
  • Extra Columns: Rows with more columns than the header are automatically truncated.
from md_spreadsheet_parser import parse_table

markdown = """
| A | B |
|---|---|
| 1 |       <-- Missing column
| 1 | 2 | 3 <-- Extra column
"""

table = parse_table(markdown)

print(table.rows)
# [['1', ''], ['1', '2']]

This ensures that table.rows always matches the structure of table.headers, preventing crashes during iteration or validation.

9. In-Cell Line Break Support

The parser automatically converts HTML line breaks to Python newlines (\n). This enables handling multiline cells naturally.

Supported Tags (Case-Insensitive):

  • <br>
  • <br/>
  • <br />
markdown = "| Line1<br>Line2 |"
table = parse_table(markdown)
# table.rows[0][0] == "Line1\nLine2"

Round-Trip Support: When generating Markdown (e.g., table.to_markdown()), Python newlines (\n) are automatically converted back to <br> tags to preserve the table structure.

To disable this, set convert_br_to_newline=False in ParsingSchema.

10. Performance & Scalability (Streaming API)

Do you really have a 10GB Markdown file?

Probably not. We sincerely hope you don't. Markdown wasn't built for that.

But if you do—perhaps you're generating extensive logs or auditing standard converters—this library has your back. While Excel gives up after 1,048,576 rows, md-spreadsheet-parser supports streaming processing for files of unlimited size, keeping memory usage constant.

scan_tables_iter: This function reads the file line-by-line and yields Table objects as they are found. It does not load the entire file into memory.

from md_spreadsheet_parser import scan_tables_iter

# Process a massive log file (e.g., 10GB)
# Memory usage remains low (only the size of a single table block)
for table in scan_tables_iter("huge_server_log.md"):
    print(f"Found table with {len(table.rows)} rows")
    
    # Process rows...
    for row in table.rows:
        pass

This is ideal for data pipelines, log analysis, and processing exports that are too large to open in standard spreadsheet editors.

11. Programmatic Manipulation

The library provides immutable methods to modify the data structure. These methods return a new instance of the object with the changes applied, keeping the original object unchanged.

Workbook Operations

# Add a new sheet (creates a default table with headers A, B, C)
new_wb = workbook.add_sheet("New Sheet")

# Rename a sheet
new_wb = workbook.rename_sheet(sheet_index=0, new_name("Budget 2024"))

# Delete a sheet
new_wb = workbook.delete_sheet(sheet_index=1)

Sheet Operations

# Rename sheet (direct method)
new_sheet = sheet.rename("Q1 Data")

# Update table metadata
new_sheet = sheet.update_table_metadata(
    table_index=0, 
    name="Expenses", 
    description="Monthly expense report"
)

Table Operations

# Update a cell (automatically expands table if index is out of bounds)
new_table = table.update_cell(row_idx=5, col_idx=2, value="Updated")

# Delete a row (structural delete)
new_table = table.delete_row(row_idx=2)

# Clear column data (keeps headers and row structure, empties cells)
new_table = table.clear_column_data(col_idx=3)

12. Visual Metadata Persistence

The library supports persisting visual state (like column widths and filter settings) without altering the Markdown table structure itself. This is achieved via a hidden HTML comment appended after the table.

| A | B |
|---|---|
| 1 | 2 |

<!-- md-spreadsheet-metadata: {"columnWidths": [100, 200]} -->

This ensures that:

  1. Clean Data: The table remains standard Markdown, readable by any renderer.
  2. Rich State: Compatible tools (like our VS Code Extension) can read the comment to restore UI state (column widths, hidden columns, etc.).
  3. Robustness: The parser automatically associates this metadata with the preceding table, even if separated by blank lines.

Command Line Interface (CLI)

You can use the md-spreadsheet-parser command to parse Markdown files and output JSON. This is useful for piping data to other tools.

# Read from file
md-spreadsheet-parser input.md

# Read from stdin (pipe)
cat input.md | md-spreadsheet-parser

Options:

  • --scan: Scan for all tables ignoring workbook structure (returns a list of tables).
  • --root-marker: Set the root marker (default: # Tables).
  • --sheet-header-level: Set sheet header level (default: 2).
  • --table-header-level: Set table header level (default: 3).
  • --capture-description: Capture table descriptions (default: True).
  • --column-separator: Character used to separate columns (default: |).
  • --header-separator-char: Character used in the separator row (default: -).
  • --no-outer-pipes: Allow tables without outer pipes (default: False).
  • --no-strip-whitespace: Do not strip whitespace from cell values (default: False).
  • --no-br-conversion: Disable automatic conversion of <br> tags to newlines (default: False).

Configuration

Customize parsing behavior using ParsingSchema and MultiTableParsingSchema.

Option Default Description
column_separator | Character used to separate columns.
header_separator_char - Character used in the separator row.
require_outer_pipes True If True, generated markdown tables will include outer pipes.
strip_whitespace True If True, whitespace is stripped from cell values.
convert_br_to_newline True If True, <br> tags are converted to \n (and back).
root_marker # Tables (MultiTable) Marker indicating start of data section.
sheet_header_level 2 (MultiTable) Header level for sheets.
table_header_level 3 (MultiTable) Header level for tables.
capture_description True (MultiTable) Capture text between header and table.

Future Roadmap

This parser is the core foundation of a new ecosystem: Text-Based Spreadsheet Management.

We are preparing to release a Rich VS Code Extension that acts as a full GUI Spreadsheet Editor for Markdown files.

[!NOTE] Coming Soon: The official release of the VS Code Extension will be announced here alongside v1.0.0 of this library. Stay tuned!

The Vision: "Excel-like UX, Git-native Data" By combining a high-performance editor with this robust parser, we aim to solve the long-standing problem of managing binary spreadsheet files in software projects.

  • For Humans: Edit data with a comfortable, familiar UI (cell formatting, improved navigation, visual feedback).
  • For Machines: Data is saved as clean, diff-able Markdown that this library can parse, validate, and convert into Python objects instantaneously.

Upcoming Features We are actively working on:

  • Visual Styling: Conditional formatting and highlighters.
  • Rich Data Hints: Define column behaviors (Dropdowns, Date Pickers) that the editor enforces and the parser validates.

License

This project is licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

md_spreadsheet_parser-0.4.3.tar.gz (105.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

md_spreadsheet_parser-0.4.3-py3-none-any.whl (32.9 kB view details)

Uploaded Python 3

File details

Details for the file md_spreadsheet_parser-0.4.3.tar.gz.

File metadata

  • Download URL: md_spreadsheet_parser-0.4.3.tar.gz
  • Upload date:
  • Size: 105.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.16 {"installer":{"name":"uv","version":"0.9.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for md_spreadsheet_parser-0.4.3.tar.gz
Algorithm Hash digest
SHA256 68fca3798efe6207e13095e368798be24a5841691ee7917eea59b634dcf7f1f1
MD5 5ed21b2d72d437cd0462e990a7451e0e
BLAKE2b-256 a3c6bc164be53a65fad7f10dfa707e826bc15bf5cc29eebff6d6c9b4e884785c

See more details on using hashes here.

File details

Details for the file md_spreadsheet_parser-0.4.3-py3-none-any.whl.

File metadata

  • Download URL: md_spreadsheet_parser-0.4.3-py3-none-any.whl
  • Upload date:
  • Size: 32.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.16 {"installer":{"name":"uv","version":"0.9.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for md_spreadsheet_parser-0.4.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b10a2ce4de2e8184afe0381d955b101b64a02059c1e04352523b18f15bb1cc69
MD5 59d631e579334453daf850269e7421eb
BLAKE2b-256 5361e672f8d3544acf89f5fd3804e472419955e1596b71cc024bded85d22a077

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page