md2pydantic

Extract structured data from messy Markdown into Pydantic v2 models.

Built for resilience against common LLM output quirks: triple-backtick wrappers, trailing prose, incomplete tables, malformed JSON, and more. One line of code turns chaotic Markdown into validated, typed Python objects.

Features

  • One-liner API -- MDConverter(Model).parse_tables(md) gets you started in one line
  • Markdown tables -- pipe-delimited tables become lists of Pydantic models
  • JSON blocks -- fenced and inline JSON, with recovery for trailing commas, single quotes, unquoted keys, and truncated output
  • YAML blocks -- fenced YAML code blocks (requires pyyaml)
  • Auto-detect -- parse() tries code blocks first, then tables
  • Yes/No bool coercion -- "Yes", "No", "Y", "N", "true", "false", "on", "off" all map to bool
  • Null sentinel handling -- empty cells, "N/A", "NA", "null", "-", "—" become None for optional fields
  • Table selection -- filter tables by heading or index in multi-table documents
  • LLM-resilient -- handles unclosed code fences, trailing prose, extra backticks, and nested structures
  • Pydantic v2 native -- leverages Pydantic's own type coercion (str to int, str to float, str to datetime, etc.)
  • Lightweight -- the only required dependency is pydantic>=2.0.0; pyyaml and pandas are optional extras

Installation

pip install md2pydantic

Or with uv:

uv add md2pydantic

Optional extras:

pip install md2pydantic[yaml]    # YAML block support (pyyaml)
pip install md2pydantic[pandas]  # DataFrame conversion (pandas)

Requires Python 3.10+.

Quick Start

Parse a Markdown Table

from pydantic import BaseModel
from md2pydantic import MDConverter

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

markdown = """
Here are the products currently available:

| name       | price | in_stock |
|------------|-------|----------|
| Widget     | 9.99  | Yes      |
| Gadget     | 24.50 | No       |
"""

products = MDConverter(Product).parse_tables(markdown)
# [Product(name='Widget', price=9.99, in_stock=True),
#  Product(name='Gadget', price=24.5, in_stock=False)]

Pydantic handles the str to float coercion. md2pydantic handles "Yes" / "No" to bool.
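To make the Yes/No step concrete, here is a minimal standalone sketch of that kind of token-to-bool mapping. It is not md2pydantic's implementation: the function name coerce_bool, the case-insensitive comparison, and the choice to raise ValueError on unknown tokens are all assumptions made for illustration.

```python
# Illustrative sketch only; md2pydantic's real coercion logic may differ.
TRUTHY = {"yes", "y", "true", "on"}
FALSY = {"no", "n", "false", "off"}


def coerce_bool(cell: str) -> bool:
    """Map a table cell to a bool, ignoring case and surrounding whitespace."""
    token = cell.strip().lower()
    if token in TRUTHY:
        return True
    if token in FALSY:
        return False
    # Assumption: unknown tokens are an error rather than silently False.
    raise ValueError(f"not a recognised boolean token: {cell!r}")


print(coerce_bool(" Yes "))  # True
print(coerce_bool("off"))    # False
```

Rejecting unknown tokens keeps a typo such as "Yse" from being read as False without notice.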

Parse a JSON Block

from pydantic import BaseModel
from md2pydantic import MDConverter

class ServerConfig(BaseModel):
    host: str
    port: int
    debug: bool

markdown = '''Sure! Here is the server configuration:

```json
{
    "host": "localhost",
    "port": 8080,
    "debug": true,
}
```

Let me know if you need anything else!
'''

config = MDConverter(ServerConfig).parse_json(markdown)
# ServerConfig(host='localhost', port=8080, debug=True)


Notice the trailing comma after `true` -- md2pydantic fixes that automatically.
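For intuition, a trailing-comma repair can be sketched with the standard library alone. This is a deliberately naive illustration, not the library's recovery code: strip_trailing_commas is a hypothetical helper, and the regex would also rewrite a matching sequence that happens to sit inside a string value.

```python
import json
import re


def strip_trailing_commas(text: str) -> str:
    """Drop any comma that directly precedes a closing brace or bracket.

    Naive by design: it does not check whether the comma is inside a
    JSON string literal.
    """
    return re.sub(r",(\s*[}\]])", r"\1", text)


raw = '{"host": "localhost", "port": 8080, "debug": true,}'
config = json.loads(strip_trailing_commas(raw))
print(config)  # {'host': 'localhost', 'port': 8080, 'debug': True}
```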

Parse a YAML Block

from pydantic import BaseModel
from md2pydantic import MDConverter

class ServerConfig(BaseModel):
    host: str
    port: int
    debug: bool

markdown = '''Here is your config:

```yaml
host: api.example.com
port: 443
debug: false
```
'''

config = MDConverter(ServerConfig).parse_yaml(markdown)
# ServerConfig(host='api.example.com', port=443, debug=False)

Requires pyyaml: install with pip install md2pydantic[yaml].

Auto-Detect Format

from md2pydantic import MDConverter

# ServerConfig and markdown are the model and input from the examples above.
# parse() tries JSON/YAML code blocks first, then falls back to tables
result = MDConverter(ServerConfig).parse(markdown)

Returns a single model instance for code blocks, or a list for tables and JSON arrays.

Select Tables by Heading

When a document contains multiple tables, filter by the preceding Markdown heading:

from pydantic import BaseModel
from md2pydantic import MDConverter

class User(BaseModel):
    name: str
    age: int
    active: bool

markdown = """
## Current Staff

| name  | age | active |
|-------|-----|--------|
| Alice | 30  | Yes    |

## Former Staff

| name  | age | active |
|-------|-----|--------|
| Bob   | 25  | No     |
| Eve   | 35  | No     |
"""

current = MDConverter(User).parse_tables(markdown, heading="Current Staff")
# [User(name='Alice', age=30, active=True)]

former = MDConverter(User).parse_tables(markdown, heading="Former Staff")
# [User(name='Bob', age=25, active=False), User(name='Eve', age=35, active=False)]

Heading matching is case-insensitive and supports substring matches. You can also select by index with index=0.
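The matching rule can be pictured with a small standalone sketch that keeps only tables whose nearest preceding heading contains the query. The function tables_under_heading is hypothetical and works on raw lines; it is one way to read the rule described here, not the library's scanner.

```python
def tables_under_heading(markdown: str, heading: str) -> list[list[str]]:
    """Return each table (as raw lines) whose nearest preceding heading
    contains `heading`, compared case-insensitively."""
    wanted = heading.lower()
    current_heading = ""
    tables: list[list[str]] = []
    block: list[str] = []
    # The extra "" acts as a sentinel so a table at the very end is flushed.
    for line in markdown.splitlines() + [""]:
        stripped = line.strip()
        if stripped.startswith("|"):
            block.append(stripped)
            continue
        if block:
            # A non-table line ends the current table; keep it if it matches.
            if wanted in current_heading.lower():
                tables.append(block)
            block = []
        if stripped.startswith("#"):
            current_heading = stripped.lstrip("#").strip()
    return tables


doc = """
## Current Staff

| name  | age |
|-------|-----|
| Alice | 30  |

## Former Staff

| name | age |
|------|-----|
| Bob  | 25  |
"""

former = tables_under_heading(doc, "former")
print(len(former))    # 1
print(former[0][-1])  # | Bob  | 25  |
```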

Handle Null Sentinels

Empty cells and common null placeholders become None for optional fields:

from pydantic import BaseModel
from md2pydantic import MDConverter

class Employee(BaseModel):
    name: str
    department: str
    salary: float | None = None

markdown = """
| name  | department  | salary |
|-------|-------------|--------|
| Alice | Engineering | 95000  |
| Bob   | Marketing   | N/A    |
| Carol | Sales       | -      |
"""

employees = MDConverter(Employee).parse_tables(markdown)
# employees[0].salary == 95000.0
# employees[1].salary is None  (from "N/A")
# employees[2].salary is None  (from "-")

Recognized null sentinels: "" (empty), "N/A", "NA", "null", "-", "—". Matching is case-insensitive.
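As a rough illustration of sentinel handling, the sketch below maps the listed placeholders to None before any type coercion would run. The helper normalise_cell is hypothetical, and unlike the library it does not check whether the target field is optional.

```python
from __future__ import annotations

# The sentinels listed above, lower-cased; "\u2014" is the em dash.
NULL_SENTINELS = {"", "n/a", "na", "null", "-", "\u2014"}


def normalise_cell(cell: str) -> str | None:
    """Return None for a null placeholder, otherwise the trimmed cell text."""
    token = cell.strip()
    return None if token.lower() in NULL_SENTINELS else token


print(normalise_cell(" N/A "))  # None
print(normalise_cell("95000"))  # 95000
```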

Error Handling

from md2pydantic import MDConverter, ExtractionError

try:
    result = MDConverter(MyModel).parse_tables("no tables here")
except ExtractionError as e:
    print(e)            # Human-readable summary with line numbers
    print(e.errors)     # List of typed error details

ExtractionError is raised when:

  • No structured data is found in the input
  • Structured data is found but none of it validates against the model

Each error in .errors is either a TransformError (parsing failed) or ModelValidationError (Pydantic rejected the data), both with source location info.

Partial Results

When parsing tables with mixed valid/invalid rows, use partial=True to get both:

from md2pydantic import MDConverter, PartialResult

result = MDConverter(User).parse_tables(markdown, partial=True)
# result.data → list of valid User instances
# result.errors → list of typed errors with row locations
# result.has_errors → True if any rows failed

ExtractionError inherits from MD2PydanticError, so you can catch either.
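The partial-result idea, keeping the rows that validate while recording the ones that do not, can be sketched without the library. Partial, RowError, and parse_ages below are hypothetical stand-ins chosen for illustration; they mirror the .data, .errors, and .has_errors attributes described above but are not md2pydantic's classes.

```python
from dataclasses import dataclass, field


@dataclass
class RowError:
    row: int       # 0-based index of the failing data row
    message: str


@dataclass
class Partial:
    data: list = field(default_factory=list)
    errors: list = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)


def parse_ages(rows: list[dict[str, str]]) -> Partial:
    """Convert each row's age to int, collecting failures instead of raising."""
    result = Partial()
    for i, row in enumerate(rows):
        try:
            result.data.append({"name": row["name"], "age": int(row["age"])})
        except (KeyError, ValueError) as exc:
            result.errors.append(RowError(row=i, message=str(exc)))
    return result


result = parse_ages([
    {"name": "Alice", "age": "30"},
    {"name": "Bob", "age": "thirty"},
])
print(result.data)        # [{'name': 'Alice', 'age': 30}]
print(result.has_errors)  # True
```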

Supported Formats

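| Format          | Method         | Fenced | Inline | Recovery                                                      |
|-----------------|----------------|--------|--------|---------------------------------------------------------------|
| Markdown tables | parse_tables() | --     | Yes    | Padded/truncated columns                                      |
| JSON            | parse_json()   | Yes    | Yes    | Trailing commas, single quotes, unquoted keys, truncated JSON |
| YAML            | parse_yaml()   | Yes    | --     | --                                                            |
| Auto-detect     | parse()        | Yes    | Yes    | All of the above                                              |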

API Reference

MDConverter(model)

Create a converter bound to a Pydantic v2 BaseModel subclass.

converter = MDConverter(MyModel)

converter.parse_tables(markdown, *, index=None, heading=None) -> list[T]

Extract Markdown tables and return validated model instances (one per row).

  • index -- only parse the table at this 0-based position (applied after heading filter)
  • heading -- only parse tables under headings matching this substring (case-insensitive)
  • partial -- when True, return a PartialResult with .data, .errors, and .has_errors (see Partial Results)
  • Raises ExtractionError if no tables are found or no rows validate

converter.parse_json(markdown) -> T

Extract a JSON code block and return a single validated model instance. Tries each JSON block in document order, returning the first that validates.

  • Raises ExtractionError if no JSON blocks are found or none validate

converter.parse_yaml(markdown) -> T

Extract a YAML code block and return a single validated model instance.

  • Raises ExtractionError if no YAML blocks are found or none validate
  • Requires pyyaml (pip install md2pydantic[yaml])

converter.parse(markdown) -> T | list[T]

Auto-detect format. Tries code blocks (JSON/YAML) first, then tables.

  • Raises ExtractionError if no structured data is found or none validates

Exceptions

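| Exception        | Parent           | Description                                                |
|------------------|------------------|------------------------------------------------------------|
| MD2PydanticError | Exception        | Base exception for the library                             |
| ExtractionError  | MD2PydanticError | No data found or validation failed. Has .errors attribute. |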

How It Works

md2pydantic follows a Seek, Clean, Validate pipeline:

  1. Scanner -- Uses regex and heuristics to identify candidate blocks (JSON, YAML, Markdown tables) within the input. Handles triple-backtick enclosures, unclosed fences, and trailing prose.

  2. Transformer -- Converts raw blocks into Python dictionaries. Fixes malformed JSON (trailing commas, single quotes, unquoted keys, truncated output). Converts table rows into dicts using headers as keys.

  3. Validator -- Passes dictionaries to your Pydantic model. Pre-processes Yes/No booleans and null sentinels before handing off to Pydantic's native coercion engine.
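The Transformer's table step (rows into dicts keyed by header) can be sketched in a few lines. This is an assumption-laden illustration rather than the library's code: table_to_dicts is a hypothetical helper, it reads "padded/truncated columns" as padding short rows with empty cells and dropping surplus cells, and it does not handle escaped pipes or empty leading cells written as "||".

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-delimited row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def table_to_dicts(lines: list[str]) -> list[dict[str, str]]:
    """Turn table lines (header, separator, data rows) into header-keyed dicts."""
    headers = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # skip the header and the |---| separator
        cells = split_row(line)
        # Pad short rows with "" and drop cells beyond the header count.
        cells = (cells + [""] * len(headers))[: len(headers)]
        rows.append(dict(zip(headers, cells)))
    return rows


table = [
    "| name  | age | active |",
    "|-------|-----|--------|",
    "| Alice | 30  |",
    "| Bob   | 25  | No | extra |",
]
for row in table_to_dicts(table):
    print(row)
# {'name': 'Alice', 'age': '30', 'active': ''}
# {'name': 'Bob', 'age': '25', 'active': 'No'}
```

The resulting dicts still hold strings; in the pipeline described above, converting "30" to an int is left to the Validator stage.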

Development

git clone https://github.com/FelipeMorandini/md2pydantic.git
cd md2pydantic
uv sync --extra dev

uv run pytest              # run tests
uv run ruff check .        # lint
uv run ruff format .       # format
uv run mypy src/md2pydantic  # type check

See CONTRIBUTING.md for more details.

License

MIT
