markdown-to-data

Convert markdown and its elements (tables, lists, code, etc.) into structured, easily processable data formats like lists and hierarchical dictionaries (or JSON), with support for parsing back to markdown.

[WIP] This project is still a work in progress and at an early stage. Its functionality is limited (see Status).

Status

  • Detect, extract and convert markdown building blocks into Python data structures
  • Provide two formats for parsed markdown:
    • List format: Each building block as separate dictionary in a list
    • Dictionary format: Nested structure using headers as keys
  • Convert parsed markdown to JSON
  • Parse markdown data back to markdown formatted string
    • Add options to control which data gets parsed back to markdown
  • Extract specific building blocks (e.g., only tables or lists)
  • Provide comprehensive documentation
  • Add more test coverage --> 134 test cases
  • Publish on PyPI

Quick Overview

Installation

(!NOT WORKING! CURRENTLY NOT ON PyPI!)

pip install markdown-to-data

Basic Usage

from markdown_to_data import Markdown

markdown = """
---
title: Example text
author: John Doe
---

# Main Header

- Item 1
- Item 2
    - Subitem 1

## Table Example
| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |
"""

md = Markdown(markdown)

# Get parsed markdown as list
print(md.md_list)
# Each building block is a separate dictionary in the list

# Get parsed markdown as nested dictionary
print(md.md_dict)
# Headers are used as keys for nesting content

# Get an overview of the markdown elements in the file: how often each occurs, at which positions, and in which variants
print(md.md_elements)

# Get the nested dictionary as a JSON string
print(md.to_json(indent=4))

# Extract specific building blocks
print(md.get_md_building_blocks(blocks=['table']))

Output Formats

List Format (md.md_list)

[
    {'metadata': {'title': 'Example text', 'author': 'John Doe'}},
    {'h1': 'Main Header'},
    {
        'list': {
            'type': 'ul',
            'list': ['Item 1', {'Item 2': ['Subitem 1']}]
        }
    },
    {'h2': 'Table Example'},
    {'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]}
]

Dictionary Format (md.md_dict)

{
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'list': {
            'type': 'ul',
            'list': ['Item 1', {'Item 2': ['Subitem 1']}]
        },
        'Table Example': {
            'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]
        }
    }
}
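
Because headers become keys, a nested section can be reached by following the header path. The snippet below is a minimal sketch that works on a literal copy of the md_dict shown above; `get_by_path` is a hypothetical helper for illustration and is not part of the library.

```python
# Literal copy of the md_dict shown above, so the sketch runs without the library.
md_dict = {
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'list': {'type': 'ul', 'list': ['Item 1', {'Item 2': ['Subitem 1']}]},
        'Table Example': {
            'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]
        },
    },
}


def get_by_path(data, *keys, default=None):
    """Follow a sequence of keys through nested dictionaries (hypothetical helper)."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


table = get_by_path(md_dict, 'Main Header', 'Table Example', 'table')
missing = get_by_path(md_dict, 'Main Header', 'No Such Section')
print(table)
print(missing)
```

Returning a default instead of raising keeps lookups safe when a document lacks an expected section.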

MD Elements (md.md_elements)

Get information about all markdown elements in the markdown file. The output is based on md_list and can be used for:

  • Creating a table of contents based on headings
  • Finding specific elements by their positions in md_list
  • Jumping to specific sections in md_list
  • Checking if required elements are present
  • Understanding the document's composition and complexity
  • Identifying patterns in document structure

{
    'metadata': {'count': 1, 'positions': [0], 'variants': set()},
    'h1': {'count': 1, 'positions': [1], 'variants': set()},
    'list': {'count': 1, 'positions': [2], 'variants': {'ul'}},
    'h2': {'count': 1, 'positions': [3], 'variants': set()},
    'table': {'count': 1, 'positions': [4], 'variants': set()}
}
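
As an illustration of the first use case, a table of contents can be derived from the header blocks and their positions. This is a sketch under the assumption that md_list has the shape shown in the List Format section; `build_toc` is a hypothetical helper, not a library function.

```python
import re

# Literal copy of the md_list shown earlier, so the sketch runs without the library.
md_list = [
    {'metadata': {'title': 'Example text', 'author': 'John Doe'}},
    {'h1': 'Main Header'},
    {'list': {'type': 'ul', 'list': ['Item 1', {'Item 2': ['Subitem 1']}]}},
    {'h2': 'Table Example'},
    {'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]},
]

HEADER_KEY = re.compile(r'^h([1-6])$')


def build_toc(blocks):
    """Return (level, title, position) for every header block (hypothetical helper)."""
    toc = []
    for position, block in enumerate(blocks):
        for key, value in block.items():
            match = HEADER_KEY.match(key)
            if match:
                toc.append((int(match.group(1)), value, position))
    return toc


toc = build_toc(md_list)
for level, title, position in toc:
    # Indent each entry by its header level and show where it sits in md_list.
    print('  ' * (level - 1) + f'- {title} (md_list[{position}])')
```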

JSON (md.to_json(indent=4))

Converts the md_dict to a JSON string. The indent argument sets the indentation of the output.

{
    "metadata": {"title": "Example text", "author": "John Doe"},
    "Main Header": {
        "list": {
            "type": "ul",
            "list": ["Item 1", {"Item 2": ["Subitem 1"]}]
        },
        "Table Example": {
            "table": [{"Column 1": "Cell 1", "Column 2": "Cell 2"}]
        }
    }
}
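
Since the result is a JSON string, it can be loaded back with the standard library. The sketch below uses `json.dumps` on a literal copy of md_dict as a stand-in for `md.to_json(indent=4)`; whether the library uses `json.dumps` internally is an assumption.

```python
import json

# Literal copy of the md_dict from the Basic Usage example.
md_dict = {
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'list': {'type': 'ul', 'list': ['Item 1', {'Item 2': ['Subitem 1']}]},
        'Table Example': {
            'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]
        },
    },
}

# Stand-in for md.to_json(indent=4) (assumption: equivalent to json.dumps).
json_string = json.dumps(md_dict, indent=4)

# Round trip: the JSON string loads back into an equal dictionary.
restored = json.loads(json_string)
print(restored['Main Header']['Table Example']['table'][0]['Column 1'])
```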

Building blocks (md.get_md_building_blocks(blocks=['table']))

[
    {
        'table': [
            {
                'Column 1': 'Cell 1',
                'Column 2': 'Cell 2'
            }
        ]
    }
]
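
Each table is a list of row dictionaries keyed by column name, which maps directly onto `csv.DictWriter`. The following sketch assumes the block shape shown above; `table_to_csv` is a hypothetical helper, not part of the library.

```python
import csv
import io

# Literal copy of the extracted building blocks shown above.
blocks = [{'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]}]


def table_to_csv(rows):
    """Serialize a list of row dictionaries to CSV text (hypothetical helper)."""
    if not rows:
        return ''
    buffer = io.StringIO()
    # Column order follows the keys of the first row.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = table_to_csv(blocks[0]['table'])
print(csv_text)
```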

Parse back to markdown (to_md)

The Markdown class also provides a method, to_md, that converts the parsed markdown elements back into a markdown-formatted string. It accepts several arguments that control the output.

from markdown_to_data import Markdown

markdown = """
---
title: Example text
author: John Doe
---

# Main Header

- Item 1
- Item 2
    - Subitem 1

## Table Example
| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |
"""

md = Markdown(markdown)

Example 1: include selected elements and exclude by index

print(md.to_md(
    include=['headers', 'list', 4], # markdown elements to include (here: all headers, the list and the fifth element)
    exclude=[1], # the default value is None; markdown elements are excluded based on the entries in this argument
    spacer=1 # the default value; defines how many empty lines are added after each markdown element
))

Output:

# Main Header

- Item 1
- Item 2
  - Subitem 1

| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |

Example 2: exclude overwrites include and two spacers

print(md.to_md(
    include=['all'], # the default value; will include all markdown elements
    exclude=['h2', 3], # will overwrite `include` and exclude h2 headers and the fourth element (here: the list)
    spacer=2 # adds two empty lines after each parsed markdown element
))

Output:

---
title: Example text
author: John Doe
---


# Main Header


| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |

Example 3: exclude = ['all'] excludes everything and returns an empty line

print(md.to_md(
    include=['h1', 'list', 'table'],
    exclude=['all'], # will overwrite `include` and exclude all markdown elements
    spacer=1
))

Output:


to_md_parser function

Note: you can use the function to_md_parser to parse a list of dictionaries of markdown elements to markdown.

from markdown_to_data import to_md_parser

example = [
    {
        'metadata': {
            'title': 'Test Document',
            'date': '2024-01-01'
        }
    },
    {'h1': 'Main Title'},
    {'paragraph': 'Sample paragraph'},
    {'h2': 'Subtitle'},
    {'list': {
        'type': 'ul',
        'list': ['Item 1', 'Item 2']
    }}
]

markdown_string = to_md_parser(data=example, spacer=1)

print(markdown_string)

Output:

---
title: Test Document
date: 2024-01-01
---

# Main Title

Sample paragraph

## Subtitle

- Item 1
- Item 2

Supported Markdown Elements

Metadata (YAML frontmatter)

A metadata block can only appear once in the markdown file and must be at the beginning.

metadata = '''
---
title: Document
author: John Doe
date: 2023-12-20
---
'''

md_metadata = Markdown(metadata)
print(md_metadata.md_list)
print(md_metadata.md_dict)

`md_list`

[
    {
        'metadata': {
            'title': 'Document',
            'author': 'John Doe',
            'date': '2023-12-20'
        }
    }
]

`md_dict`

{'metadata': {'title': 'Document', 'author': 'John Doe', 'date': '2023-12-20'}}

Headers (h1-h6)

headers = '''
# Heading level 1

## Heading level 2

## Heading level 2

### Heading level 3

# Heading level 1 again
'''

md_headers = Markdown(headers)
print(md_headers.md_list)
print(md_headers.md_dict)

`md_list`

[
    {'h1': 'Heading level 1'},
    {'h2': 'Heading level 2'},
    {'h2': 'Heading level 2'},
    {'h3': 'Heading level 3'},
    {'h1': 'Heading level 1 again'}
]

`md_dict`

{
    'Heading level 1': {'Heading level 2': {'Heading level 3': {}}},
    'Heading level 1 again': {}
}
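
The nesting shown above can be reproduced with a small stack-based routine. This is a sketch of one possible approach, not the library's actual code: `nest_by_headers` is a hypothetical helper, and the choices to merge repeated header titles and to number repeated non-header keys (`list`, `list2`, ...) are assumptions inferred from the outputs in this document.

```python
import re

HEADER_KEY = re.compile(r'^h([1-6])$')


def nest_by_headers(blocks):
    """Build a nested dictionary from a flat block list (hypothetical helper)."""
    root = {}
    stack = [(0, root)]  # (header level, dictionary collecting that section)
    for block in blocks:
        for key, value in block.items():
            match = HEADER_KEY.match(key)
            if match:
                level = int(match.group(1))
                # Close every open section at the same or a deeper level.
                while stack[-1][0] >= level:
                    stack.pop()
                # setdefault merges headers that repeat under the same parent.
                section = stack[-1][1].setdefault(value, {})
                stack.append((level, section))
            else:
                target = stack[-1][1]
                name, counter = key, 1
                while name in target:  # 'list', 'list2', 'list3', ...
                    counter += 1
                    name = f'{key}{counter}'
                target[name] = value
    return root


headers_md_list = [
    {'h1': 'Heading level 1'},
    {'h2': 'Heading level 2'},
    {'h2': 'Heading level 2'},
    {'h3': 'Heading level 3'},
    {'h1': 'Heading level 1 again'},
]
nested = nest_by_headers(headers_md_list)
print(nested)
```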

Lists (ordered and unordered with nesting)

lists = '''
- item 1
- item 2
    - subitem 1
    - subitem 2
- item 3

1. item 1
2. item 2
3. item 3
'''

md_lists = Markdown(lists)
print(md_lists.md_list)
print(md_lists.md_dict)

`md_list`

[
    {
        'list': {
            'type': 'ul',
            'list': [
                'item 1',
                {'item 2': ['subitem 1', 'subitem 2']},
                'item 3'
            ]
        }
    },
    {'list': {'type': 'ol', 'list': ['item 1', 'item 2', 'item 3']}}
]

`md_dict`

{
    'list': {
        'type': 'ul',
        'list': [
            'item 1',
            {'item 2': ['subitem 1', 'subitem 2']},
            'item 3'
        ]
    },
    'list2': {'type': 'ol', 'list': ['item 1', 'item 2', 'item 3']}
}
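
Nested items are represented as single-key dictionaries that map a parent item to its children. A short recursive generator can flatten this structure into (depth, text) pairs; `flatten_items` is a hypothetical helper written for this illustration and assumes the shape shown above.

```python
def flatten_items(items, depth=0):
    """Yield (depth, text) for every item in a nested list structure (hypothetical helper)."""
    for item in items:
        if isinstance(item, dict):
            # A dictionary maps a parent item to the list of its children.
            for parent, children in item.items():
                yield depth, parent
                yield from flatten_items(children, depth + 1)
        else:
            yield depth, item


ul_block = {
    'type': 'ul',
    'list': ['item 1', {'item 2': ['subitem 1', 'subitem 2']}, 'item 3'],
}
flat = list(flatten_items(ul_block['list']))
for depth, text in flat:
    print(depth, text)
```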

Tables

tables = '''
| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |

| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |
'''

md_tables = Markdown(tables)
print(md_tables.md_list)
print(md_tables.md_dict)

`md_list`

[
    {'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]},
    {'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]}
]

`md_dict`

{
    'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}],
    'table2': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]
}

Code blocks (with language detection)

code = '''
´´´
{
    'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}],
    'table2': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]
}
´´´

´´´python
def hello():
    print('Hello World!')
´´´
'''

md_code = Markdown(code)
print(md_code.md_list)
print(md_code.md_dict)

`md_list`

[
    {
        'code': {
            'language': None,
            'content': "    'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}],\n    'table2': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]\n}"
        }
    },
    {
        'code': {
            'language': 'python',
            'content': "def hello():\n    print('Hello World!')"
        }
    }
]

`md_dict`

{
    'code': {
        'language': '{',
        'content': "    'table': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}],\n    'table2': [{'Column 1': 'Cell 1', 'Column 2': 'Cell 2'}]\n}"
    },
    'code2': {
        'language': 'python',
        'content': "def hello():\n    print('Hello World!')"
    }
}

Definition lists

def_lists = '''
term 1
: definition 1
: definition 2

term 2
: definition 1
: definition 2
'''

md_def_lists = Markdown(def_lists)
print(md_def_lists.md_list)
print(md_def_lists.md_dict)

`md_list`

[
    {
        'def_list': {
            'term': 'term 1',
            'list': ['definition 1', 'definition 2']
        }
    },
    {
        'def_list': {
            'term': 'term 2',
            'list': ['definition 1', 'definition 2']
        }
    }
]

`md_dict`

{
    'def_list': {
        'term': 'term 1',
        'list': ['definition 1', 'definition 2']
    },
    'def_list2': {
        'term': 'term 2',
        'list': ['definition 1', 'definition 2']
    }
}
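
Because every definition list carries its term and definitions in the same two fields, the blocks from md_list can be collected into a simple glossary mapping. The sketch below works on a literal copy of the md_list shown above; the name `glossary` is chosen for illustration only.

```python
# Literal copy of the md_list shown above for the definition lists.
md_list = [
    {'def_list': {'term': 'term 1', 'list': ['definition 1', 'definition 2']}},
    {'def_list': {'term': 'term 2', 'list': ['definition 1', 'definition 2']}},
]

glossary = {}
for block in md_list:
    entry = block.get('def_list')
    if entry is not None:
        # Definitions for a repeated term are appended to the same entry.
        glossary.setdefault(entry['term'], []).extend(entry['list'])

print(glossary)
```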

Blockquotes

blockquotes = '''
> a single line blockquote

> a nested blockquote
> with multiline
>> the nested part
> last line of the blockquote
'''

md_blockquotes = Markdown(blockquotes)
print(md_blockquotes.md_list)
print(md_blockquotes.md_dict)

`md_list`

[
    {'blockquote': ['a single line blockquote']},
    {
        'blockquote': [
            'a nested blockquote',
            {'with multiline': ['the nested part']},
            'last line of the blockquote'
        ]
    }
]

`md_dict`

{
    'blockquote': ['a single line blockquote'],
    'blockquote2': [
        'a nested blockquote',
        {'with multiline': ['the nested part']},
        'last line of the blockquote'
    ]
}

Paragraphs

paragraphs = '''
A paragraph
a second paragraph

a paragraph after a empty row
'''

md_paragraphs = Markdown(paragraphs)
print(md_paragraphs.md_list)
print(md_paragraphs.md_dict)

`md_list`

[
    {'paragraph': 'A paragraph'},
    {'paragraph': 'a second paragraph'},
    {'paragraph': 'a paragraph after a empty row'}
]

`md_dict`

{
    'paragraph': 'A paragraph',
    'paragraph2': 'a second paragraph',
    'paragraph3': 'a paragraph after a empty row'
}

Separator

As described in the Metadata example, a metadata block must appear at the very beginning of a markdown file. Anywhere later in the file, a line consisting of three hyphens (---) is classified as a separator.

separator = '''
---
'''

md_separator = Markdown(separator)
print(md_separator.md_list)
print(md_separator.md_dict)

`md_list`

[
    {'separator': '---'}
]

`md_dict`

{
    'separator': '---'
}

Why markdown-to-data?

This library focuses on converting markdown into structured data formats that are easy to process programmatically. It's particularly useful for:

  • Working with LLMs that output markdown-formatted responses
  • Extracting structured data from markdown documentation
  • Processing markdown content in data pipelines
  • Building automation tools that work with markdown content

Limitations

  • Some extended markdown flavors might not be supported
  • Complex nested structures might need additional processing
  • Currently only supports basic markdown elements

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
