
Markdown helpers & models


mkdown


Read the documentation!

Markdown Conventions for OCR Output

This project uses Markdown as the primary, self-contained format for storing OCR results and their metadata. The goal is a single, versionable, human-readable file for each processed document, which simplifies pipeline management and data provenance.

We employ a hybrid approach, using different mechanisms for different types of metadata:

1. Metadata Comments (for Non-Visual Markers)

For metadata that should not affect the visual rendering of the Markdown (like page boundaries or page-level information), we use specially formatted HTML/XML comments.

Format:

<!-- docler:data_type {json_payload} -->
  • data_type: A string indicating the kind of metadata (e.g., page_break, chunk_boundary).
  • {json_payload}: A standard JSON object containing the metadata key-value pairs, serialized.
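
The format above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: make_metadata_comment is a hypothetical name, and the compact, key-sorted JSON style is an assumption chosen to keep the output stable across runs.

```python
import json

PREFIX = "docler"  # namespace taken from the format line above


def make_metadata_comment(data_type: str, payload: dict) -> str:
    """Hypothetical helper: serialize a payload as a metadata comment."""
    # Compact separators and sorted keys are an assumption made here so that
    # the same payload always produces the same comment text.
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    if "-->" in body:
        # A literal "-->" inside the payload would end the comment early.
        raise ValueError("payload must not contain '-->'")
    return f"<!-- {PREFIX}:{data_type} {body} -->"


print(make_metadata_comment("page_break", {"next_page": 2}))
# <!-- docler:page_break {"next_page":2} -->
```

Rejecting payloads that contain "-->" is one simple way to keep the comment well-formed; escaping the sequence would be an alternative.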

Defined Types:

  • page_break: Marks the transition to the specified page number. Placed immediately before the content of the new page.
    • Example Payload: {"next_page": 2}
    • Example Comment: <!-- docler:page_break {"next_page": 2} -->
  • chunk_boundary: Marks a point where the document should be split into semantic chunks.
    • Example Payload: {"chunk_id": 1}
    • Example Comment: <!-- docler:chunk_boundary {"chunk_id": 1} -->
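
Reading the markers back can be sketched with a regular expression. The names MARKER_RE and split_pages are hypothetical, and two assumptions are made that the conventions above do not state: content before the first page_break belongs to page 1, and payloads never contain the sequence "-->".

```python
import json
import re

# Matches <!-- docler:<type> {json} -->; assumes the payload contains no "-->".
MARKER_RE = re.compile(r"<!--\s*docler:(\w+)\s+(\{.*?\})\s*-->", re.DOTALL)


def split_pages(markdown: str, first_page: int = 1) -> dict:
    """Split text into {page_number: content} at page_break markers."""
    pages = {}
    current, pos = first_page, 0
    for match in MARKER_RE.finditer(markdown):
        if match.group(1) != "page_break":
            continue  # other marker types stay inside the page text
        pages[current] = markdown[pos:match.start()].strip()
        current = json.loads(match.group(2))["next_page"]
        pos = match.end()
    pages[current] = markdown[pos:].strip()
    return pages


doc = (
    "Intro text\n"
    '<!-- docler:page_break {"next_page": 2} -->\n'
    "Second page text\n"
)
print(split_pages(doc))
# {1: 'Intro text', 2: 'Second page text'}
```

Markers of other types, such as chunk_boundary, are left in place so a later pass can process them.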

2. HTML Figures (for Images and Diagrams)

For visual elements like images or diagrams, especially when they require richer metadata (like source code or bounding boxes), we use standard HTML structures within the Markdown. This allows direct association of metadata and handles complex data like code snippets gracefully.

Structure:

We typically use an HTML <figure> element:

<figure data-docler-type="diagram" data-diagram-id="sysarch-01">
  <img src="images/system_architecture.png"
       alt="System Architecture Diagram"
       data-page-num="5"
       style="max-width: 100%; height: auto;"
       >
  <figcaption>Figure 2: High-level system data flow.</figcaption>
  <script type="text/docler-mermaid">
    graph LR
        A[Data Ingest] --> B(Processing Queue);
        B --> C{Main Processor};
        D --> F(API Endpoint);
  </script>
</figure>
  • <figure>: The container element.
    • data-docler-type: Indicates the type of figure (e.g., image, diagram).
    • Other data-* attributes can be added for figure-level metadata.
  • <img>: The visual representation.
    • src, alt: Standard attributes.
    • data-*: Used for image-specific metadata like data-page-num.
    • style: Optional for basic presentation.
  • <figcaption>: Optional standard HTML caption.
  • <script type="text/docler-...">: Used to embed source code or other complex textual data.
    • The type attribute is custom (e.g., text/docler-mermaid, text/docler-latex) so browsers ignore it.
    • The raw code/text is placed inside, preserving formatting.
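
As a rough illustration of how a consumer might read such a figure back, the sketch below uses Python's built-in html.parser. FigureParser and the keys of the dictionaries it collects are hypothetical choices for this example, not part of the library.

```python
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Hypothetical reader for <figure> blocks following the structure above."""

    def __init__(self):
        super().__init__()
        self.figures = []      # one dict per completed <figure>
        self._current = None   # figure being parsed, if any
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "figure":
            self._current = {"figure": attrs, "img": {},
                             "source_type": None, "source": ""}
        elif self._current is None:
            return  # ignore tags outside a figure
        elif tag == "img":
            self._current["img"] = attrs
        elif tag == "script" and (attrs.get("type") or "").startswith("text/docler-"):
            self._current["source_type"] = attrs["type"]
            self._in_script = True

    def handle_data(self, data):
        # Script content arrives as raw text, so formatting is preserved.
        if self._in_script and self._current is not None:
            self._current["source"] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag == "figure" and self._current is not None:
            self.figures.append(self._current)
            self._current = None


sample = """
<figure data-docler-type="diagram" data-diagram-id="sysarch-01">
  <img src="images/system_architecture.png" alt="System Architecture Diagram" data-page-num="5">
  <figcaption>Figure 2: High-level system data flow.</figcaption>
  <script type="text/docler-mermaid">
    graph LR
        A[Data Ingest] --> B(Processing Queue);
  </script>
</figure>
"""
parser = FigureParser()
parser.feed(sample)
parser.close()
figure = parser.figures[0]
print(figure["figure"]["data-docler-type"], figure["img"]["data-page-num"])
```

Because the parser treats script content as raw text, the embedded Mermaid source comes back unescaped and with its indentation intact.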

Rationale

  • Comments are used for page breaks and other non-visual metadata because they do not appear in the rendered output, so purely structural information stays invisible.
  • HTML Figures are used for images/diagrams because HTML provides standard ways (data-*, nested elements like <script>) to directly associate rich, potentially complex or multi-line metadata (like source code) with the visual element itself.

Utilities

Helper functions for creating and parsing these metadata comments and structures are available in docler.markdown_utils.

Standardized Metadata Types

The library provides standardized metadata types for common use cases:

  1. Page Breaks: Use the PAGE_BREAK_TYPE constant with the create_metadata_comment() function to mark page transitions:

    from docler.markdown_utils import create_metadata_comment, PAGE_BREAK_TYPE
    
    # Create a page break marker for page 2
    page_break = create_metadata_comment(PAGE_BREAK_TYPE, {"next_page": 2})
    # <!-- docler:page_break {"next_page":2} -->
    
  2. Chunk Boundaries: Use the CHUNK_BOUNDARY_TYPE constant and the create_chunk_boundary() function to mark semantic chunks in a document:

    from docler.markdown_utils import create_chunk_boundary
    
    # Create a chunk boundary marker with metadata
    chunk_marker = create_chunk_boundary(
        chunk_id=1,
        start_line=10,
        end_line=25,
        keywords=["introduction", "overview"],
        token_count=350,
    )
    # <!-- docler:chunk_boundary {"chunk_id":1,"end_line":25,"keywords":["introduction","overview"],"start_line":10,"token_count":350} -->
    
