Markdown helpers & models
mkdown
Markdown Conventions for OCR Output
This project uses Markdown as the primary, self-contained format for storing OCR results and associated metadata. The goal is a single, versionable, human-readable file per processed document, which simplifies pipeline management and data provenance.
We employ a hybrid approach, using different mechanisms for different types of metadata:
1. Metadata Comments (for Non-Visual Markers)
For metadata that should not affect the visual rendering of the Markdown (like page boundaries or page-level information), we use specially formatted HTML/XML comments.
Format:
<!-- docler:data_type {json_payload} -->
- data_type: A string indicating the kind of metadata (e.g., page_break, chunk_boundary).
- {json_payload}: A standard JSON object containing the metadata key-value pairs, serialized.
Defined Types:
- page_break: Marks the transition to the specified page number. Placed immediately before the content of the new page.
  - Example Payload: {"next_page": 2}
  - Example Comment: <!-- docler:page_break {"next_page": 2} -->
- chunk_boundary: Marks a point where the document should be split into a new semantic chunk.
  - Example Payload: {"chunk_id": 1}
  - Example Comment: <!-- docler:chunk_boundary {"chunk_id": 1} -->
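As a rough illustration of how such comments could be read back, the following sketch parses them with the standard library and splits a document into pages at page_break markers. The names COMMENT_RE, parse_metadata_comments, and split_pages are hypothetical and are not part of the library; the regular expression assumes each comment sits on a single line and that the payload contains no "-->" sequence.

```python
import json
import re

# Hypothetical pattern for <!-- docler:data_type {json_payload} --> comments.
COMMENT_RE = re.compile(r"<!--\s*docler:(\w+)\s+(\{.*?\})\s*-->")


def parse_metadata_comments(text: str) -> list[tuple[str, dict]]:
    """Return (data_type, payload) pairs in document order."""
    return [(m.group(1), json.loads(m.group(2))) for m in COMMENT_RE.finditer(text)]


def split_pages(text: str, first_page: int = 1) -> dict[int, str]:
    """Split Markdown into pages at page_break comments.

    Content before the first marker is assigned to first_page (an assumption).
    Other comment types, such as chunk_boundary, are left in the page text.
    """
    pages: dict[int, str] = {}
    current, pos = first_page, 0
    for m in COMMENT_RE.finditer(text):
        if m.group(1) != "page_break":
            continue
        pages[current] = text[pos:m.start()].strip()
        current = json.loads(m.group(2))["next_page"]
        pos = m.end()
    pages[current] = text[pos:].strip()
    return pages
```

Because the markers are ordinary HTML comments, this kind of parsing can run on the raw Markdown text without a Markdown parser.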
2. HTML Figures (for Images and Diagrams)
For visual elements like images or diagrams, especially when they require richer metadata (like source code or bounding boxes), we use standard HTML structures within the Markdown. This allows direct association of metadata and handles complex data like code snippets gracefully.
Structure:
We typically use an HTML <figure> element:
<figure data-docler-type="diagram" data-diagram-id="sysarch-01">
<img src="images/system_architecture.png"
alt="System Architecture Diagram"
data-page-num="5"
style="max-width: 100%; height: auto;"
>
<figcaption>Figure 2: High-level system data flow.</figcaption>
<script type="text/docler-mermaid">
graph LR
A[Data Ingest] --> B(Processing Queue);
B --> C{Main Processor};
D --> F(API Endpoint);
</script>
</figure>
- <figure>: The container element.
  - data-docler-type: Indicates the type of figure (e.g., image, diagram).
  - Other data-* attributes can be added for figure-level metadata.
- <img>: The visual representation.
  - src, alt: Standard attributes.
  - data-*: Used for image-specific metadata like data-page-num.
  - style: Optional for basic presentation.
- <figcaption>: Optional standard HTML caption.
- <script type="text/docler-...">: Used to embed source code or other complex textual data.
  - The type attribute is custom (e.g., text/docler-mermaid, text/docler-latex) so browsers ignore it.
  - The raw code/text is placed inside, preserving formatting.
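To show how the embedded source could be recovered from such a figure, here is a minimal sketch built on the standard library's html.parser. The class FigureScriptExtractor and the function extract_scripts are hypothetical names for illustration only; the library's own parsing helpers may work differently.

```python
from html.parser import HTMLParser


class FigureScriptExtractor(HTMLParser):
    """Collect the text of <script type="text/docler-..."> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.scripts: list[tuple[str, str]] = []  # (script type, raw source)
        self._current_type: str | None = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = dict(attrs).get("type") or ""
        if tag == "script" and script_type.startswith("text/docler-"):
            self._current_type = script_type
            self._buffer = []

    def handle_data(self, data):
        # Script content is delivered as raw text, so "-->" arrows survive.
        if self._current_type is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._current_type is not None:
            self.scripts.append((self._current_type, "".join(self._buffer).strip()))
            self._current_type = None


def extract_scripts(html: str) -> list[tuple[str, str]]:
    """Return (type, source) pairs for every docler script block in html."""
    parser = FigureScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.scripts
```

The custom type prefix makes it straightforward to tell docler data blocks apart from ordinary scripts, which this sketch ignores.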
Rationale
- Comments are used for page breaks and metadata because they are guaranteed not to interfere with Markdown rendering, ensuring purely structural information remains invisible.
- HTML Figures are used for images/diagrams because HTML provides standard ways (data-*, nested elements like <script>) to directly associate rich, potentially complex or multi-line metadata (like source code) with the visual element itself.
Utilities
Helper functions for creating and parsing these metadata comments and structures are available in docler.markdown_utils.
Standardized Metadata Types
The library provides standardized metadata types for common use cases:
- Page Breaks: Use the PAGE_BREAK_TYPE constant and the create_metadata_comment() function to create page transitions:

      from docler.markdown_utils import create_metadata_comment, PAGE_BREAK_TYPE

      # Create a page break marker for page 2
      page_break = create_metadata_comment(PAGE_BREAK_TYPE, {"next_page": 2})
      # <!-- docler:page_break {"next_page":2} -->

- Chunk Boundaries: Use the CHUNK_BOUNDARY_TYPE constant and the create_chunk_boundary() function to mark semantic chunks in a document:

      from docler.markdown_utils import create_chunk_boundary

      # Create a chunk boundary marker with metadata
      chunk_marker = create_chunk_boundary(
          chunk_id=1,
          start_line=10,
          end_line=25,
          keywords=["introduction", "overview"],
          token_count=350,
      )
      # <!-- docler:chunk_boundary {"chunk_id":1,"end_line":25,"keywords":["introduction","overview"],"start_line":10,"token_count":350} -->
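The example outputs above show compact JSON with alphabetically sorted keys. The sketch below reproduces that serialization with the standard library alone. The function format_metadata_comment is a hypothetical stand-in, not the library's implementation; the use of sorted keys and compact separators is inferred from the sample output rather than from the source code.

```python
import json

PREFIX = "docler"  # comment namespace used in the examples


def format_metadata_comment(data_type: str, payload: dict) -> str:
    """Hypothetical stand-in for create_metadata_comment()."""
    # Sorted keys and compact separators give a stable, diff-friendly string.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return f"<!-- {PREFIX}:{data_type} {body} -->"


marker = format_metadata_comment("page_break", {"next_page": 2})
```

A deterministic key order keeps the generated Markdown stable between runs, which suits the goal of storing processed documents under version control.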