Pydantic models for representing a text document as a hierarchical structure.

These details have been verified by PyPI

Maintainers

oneofftech

These details have not been verified by PyPI

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python

Project description

Parse Document Model (Python)

Parse Document Model (Python) provides Pydantic models for representing text documents using a hierarchical model. This library allows you to define documents as a hierarchy of (specialised) nodes where each node can represent a document, page, text, heading, body, and more.

These models aim to preserve the underlying structure of text documents for further processing, such as creating a table of contents or transforming between formats, e.g. converting a parsed PDF to Markdown.

Hierarchical structure: The document is modelled as a hierarchy of nodes. Each node can represent the document itself, a page, or a piece of text.
Rich text support: Nodes can represent not only the content but also the formatting (e.g. bold, italic) applied to the text.
Attributes: Each node can have attributes that provide additional information such as page number, bounding box, etc.
Built-in validation and types: Built with Pydantic, ensuring type safety, validation and effortless creation of complex document structures.

Requirements

Python 3.12 or above (Python 3.9, 3.10 and 3.11 are supported on a best-effort basis).

Next steps

Explore the document model
Install the library and use the models

Document Model Overview

We represent the document structure as a hierarchy so that the inherent structure is preserved when chapters, sections and headings are used. Consider a generic document with two pages, each containing one heading and one paragraph of text. The resulting representation might be the following.

Document
 ├─Page
 │  ├─Text (category: heading)
 │  └─Text (category: body)
 └─Page
    ├─Text (category: heading)
    └─Text (category: body)

At a glance you can see the structure: the document is composed of two pages, and there are two headings. To achieve this, we defined a hierarchy around the concept of a Node, like a node in a graph.
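
The tree above can be sketched with plain nested dictionaries. This is only an illustration of the shape: the dictionaries are stand-ins rather than the library's Pydantic models, and the sample strings and the `outline` helper are hypothetical.

```python
# Plain-dict stand-in for the example tree; these are not the library's models.
example = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {"category": "heading", "content": "Heading 1"},
                {"category": "body", "content": "First paragraph."},
            ],
        },
        {
            "category": "page",
            "content": [
                {"category": "heading", "content": "Heading 2"},
                {"category": "body", "content": "Second paragraph."},
            ],
        },
    ],
}


def outline(node, depth=0):
    """Return one indented line per node, walking the tree depth-first."""
    lines = ["  " * depth + node["category"]]
    children = node["content"]
    # Structured nodes hold a list of child nodes; text nodes hold a string.
    if isinstance(children, list):
        for child in children:
            lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(example)))
```

The recursion relies only on the distinction between list content (document, page) and string content (text), which is the same distinction the node types below formalise.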

Node types

classDiagram
    class Node
    Node <|-- StructuredNode
    Node <|-- Text
    StructuredNode <|-- Document
    StructuredNode <|-- Page

1. Node (Base Class)

This is the abstract class from which all other nodes inherit.

Each node has:

category: The type of the node (e.g., doc, page, heading).
attributes: Optional field to attach extra data to a node. See Attributes.

2. StructuredNode

This extends Node. It represents the hierarchy as a node whose content is a list of other nodes, such as Document and Page.

content: List of Node.

3. Document

This is the root node of a document.

category: Always set to "doc".
attributes: Document-wide attributes can be set here.
content: List of Page nodes that form the document.

4. Page

Represents a page in the document:

category: Always set to "page".
attributes: Can contain metadata like page number.
content: List of Text nodes on the page.

5. Text

This node represents a paragraph, a heading or any other text within the document.

category: The category of the text within the document, e.g. heading, title.
content: A string representing the textual content.
marks: List of marks applied to the text, such as bold, italic, etc.
attributes: Can contain metadata like the bounding box representing where this portion of text is located in the page.
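
To make the inheritance in the class diagram concrete, here is a minimal sketch that mirrors the five node types with standard-library dataclasses. It is a stand-in written for illustration and assumes nothing about the library's internals: the real models are built with Pydantic and add validation that this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """Base node: a category plus optional attributes."""
    category: str
    attributes: Optional[dict[str, Any]] = None


@dataclass
class StructuredNode(Node):
    """A node whose content is a list of other nodes."""
    content: list[Node] = field(default_factory=list)


@dataclass
class Text(Node):
    """A leaf node whose content is a string, with optional marks."""
    content: str = ""
    marks: list[Any] = field(default_factory=list)


@dataclass
class Page(StructuredNode):
    category: str = "page"  # fixed category for pages


@dataclass
class Document(StructuredNode):
    category: str = "doc"  # fixed category for the root node


page = Page(content=[Text(category="heading", content="Introduction")])
doc = Document(content=[page])
```

Document and Page only pin the category, so all the structural behaviour lives in StructuredNode, while Text is the only type carrying string content and marks.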

Marks

Marks add style or functionality to the text within a Text node. Examples include bold text, italic text, links and custom styles such as font or colour.

Mark Types

Bold: Represents bold text.
Italic: Represents italic text.
TextStyle: Allows customization of font and color.
Link: Represents a hyperlink.

Marks are validated and enforced with the help of Pydantic model validators.
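
One way to consume marks is to map them onto another format. The sketch below wraps a string in Markdown syntax for each mark. It assumes marks are available as dictionaries with a `category` key and, for links, a `url` key; both field names are hypothetical and may differ from the library's mark models.

```python
def apply_marks(text, marks):
    """Wrap text in Markdown syntax for each mark, innermost first."""
    for mark in marks:
        kind = mark["category"]  # hypothetical field name
        if kind == "bold":
            text = f"**{text}**"
        elif kind == "italic":
            text = f"*{text}*"
        elif kind == "link":
            text = f"[{text}]({mark['url']})"  # hypothetical field name
        # Marks without a Markdown equivalent (e.g. text styles) are skipped.
    return text


print(apply_marks("docs", [{"category": "bold"}, {"category": "link", "url": "https://example.com"}]))
```

Marks are applied in list order, so an earlier mark ends up nested inside a later one.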

Attributes

Attributes are optional fields that can store additional information for each node. Some predefined attributes are:

DocumentAttributes: General attributes for the document (currently reserved for the future).
PageAttributes: Page-related attributes, such as the page number.
TextAttributes: Text-related attributes, such as bounding boxes or level.
BoundingBox: A box that specifies the position of a text in the page.
Level: The specific level of the text within a document, for example, for headings.
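
Attributes such as the page number and the heading level are what make a table of contents possible. The following sketch collects headings from a plain-dict stand-in for a document. The attribute keys `page` and `level` and the `table_of_contents` helper are assumptions made for illustration, not the library's field names.

```python
def table_of_contents(document):
    """Collect (level, title, page number) for every heading in the tree."""
    entries = []
    for page in document["content"]:
        page_number = (page.get("attributes") or {}).get("page")
        for text in page["content"]:
            if text["category"] != "heading":
                continue
            # Headings without an explicit level are treated as top-level.
            level = (text.get("attributes") or {}).get("level", 1)
            entries.append((level, text["content"], page_number))
    return entries


sample = {
    "category": "doc",
    "content": [
        {"category": "page", "attributes": {"page": 1}, "content": [
            {"category": "heading", "attributes": {"level": 1}, "content": "Introduction"},
            {"category": "body", "content": "Some text."},
        ]},
        {"category": "page", "attributes": {"page": 2}, "content": [
            {"category": "heading", "attributes": {"level": 2}, "content": "Background"},
        ]},
    ],
}

for level, title, page_number in table_of_contents(sample):
    print("  " * (level - 1) + f"{title} (page {page_number})")
```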

Getting started

Installation

Parse Document Model is distributed via PyPI. You can install it with pip.

pip install parse-document-model

Quick Example

Here’s how you can represent a simple document with one page and some text:

# Assumes the import package matches the distribution name, parse_document_model.
from parse_document_model.document import Document, Page, Text

doc = Document(
    category="doc",
    content=[
        Page(
            category="page",
            content=[
                Text(
                    category="heading",
                    content="Welcome to parse-document-model",
                    marks=["bold"]
                ),
                Text(
                    category="body",
                    content="This is an example text using the document model."
                )
            ]
        )
    ]
)
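
As a rough illustration of format conversion, the sketch below turns a document of the same shape into Markdown. To stay self-contained it works on a plain-dict stand-in rather than the Pydantic models, and the `to_markdown` helper is hypothetical. It also ignores marks and renders every heading at the top level.

```python
def to_markdown(document):
    """Render headings as '# ' lines and any other text as paragraphs."""
    blocks = []
    for page in document["content"]:
        for text in page["content"]:
            if text["category"] == "heading":
                blocks.append("# " + text["content"])
            else:
                blocks.append(text["content"])
    # Markdown separates blocks with a blank line.
    return "\n\n".join(blocks)


quick_example = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {"category": "heading", "content": "Welcome to parse-document-model", "marks": ["bold"]},
                {"category": "body", "content": "This is an example text using the document model."},
            ],
        }
    ],
}

print(to_markdown(quick_example))
```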

Testing

Parse Document Model is tested using pytest. Tests run for each commit and pull request.

Install the dependencies.

pip install -r requirements.txt -r requirements-dev.txt

Execute the test suite.

pytest

Contributing

Thank you for considering contributing to the Parse Document Model! The contribution guide can be found in the CONTRIBUTING.md file.

Note: Consider opening a discussion before submitting a pull request that changes the model structures.

Security Vulnerabilities

Please review our security policy on how to report security vulnerabilities.

Credits

Supporters

The project is provided and supported by OneOff-Tech (UG).

Acknowledgements

The format and structure take inspiration from ProseMirror.

License

The MIT License (MIT). Please see License File for more information.

Release history

0.2.2 (this version): Apr 8, 2025
0.2.1: Mar 25, 2025
0.2.0: Sep 24, 2024
0.1.0: Sep 17, 2024

Download files

Source Distribution

parse_document_model-0.2.2.tar.gz (8.3 kB)

Uploaded Apr 8, 2025 Source

Built Distribution

parse_document_model-0.2.2-py3-none-any.whl (7.9 kB)

Uploaded Apr 8, 2025 Python 3

File details

Details for the file parse_document_model-0.2.2.tar.gz.

File metadata

Download URL: parse_document_model-0.2.2.tar.gz
Upload date: Apr 8, 2025
Size: 8.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for parse_document_model-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`3f6a5ed2c2222103551885d40cb1d4638659a4b77a5a43b37a1dd539d6b3d9b5`
MD5	`0d1de1e5a7764d8bb3be093f18816d9a`
BLAKE2b-256	`897b88b504ded0c70c139d1d67a752b6f17307fb5f928c470226e90772c4dd53`

See more details on using hashes here.
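
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch: the helper names are hypothetical, and the path passed in is assumed to point at a local copy of the source distribution.

```python
import hashlib

# SHA256 digest listed above for parse_document_model-0.2.2.tar.gz.
EXPECTED_SHA256 = "3f6a5ed2c2222103551885d40cb1d4638659a4b77a5a43b37a1dd539d6b3d9b5"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the published one."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_expected("parse_document_model-0.2.2.tar.gz")` on a downloaded copy should return True if the file is intact. pip can perform the same check at install time through its hash-checking mode.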

Provenance

The following attestation bundles were made for parse_document_model-0.2.2.tar.gz:

Publisher: release.yml on OneOffTech/parse-document-model-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parse_document_model-0.2.2.tar.gz
- Subject digest: 3f6a5ed2c2222103551885d40cb1d4638659a4b77a5a43b37a1dd539d6b3d9b5
- Sigstore transparency entry: 193984683
- Sigstore integration time: Apr 8, 2025
Source repository:
- Permalink: OneOffTech/parse-document-model-python@85149ee3f7c1451df44bd579585f974676be7996
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/OneOffTech
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@85149ee3f7c1451df44bd579585f974676be7996
- Trigger Event: release

File details

Details for the file parse_document_model-0.2.2-py3-none-any.whl.

File metadata

Download URL: parse_document_model-0.2.2-py3-none-any.whl
Upload date: Apr 8, 2025
Size: 7.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for parse_document_model-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2771a20eb05129f3108ecd9ed59cbf518cf9061bdf535085f30c69db2908ca52`
MD5	`556c46d47c0190fee89f98dfd4543cfa`
BLAKE2b-256	`fae5281e9ae974e1dd9b53db862d86d356e3c7119ccca70d221908051ee89cee`

Provenance

The following attestation bundles were made for parse_document_model-0.2.2-py3-none-any.whl:

Publisher: release.yml on OneOffTech/parse-document-model-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parse_document_model-0.2.2-py3-none-any.whl
- Subject digest: 2771a20eb05129f3108ecd9ed59cbf518cf9061bdf535085f30c69db2908ca52
- Sigstore transparency entry: 193984698
- Sigstore integration time: Apr 8, 2025
Source repository:
- Permalink: OneOffTech/parse-document-model-python@85149ee3f7c1451df44bd579585f974676be7996
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/OneOffTech
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@85149ee3f7c1451df44bd579585f974676be7996
- Trigger Event: release
