OpenPecha toolkit version 2

OpenPecha Toolkit V2

A Python package for working with stand-off text annotations in the OpenPecha framework, built around the Stand-off Text Annotation Model (STAM). Toolkit V2 supports parsing, transformation, and serialization of annotated Buddhist textual corpora.


Introduction

Toolkit V2 is the next-generation Python toolkit for managing annotated texts in the OpenPecha ecosystem. It provides:

  • Tools for creating, editing, and serializing annotated corpora using the STAM model.
  • Support for multiple annotation types (segmentation, alignment, pagination, language, etc.).
  • Parsers for various input formats (DOCX, OCR, Pedurma, etc.).
  • Serializers for exporting annotated data.

STAM (Stand-off Text Annotation Model) is a flexible data model for representing all information about a text as stand-off annotations, keeping the base text and annotations separate for maximum interoperability.
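
To make the stand-off idea concrete, here is a minimal sketch in plain Python. It does not use the toolkit or a STAM library; the `Annotation` class and `resolve` helper are hypothetical names introduced only for illustration. The point it shows is that annotations store character offsets into the base text rather than modifying the text itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a label attached to a character span."""
    kind: str
    start: int  # inclusive character offset into the base text
    end: int    # exclusive character offset into the base text


def resolve(base_text: str, ann: Annotation) -> str:
    """Return the slice of the base text that an annotation points at."""
    return base_text[ann.start:ann.end]


# The base text is never edited; annotations only reference positions in it.
base_text = "First line. Second line."
annotations = [
    Annotation("segment", 0, 11),
    Annotation("segment", 12, 24),
]

segments = [resolve(base_text, ann) for ann in annotations]
print(segments)
```

Because the base text stays untouched, any number of annotation sets can describe the same text without conflicting with each other.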

The OpenPecha Backend, hosted on Firebase, serves as the central storage system for texts and their corresponding annotations. The toolkit handles parsing, editing, and serialization, while all storage, access, and import operations are managed by the backend.


Installation

Stable version:

pip install openpecha

Development version:

pip install git+https://github.com/OpenPecha/toolkit-v2.git

Key Concepts

Pecha

A Pecha is the core data model representing a text corpus with its annotations and metadata. Each Pecha:

  • Has a unique ID (an 8-character ID derived from a UUID)
  • Contains one or more base texts
  • Stores multiple annotation layers
  • Includes metadata (title, author, language, etc.)
  • Can be created from scratch or parsed from various formats (DOCX, OCR, etc.)

Directory layout of a Pecha:

├── metadata.json
├── base/
│   ├── base1.txt
│   └── base2.txt
└── layers/
    ├── segmentation-1234.json
    ├── alignment-5678.json
    ├── pagination-9012.json
    └── footnote-3456.json

Example of a Pecha's internal structure:

├── metadata.json
│   ├── id: "P0001"
│   ├── title: {"en": "Sample Text", "bo": "དཔེ་ཚན།"}
│   ├── author: "Author Name"
│   └── language: "bo"
├── base/
│   └── base1.txt
│       └── "ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ།..."
└── layers/
    ├── Segmentation-1234.json
    │   └── {"index": 1, "span": {"start": 0, "end": 10}, ...}
    ├── Alignment-5678.json
    │   └── {"alignment_index": "1-2", "span": {"start": 0, "end": 20}, ...}
    └── Pagination-9012.json
        └── {"page": 1, "span": {"start": 0, "end": 100}, ...}
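
The layout above can be reproduced with the standard library alone. The following is a rough sketch, not the toolkit's own API: `write_pecha_layout` and `list_layout` are hypothetical helpers, and the layer content is the simplified JSON shown in the example rather than a real STAM file.

```python
import json
import tempfile
from pathlib import Path


def write_pecha_layout(root: Path, metadata: dict, bases: dict, layers: dict) -> None:
    """Write metadata.json, base/*.txt and layers/*.json under root (hypothetical helper)."""
    (root / "base").mkdir(parents=True, exist_ok=True)
    (root / "layers").mkdir(parents=True, exist_ok=True)
    (root / "metadata.json").write_text(
        json.dumps(metadata, ensure_ascii=False), encoding="utf-8"
    )
    for name, text in bases.items():
        (root / "base" / f"{name}.txt").write_text(text, encoding="utf-8")
    for name, annotations in layers.items():
        (root / "layers" / f"{name}.json").write_text(
            json.dumps(annotations, ensure_ascii=False), encoding="utf-8"
        )


def list_layout(root: Path) -> list:
    """Return the relative POSIX paths of all files under root, sorted."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "P0001"
    write_pecha_layout(
        root,
        metadata={"id": "P0001", "title": {"en": "Sample Text"}, "language": "bo"},
        bases={"base1": "ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ།"},
        layers={"Segmentation-1234": [{"index": 1, "span": {"start": 0, "end": 10}}]},
    )
    layout = list_layout(root)
    metadata_roundtrip = json.loads((root / "metadata.json").read_text(encoding="utf-8"))

print(layout)
```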

Layer

A Layer is a collection of annotations of a specific type for a given base text. Key features:

  • Each layer has a specific type (e.g., Segmentation, Alignment, Pagination)
  • Layers are stored as JSON files in the STAM format
  • Common layer types include:
    • Segmentation: Divides text into meaningful segments
    • Alignment: Maps segments between different texts (e.g., root text and commentary)
    • Pagination: Marks page boundaries
    • Language: Indicates language of text segments
    • Footnote: Contains footnote annotations
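
As an illustration of how independent layers can describe the same base text, the sketch below combines a segmentation layer and a pagination layer to find the page on which each segment starts. The dictionaries follow the simplified shape used in the structure example, not the actual STAM JSON, and `page_of` is a hypothetical helper.

```python
base_text = "aaaa bbbb cccc dddd"

# Two hypothetical layers over the same base text; each is a list of span annotations.
segmentation = [
    {"index": 1, "span": {"start": 0, "end": 9}},
    {"index": 2, "span": {"start": 10, "end": 19}},
]
pagination = [
    {"page": 1, "span": {"start": 0, "end": 10}},
    {"page": 2, "span": {"start": 10, "end": 19}},
]


def page_of(offset, pagination_layer):
    """Return the page whose span contains the given character offset, or None."""
    for ann in pagination_layer:
        span = ann["span"]
        if span["start"] <= offset < span["end"]:
            return ann["page"]
    return None


# Text of each segment, read from the shared base text.
segment_texts = {
    seg["index"]: base_text[seg["span"]["start"]:seg["span"]["end"]]
    for seg in segmentation
}
# Page on which each segment starts, looked up in the pagination layer.
segment_pages = {
    seg["index"]: page_of(seg["span"]["start"], pagination)
    for seg in segmentation
}
print(segment_texts, segment_pages)
```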

STAM (Stand-off Text Annotation Model)

STAM is the underlying data format for storing annotations. It:

  • Keeps base text and annotations separate
  • Uses a flexible JSON structure
  • Supports multiple annotation types
  • Enables interoperability between different systems
  • Allows for complex annotation relationships

Alignment Transfer

Alignment maps relationships between two or more texts. It is the basis for creating parallel texts, which are widely used in translation, commentary analysis, and language learning. Alignments link corresponding sections across different versions or types of texts, for example between a root text and its translation or commentary.
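
A minimal sketch of the idea, in plain Python: an alignment is treated here as a list of links between segment indices of a root text and segment indices of a target text. The data shape and the `aligned_pairs` helper are assumptions made for illustration and are not the toolkit's alignment format.

```python
root_segments = {
    1: "root sentence one",
    2: "root sentence two",
    3: "root sentence three",
}
translation_segments = {
    1: "translation A",
    2: "translation B",
}

# Hypothetical alignment: each link joins one or more root segments
# to one or more target segments.
alignment = [
    {"root": [1], "target": [1]},
    {"root": [2, 3], "target": [2]},
]


def aligned_pairs(links, root, target):
    """Return (root_text, target_text) pairs, joining multi-segment links with spaces."""
    pairs = []
    for link in links:
        root_text = " ".join(root[i] for i in link["root"])
        target_text = " ".join(target[i] for i in link["target"])
        pairs.append((root_text, target_text))
    return pairs


pairs = aligned_pairs(alignment, root_segments, translation_segments)
for root_text, target_text in pairs:
    print(f"{root_text!r} <-> {target_text!r}")
```

Allowing several indices on either side of a link covers the common case where one passage of a translation or commentary corresponds to more than one segment of the root text.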


Getting Started & Usage Guide

To get started and explore all features, see the Getting Started & Usage Guide.


Tutorial Guide

For a story-driven walkthrough of parsing, annotating, and serializing a Tibetan text, with code and explanations, see the Tutorial Guide.

Serializer

The JsonSerializer class provides utilities for extracting and serializing annotation data from a Pecha. Key methods include:

  • get_base(pecha): Returns the base text from the first base in the given Pecha.
  • to_dict(ann_store, ann_type): Converts an AnnotationStore to a list of annotation dictionaries for the given annotation type.
  • get_edition_base(pecha, edition_layer_path): Constructs a new base text by applying version variant operations (insertions/deletions) from an edition layer.
  • serialize(pecha, manifestation_info): Serializes a Pecha with its annotations based on manifestation information, returning base text and annotations.
  • serialize_edition_annotations(pecha, edition_layer_path, layer_path): Serializes annotations that are based on an edition base rather than the original base.

See the API Reference for full details and usage examples.
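
The idea behind building an edition base from insertions and deletions can be sketched without the toolkit. The code below is an illustration only and does not reproduce `get_edition_base`: the operation dictionaries and `apply_variant_operations` are hypothetical, and the sketch assumes non-overlapping operations whose offsets refer to the original base text.

```python
def apply_variant_operations(base_text, operations):
    """Apply insert/delete operations whose offsets refer to the original base text."""
    result = base_text
    # Work from the highest offset down so that earlier offsets remain valid.
    for op in sorted(operations, key=lambda o: o["start"], reverse=True):
        if op["type"] == "delete":
            result = result[:op["start"]] + result[op["end"]:]
        elif op["type"] == "insert":
            result = result[:op["start"]] + op["text"] + result[op["start"]:]
        else:
            raise ValueError(f"unknown operation type: {op['type']!r}")
    return result


base_text = "The quick brown fox"
operations = [
    {"type": "delete", "start": 4, "end": 10},        # removes "quick "
    {"type": "insert", "start": 16, "text": "red "},  # inserts before "fox"
]

edition_base = apply_variant_operations(base_text, operations)
print(edition_base)
```

Applying the operations in descending offset order avoids having to recompute positions after each change.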

API Reference

For a detailed list of classes and methods, see the API Reference.


Contributing

We welcome contributions! Please open issues or pull requests. For major changes, please open an issue first to discuss what you would like to change.


License

This project is licensed under the MIT License. See the LICENSE file for details.

