
eScriptorium Collate


A Python library for collating eScriptorium documents. This is a pre-release version in public alpha.



Installation

Requirements

  • Python 3
  • Java Runtime Environment (< 15)

Vendored Binaries

This package uses the CollateX Java archive collatex-tools-1.7.1.jar. The JAR file is bundled with the package, so there is no need to download it separately. However, your system must have a working Java Runtime Environment (version < 15) accessible under JAVA_HOME. See the CollateX documentation for more information.
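The Java requirement above can be checked before a collation is attempted. The following is a minimal sketch and not part of escriptorium-collate: the helper names parse_java_major and java_home_is_compatible are hypothetical, and the code assumes the java executable lives at $JAVA_HOME/bin/java.

```python
import os
import re
import subprocess
from pathlib import Path
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_292" denotes Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_home_is_compatible(max_major_exclusive: int = 15) -> bool:
    """Return True if JAVA_HOME points at a JRE older than the given major version."""
    java_home = os.environ.get("JAVA_HOME")
    if not java_home:
        return False
    java_bin = Path(java_home) / "bin" / "java"
    try:
        result = subprocess.run(
            [str(java_bin), "-version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return False
    # `java -version` normally writes to stderr rather than stdout.
    major = parse_java_major(result.stderr or result.stdout)
    return major is not None and major < max_major_exclusive
```

Calling java_home_is_compatible() returns False when JAVA_HOME is unset, when the executable cannot be run, or when the reported version is 15 or newer.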

Virtual Environment

Before installing the package, it is a good idea to create a Python virtual environment:

pip install virtualenv
virtualenv -p python3 venv
source venv/bin/activate

Alternatively:

python3 -m venv venv
source venv/bin/activate

Click here for a more detailed guide to Python virtual environments.

Install

Once the virtual environment is activated, install the package:

pip install escriptorium-collate

Quick Start

Instantiate the eScriptorium Connector

import os

from dotenv import load_dotenv
from escriptorium_connector import EscriptoriumConnector

load_dotenv(override=True)
url = os.getenv("ESCRIPTORIUM_URL")
username = os.getenv("ESCRIPTORIUM_USERNAME")
password = os.getenv("ESCRIPTORIUM_PASSWORD")
api_key = os.getenv("ESCRIPTORIUM_API_KEY")

# Prefer the API key; fall back to username and password.
# (Wrapping os.getenv in str() would turn a missing value into the
# truthy string "None", so the values are used as returned.)
if api_key:
    escr = EscriptoriumConnector(url, api_key=api_key)
else:
    escr = EscriptoriumConnector(url, username, password)

The .env file should look like this:

ESCRIPTORIUM_URL=your_escriptorium_url
ESCRIPTORIUM_API_KEY=your_escriptorium_api_key
ESCRIPTORIUM_USERNAME=your_escriptorium_username
ESCRIPTORIUM_PASSWORD=your_escriptorium_password

Provide either ESCRIPTORIUM_API_KEY, or both ESCRIPTORIUM_USERNAME and ESCRIPTORIUM_PASSWORD.
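The credential rule can be expressed as a small pure function that is easy to test without a running eScriptorium instance. This is only a sketch: select_credentials and the keys of the dictionary it returns are hypothetical names chosen for illustration, not part of escriptorium-collate or escriptorium_connector.

```python
from typing import Dict, Mapping


def select_credentials(env: Mapping[str, str]) -> Dict[str, str]:
    """Pick API-key or username/password credentials from an environment mapping."""
    url = env.get("ESCRIPTORIUM_URL")
    if not url:
        raise ValueError("ESCRIPTORIUM_URL is not set")

    # An API key alone is sufficient and takes precedence.
    api_key = env.get("ESCRIPTORIUM_API_KEY")
    if api_key:
        return {"url": url, "api_key": api_key}

    # Otherwise both the username and the password are required.
    username = env.get("ESCRIPTORIUM_USERNAME")
    password = env.get("ESCRIPTORIUM_PASSWORD")
    if username and password:
        return {"url": url, "username": username, "password": password}

    raise ValueError(
        "Set ESCRIPTORIUM_API_KEY, or both ESCRIPTORIUM_USERNAME "
        "and ESCRIPTORIUM_PASSWORD"
    )
```

Passing os.environ to select_credentials after load_dotenv() yields the values to hand to EscriptoriumConnector, and a missing variable raises an error early instead of producing a failed login later.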

Instantiate the Witnesses to be Collated

from escriptorium_collate.collate import Witness

witnesses = [
    Witness(
        doc_pk=1,
        siglum="A",
        diplomatic_transcription_name="diplomatic",
        normalized_transcription_name="normalized",
    ),
    Witness(
        doc_pk=2,
        siglum="B",
        diplomatic_transcription_name="diplomatic",
        normalized_transcription_name="normalized",
    ),
    Witness(
        doc_pk=3,
        siglum="C",
        diplomatic_transcription_name="diplomatic",
        normalized_transcription_name="normalized",
    ),
]

Instantiate the Arguments to be Passed to CollateX

from escriptorium_collate.collate import CollatexArgs

collatex_args = CollatexArgs()

Return the CollateX Results as a Python Dictionary

from escriptorium_collate.collate import collate

collatex_output = collate(escr=escr, witnesses=witnesses, collatex_args=collatex_args)
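Once the result is available as a dictionary, the variant readings can be pulled out with plain Python. The sketch below assumes that the output follows CollateX's JSON alignment-table layout (a "witnesses" list of sigla and a "table" of columns, each holding one list of tokens per witness) and that tokens carry "t" and "n" keys. The helper names and the sample data are hypothetical.

```python
def reading(cell):
    """Join the normalized forms ("n", else "t") of the tokens in one table cell."""
    parts = []
    for token in cell or []:  # a gap may appear as an empty list or as None
        if isinstance(token, dict):
            parts.append(str(token.get("n", token.get("t", ""))).strip())
        else:
            parts.append(str(token).strip())
    return " ".join(parts)


def variant_columns(output):
    """Return (column index, {siglum: reading}) for columns where witnesses differ."""
    sigla = output["witnesses"]
    variants = []
    for index, column in enumerate(output["table"]):
        readings = {siglum: reading(cell) for siglum, cell in zip(sigla, column)}
        if len(set(readings.values())) > 1:
            variants.append((index, readings))
    return variants


# Hand-written sample in the assumed layout, not real collation output.
sample_output = {
    "witnesses": ["A", "B"],
    "table": [
        [[{"t": "The", "n": "the"}], [{"t": "the", "n": "the"}]],
        [[{"t": "black", "n": "black"}], [{"t": "white", "n": "white"}]],
        [[{"t": "cat", "n": "cat"}], []],
    ],
}

for index, readings in variant_columns(sample_output):
    print(index, readings)
```

Comparing on the normalized form means that purely orthographic differences, such as "The" against "the" in the first column, are not reported as variants.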

API

This package contains two modules: escriptorium_collate/collate.py and escriptorium_collate/transcription_layers.py.

escriptorium_collate.collate

escriptorium_collate.collate.Witness

An interface for defining an eScriptorium document as a witness to be passed to CollateX.

class Witness(BaseModel):
  doc_pk: int # Primary key of an eScriptorium document (int)
  siglum: str # Arbitrary siglum to be used in the critical apparatus (str)
  diplomatic_transcription_pk: int | None
  diplomatic_transcription_name: str | None
  normalized_transcription_pk: int | None
  normalized_transcription_name: str | None

If diplomatic_transcription_pk is provided, diplomatic_transcription_name is ignored. Likewise, if normalized_transcription_pk is provided, normalized_transcription_name is ignored.
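The precedence rule can be illustrated with a short, self-contained function. This is a sketch of the documented behaviour only, not the library's own implementation: resolve_transcription_pk and its lookup_pk_by_name callback are hypothetical names.

```python
from typing import Callable, Optional


def resolve_transcription_pk(
    pk: Optional[int],
    name: Optional[str],
    lookup_pk_by_name: Callable[[str], int],
) -> int:
    """Return the primary key to use, preferring an explicit pk over a name."""
    if pk is not None:
        # An explicit primary key wins; the name is ignored.
        return pk
    if name is not None:
        return lookup_pk_by_name(name)
    raise ValueError("Provide either a transcription pk or a transcription name")


# A stand-in lookup table in place of a real eScriptorium query.
known_layers = {"diplomatic": 11, "normalized": 12}

print(resolve_transcription_pk(7, "diplomatic", known_layers.__getitem__))
print(resolve_transcription_pk(None, "normalized", known_layers.__getitem__))
```

In the first call the name is never looked up, so a misspelled or stale name does no harm as long as the primary key is correct.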

The "diplomatic" transcription is not collated; it is simply passed through to the CollateX output. Only the "normalized" transcription is collated.

escriptorium_collate.collate.CollatexArgs

A (Python) interface for passing arguments to the CollateX command line interface. For more details about the arguments accepted by the CollateX Jar CLI, consult CollateX's documentation.

class CollatexArgs(BaseModel):
  algorithm: Literal["needleman-wunsch", "medite", "dekker"] = "needleman-wunsch"
  distance: int | None
  dot_path: str | None
  format: Literal["tei", "json", "dot", "graphml"] = "json"
  input: str | None
  input_encoding: str | None
  max_collation_size: int | None
  max_parallel_collations: int | None
  output_encoding: str | None
  output: str | None
  tokenized: bool = False
  token_comparator: Literal["equality", "levenshtein"] = "equality"

escriptorium_collate.collate.get_collatex_input

Given two or more Witness instances and a set of CollateX arguments, return the input JSON that will be later passed to CollateX.

from escriptorium_collate.collate import get_collatex_input

collatex_input = get_collatex_input(
  escr=escr, # An EscriptoriumConnector instance
  witnesses=witnesses, # A list of two or more Witness instances to be collated
  collatex_args=collatex_args, # An instance of CollatexArgs
)
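For orientation, CollateX's JSON input format lists the witnesses with an "id" and a list of "tokens", where each token has a text "t" and an optional normalized form "n". The sketch below builds such a structure by hand. It assumes that the diplomatic text maps to "t" and the normalized text to "n", and that both split into the same number of whitespace-separated tokens. These are illustrative assumptions, not a description of what get_collatex_input does internally, and build_collatex_input is a hypothetical name.

```python
import json


def build_collatex_input(witnesses):
    """Build a CollateX-style JSON input structure from simple witness dictionaries."""
    payload = {"witnesses": []}
    for witness in witnesses:
        diplomatic = witness["diplomatic"].split()
        normalized = witness["normalized"].split()
        if len(diplomatic) != len(normalized):
            raise ValueError(
                f"Witness {witness['siglum']}: token counts differ "
                f"({len(diplomatic)} diplomatic, {len(normalized)} normalized)"
            )
        # "t" is passed through to the output; "n" is what is compared.
        tokens = [{"t": d, "n": n} for d, n in zip(diplomatic, normalized)]
        payload["witnesses"].append({"id": witness["siglum"], "tokens": tokens})
    return payload


# Hand-written sample data for illustration.
sample_witnesses = [
    {"siglum": "A", "diplomatic": "Þe cat", "normalized": "the cat"},
    {"siglum": "B", "diplomatic": "The catt", "normalized": "the cat"},
]

print(json.dumps(build_collatex_input(sample_witnesses), ensure_ascii=False, indent=2))
```

Both sample witnesses share the same normalized tokens, so a collation of this input would align them as agreeing while still carrying the differing diplomatic spellings.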

escriptorium_collate.collate.get_collatex_output

Pass a given instance of CollatexArgs to the CollateX JAR and return the CollateX output.

from escriptorium_collate.collate import get_collatex_output

collatex_output = get_collatex_output(
  collatex_args=collatex_args, # An instance of CollatexArgs
)

In this case, CollatexArgs.input is mandatory; in other words, the CollateX input JSON must be manually passed in.

escriptorium_collate.collate.collate

Run the complete collation pipeline via one function call. See the "Quick Start" section above.

escriptorium_collate.transcription_layers

This module contains helper functions for dealing with the transcription layers of any given eScriptorium document.

escriptorium_collate.transcription_layers.create

Create and initialize an arbitrarily named transcription layer within a given eScriptorium document.

from escriptorium_collate import transcription_layers

transcription_layers.create(
  escr=escr, # EscriptoriumConnector instance
  doc_pk=1, # Primary key of an eScriptorium document (int)
  layer_name="New Layer" # Name of the transcription layer to be created (str)
)

escriptorium_collate.transcription_layers.copy

Copy the content of one transcription layer to another transcription layer in a given eScriptorium document.

from escriptorium_collate import transcription_layers

transcription_layers.copy(
  escr=escr, # EscriptoriumConnector instance
  doc_pk=1, # Primary key of an eScriptorium document (int)
  source_transcription_layer_name="Source Layer", # Name of the transcription layer to be copied (str)
  target_transcription_layer_name="Target Layer", # Name of the transcription layer to be written into (str)
  overwrite=True, # If True, content of the target transcription layer is overwritten (default: False)
)

escriptorium_collate.transcription_layers.get_transcription_pk_by_name

Each transcription layer is assigned a unique identifier (primary key) by eScriptorium, but it is not easy to retrieve the primary key via eScriptorium's user interface. This simple helper function returns the transcription layer's primary key, given its name and the primary key of the document to which it belongs.

from escriptorium_collate import transcription_layers

transcription_layers.get_transcription_pk_by_name(
  escr=escr, # EscriptoriumConnector instance
  doc_pk=1, # Primary key of an eScriptorium document (int)
  transcription_name="Source Layer" # Name of the desired transcription layer (str)
)
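Conceptually, the lookup is a search through a document's transcription layers for a matching name. The following sketch works on a plain list of dictionaries with "pk" and "name" keys. That data shape and the function name find_transcription_pk are assumptions made for illustration, not the library's implementation.

```python
def find_transcription_pk(transcriptions, name):
    """Return the pk of the single transcription layer with the given name."""
    matches = [layer["pk"] for layer in transcriptions if layer["name"] == name]
    if not matches:
        raise LookupError(f"No transcription layer named {name!r}")
    if len(matches) > 1:
        # Names are not guaranteed unique here, so refuse to guess.
        raise LookupError(f"Several transcription layers are named {name!r}: {matches}")
    return matches[0]


# Hand-written sample data for illustration.
sample_layers = [
    {"pk": 11, "name": "diplomatic"},
    {"pk": 12, "name": "normalized"},
]

print(find_transcription_pk(sample_layers, "normalized"))
```

Raising an error for a missing or ambiguous name keeps a wrong primary key from being silently passed on to a Witness.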

License

escriptorium-collate is distributed under the terms of the MIT license.
