eScriptorium Collate

A Python library for collating eScriptorium documents. This is a pre-release version in public alpha.


Installation

Requirements

  • Python 3
  • Java Runtime Environment (< 15)

Vendored Binaries

This package uses the CollateX Java archive collatex-tools-1.7.1.jar. The JAR file is bundled with the package, so there is no need to download it separately. However, your system must have a working Java Runtime Environment (version < 15) accessible under JAVA_HOME. See the CollateX documentation for more information about CollateX.
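Because the Java version constraint is easy to overlook, it can help to check it before running a collation. The following is a minimal sketch, not part of this package: the helper names `java_command`, `parse_java_major`, and `installed_java_major` are hypothetical, and the parsing assumes the usual `java -version` banner format.

```python
import os
import re
import subprocess


def java_command(env=os.environ) -> str:
    """Return the java executable under JAVA_HOME, or plain "java" if unset."""
    home = env.get("JAVA_HOME")
    return os.path.join(home, "bin", "java") if home else "java"


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (e.g. "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def installed_java_major() -> int:
    """Run `java -version` and return the major version it reports."""
    result = subprocess.run(
        [java_command(), "-version"], capture_output=True, text=True, check=True
    )
    # `java -version` prints its banner to stderr, not stdout.
    return parse_java_major(result.stderr or result.stdout)
```

Calling `installed_java_major()` and comparing the result against 15 gives an early, readable failure instead of an error raised from inside the JAR.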

Virtual Environment

Before installing the package, it is a good idea to create a Python virtual environment:

pip install virtualenv
virtualenv -p python3 venv
source venv/bin/activate

Alternatively:

python3 -m venv venv
source venv/bin/activate

For a more detailed guide, see the documentation on Python virtual environments.

Install

Once the virtual environment is activated, install the package:

pip install "escriptorium-connector @ git+https://gitlab.com/oeshera/escriptorium_python_connector"
pip install escriptorium-collate

Note: This package depends on escriptorium-connector. However, the version of escriptorium-connector currently published on PyPI is not up to date with the latest development version of eScriptorium, so it may fail depending on the version of eScriptorium you are using. As a temporary solution, the fork of escriptorium-connector installed above can be used; it works in most cases.

Quick Start

Instantiate the eScriptorium Connector

import os

from dotenv import load_dotenv
from escriptorium_connector import EscriptoriumConnector

load_dotenv(override=True)
url = str(os.getenv("ESCRIPTORIUM_URL"))
username = str(os.getenv("ESCRIPTORIUM_USERNAME"))
password = str(os.getenv("ESCRIPTORIUM_PASSWORD"))
api_key = os.getenv("ESCRIPTORIUM_API_KEY")

if api_key:
    escr = EscriptoriumConnector(url, api_key=str(api_key))
else:
    escr = EscriptoriumConnector(url, username, password)

The .env file should look like this:

ESCRIPTORIUM_URL=your_escriptorium_url
ESCRIPTORIUM_API_KEY=your_escriptorium_api_key
ESCRIPTORIUM_USERNAME=your_escriptorium_username
ESCRIPTORIUM_PASSWORD=your_escriptorium_password

ESCRIPTORIUM_URL is always required. For authentication, provide either ESCRIPTORIUM_API_KEY or both ESCRIPTORIUM_USERNAME and ESCRIPTORIUM_PASSWORD.
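The rule for choosing between the API key and the username/password pair can be made explicit before the connector is created. This is a minimal sketch under stated assumptions: `resolve_credentials` is a hypothetical helper, not part of escriptorium-collate or escriptorium-connector, and raising `ValueError` on incomplete settings is an illustrative choice.

```python
from typing import Mapping


def resolve_credentials(env: Mapping[str, str]) -> dict:
    """Pick the connection settings from an environment-like mapping.

    Hypothetical helper: prefers the API key and falls back to
    username and password, failing early if neither is complete.
    """
    url = env.get("ESCRIPTORIUM_URL")
    if not url:
        raise ValueError("ESCRIPTORIUM_URL is required")

    api_key = env.get("ESCRIPTORIUM_API_KEY")
    if api_key:
        return {"url": url, "api_key": api_key}

    username = env.get("ESCRIPTORIUM_USERNAME")
    password = env.get("ESCRIPTORIUM_PASSWORD")
    if username and password:
        return {"url": url, "username": username, "password": password}

    raise ValueError(
        "set ESCRIPTORIUM_API_KEY, or both ESCRIPTORIUM_USERNAME "
        "and ESCRIPTORIUM_PASSWORD"
    )
```

With `creds = resolve_credentials(os.environ)`, the connector can then be built as in the snippet above, for example `EscriptoriumConnector(creds["url"], api_key=creds["api_key"])` when an API key is present. Unlike wrapping `os.getenv` in `str()`, this avoids silently passing the literal string "None" for a missing variable.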

Instantiate the Witnesses that will be Collated

from escriptorium_collate.collate import Witness

witnesses = [
    Witness(
        doc_pk=1,
        siglum="A",
        diplomatic_transcription_name="diplomatic",
        normalized_transcription_name="normalized",
    ),
    Witness(
        doc_pk=2,
        siglum="B",
        diplomatic_transcription_name="diplomatic",
        normalized_transcription_name="normalized",
    ),
    Witness(
        doc_pk=3,
        siglum="C",
        diplomatic_transcription_name="diplomatic",
        normalized_transcription_name="normalized",
    ),
]

Instantiate the Arguments to be Passed to CollateX

from escriptorium_collate.collate import CollatexArgs

collatex_args = CollatexArgs()

Return the CollateX Results as a Python Dictionary

from escriptorium_collate.collate import collate

collatex_output = collate(escr=escr, witnesses=witnesses, collatex_args=collatex_args)
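Once the dictionary is returned, a common next step is to find the places where the witnesses disagree. The sketch below assumes that `collatex_output` follows CollateX's JSON alignment-table layout (a `"witnesses"` list of sigla and a `"table"` list of columns, each holding one list of tokens per witness); this README does not confirm the exact shape, so treat it as an illustration. The function names `cell_text` and `variant_columns` are hypothetical.

```python
from __future__ import annotations


def cell_text(cell) -> str:
    """Flatten one alignment cell (a list of tokens, possibly empty) to text."""
    parts = []
    for token in cell or []:
        if isinstance(token, dict):
            # Token objects: prefer the normalized form "n", else the reading "t".
            parts.append(str(token.get("n", token.get("t", ""))).strip())
        else:
            parts.append(str(token).strip())
    return " ".join(part for part in parts if part)


def variant_columns(collatex_output: dict) -> list[dict]:
    """Return the alignment columns in which the witnesses disagree."""
    sigla = collatex_output["witnesses"]
    variants = []
    for index, column in enumerate(collatex_output["table"]):
        readings = {siglum: cell_text(cell) for siglum, cell in zip(sigla, column)}
        if len(set(readings.values())) > 1:
            variants.append({"column": index, "readings": readings})
    return variants
```

An empty reading in the result indicates that the witness has no text at that position.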

API

This package contains two modules: escriptorium_collate/collate.py and escriptorium_collate/transcription_layers.py.

escriptorium_collate.collate

escriptorium_collate.collate.Witness

An interface for defining an eScriptorium document as a witness to be passed to CollateX.

class Witness(BaseModel):
  doc_pk: int # Primary key of an eScriptorium document (int)
  siglum: str # Arbitrary siglum to be used in the critical apparatus (str)
  diplomatic_transcription_pk: int | None
  diplomatic_transcription_name: str | None
  normalized_transcription_pk: int | None
  normalized_transcription_name: str | None

If diplomatic_transcription_pk is provided, diplomatic_transcription_name is ignored. Likewise, if normalized_transcription_pk is provided, normalized_transcription_name is ignored.
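The precedence rule can be pictured as a small resolution step. The following is only a sketch of the rule as described, not the library's actual code: `resolve_transcription_pk` and its `lookup` parameter are hypothetical, and raising an error when neither value is given is an assumption.

```python
from __future__ import annotations

from typing import Callable, Optional


def resolve_transcription_pk(
    pk: Optional[int],
    name: Optional[str],
    lookup: Callable[[str], int],
) -> int:
    """Return the primary key to use for a transcription layer.

    An explicit primary key wins; the name is only consulted
    (through `lookup`) when no primary key was given.
    """
    if pk is not None:
        return pk
    if name is not None:
        return lookup(name)
    raise ValueError("either a primary key or a name is required")
```

In practice, the `lookup` callable could wrap `transcription_layers.get_transcription_pk_by_name` for a fixed document.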

The "diplomatic" transcription is not collated; rather, it is simply "passed through" to the CollateX output. It is the "normalized" transcription that is collated.
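CollateX's JSON input format supports this distinction directly: each token may carry a `"t"` value (the original reading) and an `"n"` value (the normalized form that is compared during alignment). The sketch below shows one plausible way to pair the two transcriptions; it is not confirmed to be how this package builds its input. The function names are hypothetical, and the example assumes whitespace tokenization with the two layers aligned token by token.

```python
from __future__ import annotations

import json


def witness_to_collatex(siglum: str, diplomatic: str, normalized: str) -> dict:
    """Pair diplomatic and normalized tokens into a CollateX witness object."""
    dip_tokens = diplomatic.split()
    norm_tokens = normalized.split()
    if len(dip_tokens) != len(norm_tokens):
        raise ValueError(
            f"witness {siglum}: {len(dip_tokens)} diplomatic tokens "
            f"but {len(norm_tokens)} normalized tokens"
        )
    return {
        "id": siglum,
        # "t" is passed through to the output; "n" is what gets compared.
        "tokens": [{"t": t, "n": n} for t, n in zip(dip_tokens, norm_tokens)],
    }


def build_collatex_input(witnesses: list[dict]) -> str:
    """Serialize witness objects as a CollateX JSON input document."""
    return json.dumps({"witnesses": witnesses}, ensure_ascii=False)
```

For example, `witness_to_collatex("A", "Ye olde", "the old")` yields two tokens whose `"n"` values are "the" and "old", while the diplomatic spellings remain available under `"t"`.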

escriptorium_collate.collate.CollatexArgs

A Python interface for passing arguments to the CollateX command line interface. For more details about the arguments accepted by the CollateX JAR CLI, consult the CollateX documentation.

class CollatexArgs(BaseModel):
  algorithm: Literal["needleman-wunsch", "medite", "dekker"] = "needleman-wunsch"
  distance: int | None
  dot_path: str | None
  format: Literal["tei", "json", "dot", "graphml"] = "json"
  input: str | None
  input_encoding: str | None
  max_collation_size: int | None
  max_parallel_collations: int | None
  output_encoding: str | None
  output: str | None
  tokenized: bool = False
  token_comparator: Literal["equality", "levenshtein"] = "equality"

escriptorium_collate.collate.get_collatex_input

Given two or more Witness instances and a set of CollateX arguments, return the input JSON that will later be passed to CollateX.

from escriptorium_collate.collate import get_collatex_input

collatex_input = get_collatex_input(
  escr=escr, # An EscriptoriumConnector instance
  witnesses=witnesses, # A list of two or more Witness instances to be collated
  collatex_args=collatex_args, # An instance of CollatexArgs
)

escriptorium_collate.collate.get_collatex_output

Pass a given instance of CollatexArgs to the CollateX JAR and return its output.

from escriptorium_collate.collate import get_collatex_output

collatex_output = get_collatex_output(
  collatex_args=collatex_args, # An instance of CollatexArgs
)

In this case, CollatexArgs.input is mandatory; in other words, the CollateX input JSON must be manually passed in.

escriptorium_collate.collate.collate

Run the complete collation pipeline via one function call. See the "Quick Start" section above.

escriptorium_collate.transcription_layers

This module contains helper functions for dealing with the transcription layers of any given eScriptorium document.

escriptorium_collate.transcription_layers.create

Create and initialize an arbitrarily named transcription layer within a given eScriptorium document.

from escriptorium_collate import transcription_layers

transcription_layers.create(
  escr=escr, # EscriptoriumConnector instance
  doc_pk=1, # Primary key of an eScriptorium document (int)
  layer_name="New Layer" # Name of the transcription layer to be created (str)
)

escriptorium_collate.transcription_layers.copy

Copy the content of one transcription layer to another transcription layer in a given eScriptorium document.

from escriptorium_collate import transcription_layers

transcription_layers.copy(
  escr=escr, # EscriptoriumConnector instance
  doc_pk=1, # Primary key of an eScriptorium document (int)
  source_transcription_layer_name="Source Layer", # Name of the transcription layer to be copied (str)
  target_transcription_layer_name="Target Layer", # Name of the transcription layer to be written into (str)
  overwrite=True, # If True, content of the target transcription layer is overwritten (default: False)
)

escriptorium_collate.transcription_layers.get_transcription_pk_by_name

Each transcription layer is assigned a unique identifier (primary key) by eScriptorium, but it is not easy to retrieve the primary key via eScriptorium's user interface. This simple helper function returns the transcription layer's primary key, given its name and the primary key of the document to which it belongs.

from escriptorium_collate import transcription_layers

transcription_layers.get_transcription_pk_by_name(
  escr=escr, # EscriptoriumConnector instance
  doc_pk=1, # Primary key of an eScriptorium document (int)
  transcription_name="Source Layer" # Name of the desired transcription layer (str)
)

License

escriptorium-collate is distributed under the terms of the MIT license.
