rdgai

Rdgai facilitates the use of LLMs for classifying transitions between variant readings in a Text Encoding Initiative (TEI) XML file containing a critical apparatus. It enables users to define classification categories, manually annotate changes, and use an LLM to automate the classification process. The TEI XML can then be used for phylogenetic analysis of textual traditions using teiphy.

Background on why variants are classified in this way can be found in the “Why use Rdgai?” documentation.

Documentation is available at https://rbturnbull.github.io/rdgai.

Installation

Install using pip:

pip install rdgai

Or install directly from the repository:

pip install git+https://github.com/rbturnbull/rdgai.git

Usage

See all the options with the command:

rdgai --help

Preparation

You first need to prepare a TEI XML file with a critical apparatus.

Define categories in the TEI XML header under <interpGrp type="transcriptional">. For example:

<interpGrp type="transcriptional">
    <interp xml:id="Addition" corresp="#Omission">An addition of a word or words.</interp>
    <interp xml:id="Omission" corresp="#Addition">An omission of a word or words.</interp>
    <interp xml:id="Substitution">A substitution of a word or words.</interp>
</interpGrp>
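To check which categories a file defines before annotating, the header can be read with Python's standard library. The following is a minimal sketch and not part of Rdgai itself: the function name `read_categories` and the embedded sample document are hypothetical, and it assumes the file uses the standard TEI namespace. The `corresp` value is returned as written in the XML (minus the leading `#`) without interpreting it further.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample document wrapping the interpGrp shown above.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <interpGrp type="transcriptional">
      <interp xml:id="Addition" corresp="#Omission">An addition of a word or words.</interp>
      <interp xml:id="Omission" corresp="#Addition">An omission of a word or words.</interp>
      <interp xml:id="Substitution">A substitution of a word or words.</interp>
    </interpGrp>
  </teiHeader>
</TEI>"""


def read_categories(xml_text):
    """Return {category id: {"description": ..., "corresp": ...}} from a TEI string."""
    root = ET.fromstring(xml_text)
    path = f".//{{{TEI_NS}}}interpGrp[@type='transcriptional']/{{{TEI_NS}}}interp"
    categories = {}
    for interp in root.findall(path):
        # Strip the leading "#" from the pointer; None when the attribute is absent.
        corresp = (interp.get("corresp") or "").lstrip("#") or None
        categories[interp.get(XML_ID)] = {
            "description": (interp.text or "").strip(),
            "corresp": corresp,
        }
    return categories


categories = read_categories(SAMPLE)
for name, info in categories.items():
    print(name, "|", info["description"], "|", info["corresp"])
```

For a file on disk, `ET.parse("apparatus.xml").getroot()` could replace `ET.fromstring` inside the function.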

Then use the browser-based graphical user interface (GUI) to classify transitions with buttons or keyboard navigation:

rdgai gui apparatus.xml output.xml

Or export classifications to Excel for collaborative editing:

rdgai export apparatus.xml reading-pairs.xlsx

Edit in Excel and re-import with:

rdgai import-classifications apparatus.xml reading-pairs.xlsx output.xml

More information about preparing the TEI XML file can be found in the Preparation documentation.

Validation

The accuracy of Rdgai depends on the type of text, the categories and their definitions, and the LLM used, so it needs to be validated on each document used with Rdgai. For this purpose, Rdgai comes with a validation tool that makes a proportion of the manual annotations available for use in the prompt and holds back the remainder as ground truth annotations for evaluating the results from Rdgai.
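The idea behind the split can be illustrated with a short sketch. This is not Rdgai's internal implementation: the function `split_annotations`, the fixed seed, and the sample annotations are all hypothetical, and the only assumption is that a given proportion of the manual annotations is made available as prompt examples while the rest is held back for evaluation.

```python
import random


def split_annotations(annotations, proportion=0.5, seed=0):
    """Shuffle the annotations and split them into (prompt examples, ground truth)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(annotations)
    rng.shuffle(shuffled)
    cutoff = round(len(shuffled) * proportion)
    return shuffled[:cutoff], shuffled[cutoff:]


# Hypothetical manual annotations: (first reading, second reading, category).
annotations = [
    ("the word", "the good word", "Addition"),
    ("the good word", "the word", "Omission"),
    ("he said", "he spoke", "Substitution"),
    ("and then", "and then also", "Addition"),
    ("in the house", "in house", "Omission"),
    ("they went", "they came", "Substitution"),
]

prompt_examples, ground_truth = split_annotations(annotations, proportion=0.5)
print(len(prompt_examples), "available for the prompt;", len(ground_truth), "held back")
```

Because the held-back annotations never appear in the prompt, agreement with them gives an estimate of how the LLM performs on readings it has not seen.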

To run the validation tool, use the following command:

rdgai validate apparatus.xml output.xml --report output.html --proportion 0.5 --llm claude-3-5-sonnet-20241022 --examples 20

The HTML report shows the accuracy, precision, recall, and F1 scores, a confusion matrix, and the detailed classifications (correct and incorrect). The LLM then suggests ways to clarify the category definitions and alerts the user to any inconsistencies in the ground truth annotations.
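For readers unfamiliar with these metrics, the sketch below shows how they are conventionally computed from paired ground truth and predicted labels. It is a generic illustration with hypothetical data and function names, not the code Rdgai uses to build its report, and the report may aggregate the per-category scores differently.

```python
from collections import Counter


def confusion_matrix(truth, predicted):
    """Count how often each (true category, predicted category) pair occurs."""
    return Counter(zip(truth, predicted))


def accuracy(truth, predicted):
    """Fraction of items whose predicted category matches the ground truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def per_category_scores(truth, predicted):
    """Precision, recall and F1 for each category."""
    counts = confusion_matrix(truth, predicted)
    scores = {}
    for label in sorted(set(truth) | set(predicted)):
        tp = counts[(label, label)]
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical labels: one Addition was misclassified as an Omission.
truth = ["Addition", "Addition", "Omission", "Substitution"]
predicted = ["Addition", "Omission", "Omission", "Substitution"]

matrix = confusion_matrix(truth, predicted)
overall = accuracy(truth, predicted)
scores = per_category_scores(truth, predicted)
print(overall)
print(scores["Addition"])
```

In this example, the single misclassification lowers recall for Addition and precision for Omission, which is the kind of pattern the confusion matrix makes visible.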

More information about validating the results of Rdgai for your TEI XML file can be found in the Validation documentation.

Classification

After validating, you can classify the unclassified reading changes using the following command:

rdgai classify apparatus.xml output.xml --llm claude-3-5-sonnet-20241022 --examples 20

View the output TEI XML in the Rdgai GUI with:

rdgai gui output.xml --inplace

More information about making automated classifications using Rdgai can be found in the Classification documentation.
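Since the validation results only apply to the settings that were validated, it can help to keep the LLM and the number of examples identical between the two commands. The sketch below is one way to do that from Python. It only assembles the command lines shown above and prints them. The helper `shared_options` and the constants are hypothetical names, and every flag is copied from the commands in this README.

```python
import shlex

# Settings to keep identical between validation and classification.
LLM = "claude-3-5-sonnet-20241022"
EXAMPLES = 20


def shared_options(llm=LLM, examples=EXAMPLES):
    """Options passed to both `rdgai validate` and `rdgai classify`."""
    return ["--llm", llm, "--examples", str(examples)]


validate_cmd = [
    "rdgai", "validate", "apparatus.xml", "output.xml",
    "--report", "output.html", "--proportion", "0.5",
    *shared_options(),
]
classify_cmd = ["rdgai", "classify", "apparatus.xml", "output.xml", *shared_options()]

# Print the commands for inspection instead of running them.
validate_line = shlex.join(validate_cmd)
classify_line = shlex.join(classify_cmd)
print(validate_line)
print(classify_line)
```

To run a command rather than print it, each list could be passed to `subprocess.run(cmd, check=True)`, assuming `rdgai` is installed and on the `PATH`.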

Credits

Robert Turnbull

For more information contact: <robert.turnbull@unimelb.edu.au>

The article about Rdgai will be published in the near future. For now, please cite the repository and some of the following articles:

  • Robert Turnbull. “Transmission History.” Pages 156–204 in Codex Sinaiticus Arabicus and Its Family: A Bayesian Approach. Vol. 66. New Testament Tools, Studies and Documents. Brill, 2025. https://doi.org/10.1163/9789004704619_007

  • Joey McCollum and Robert Turnbull. “teiphy: A Python Package for Converting TEI XML Collations to NEXUS and Other Formats.” Journal of Open Source Software 7, no. 80 (2022): 4879. https://doi.org/10.21105/joss.04879

  • Joey McCollum and Robert Turnbull. “Using Bayesian Phylogenetics to Infer Manuscript Transmission History.” Digital Scholarship in the Humanities 39, no. 1 (2024): 258–79. https://doi.org/10.1093/llc/fqad089
