citable_corpus

A Python library for working with a corpus of texts canonically citable by CTS URN references.

Overview

citable_corpus lets you work with texts that are citable by CTS (Canonical Text Services) URNs.

Features

  • Multiple input formats: Create corpora from delimited strings, or from files or URLs containing data in CEX format.
  • URN-aware retrieval: Queries by URN respect the work and passage hierarchies, and also accept passage ranges.
  • CEX support: Native support for the CEX (CITE Exchange) format.
  • Type-safe: Built with Pydantic for robust data validation.

Installation

pip install citable_corpus

Quick Start

Creating a Corpus

From a delimited string

from citable_corpus import CitableCorpus

text = """urn:cts:latinLit:phi0959.phi006:1.1|Lorem ipsum
urn:cts:latinLit:phi0959.phi006:1.2|Dolor sit amet."""

corpus = CitableCorpus.from_string(text)
print(f"Loaded {len(corpus.passages)} passages")
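
To make the delimited format concrete, here is a minimal standard-library sketch of how one-passage-per-line input can be split into URN and text. It is an illustration only, not the library's implementation: `Passage` and `parse_delimited` are hypothetical names, and the choice to split at the first delimiter is an assumption.

```python
from typing import NamedTuple


class Passage(NamedTuple):
    """Hypothetical stand-in for CitablePassage: a URN string plus its text."""

    urn: str
    text: str


def parse_delimited(src: str, delimiter: str = "|") -> list[Passage]:
    """Split each non-blank line into a URN and its text at the first delimiter."""
    passages = []
    for line in src.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        urn, text = line.split(delimiter, 1)
        passages.append(Passage(urn.strip(), text.strip()))
    return passages


sample = """urn:cts:latinLit:phi0959.phi006:1.1|Lorem ipsum
urn:cts:latinLit:phi0959.phi006:1.2|Dolor sit amet."""

parsed = parse_delimited(sample)
print(f"Parsed {len(parsed)} passages")
```

Splitting only at the first delimiter (an assumption of this sketch) keeps any later occurrence of the delimiter inside the passage text.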

From a CEX file

corpus = CitableCorpus.from_cex_file("path/to/file.cex")

From a URL

url = "https://example.com/corpus.cex"
corpus = CitableCorpus.from_cex_url(url)

Working with Passages

Each passage in a corpus is a CitablePassage object with a URN and text:

passage = corpus.passages[0]
print(passage.urn)   # The CtsUrn object
print(passage.text)  # The text content
print(str(passage))  # "urn:...: text"

Retrieving Passages

Retrieve a single passage by exact URN

from urn_citation import CtsUrn

ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:pr.1")
results = corpus.retrieve(ref)

Retrieve all passages from a work section

# Get all passages from the preface (pr.) section
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:pr")
results = corpus.retrieve(ref)

Retrieve all passages from a work

# Get all passages from the work (note the trailing colon)
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:")
results = corpus.retrieve(ref)
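
The hierarchy-aware matching used in the three retrievals above can be pictured with plain strings. The following is a standard-library sketch under stated assumptions, not the library's own code: `split_urn` and `urn_contains` are hypothetical helpers, and the sketch ignores ranges and subreferences. A query matches a passage when its work components and passage components are each a leading run of the candidate's components.

```python
def split_urn(urn: str) -> tuple[str, list[str], list[str]]:
    """Split a CTS URN string into namespace, work components, passage components."""
    fields = urn.split(":")
    # fields: "urn", "cts", namespace, work hierarchy, passage (possibly empty)
    namespace, work = fields[2], fields[3]
    passage = fields[4] if len(fields) > 4 else ""
    passage_parts = passage.split(".") if passage else []
    return namespace, work.split("."), passage_parts


def urn_contains(query: str, candidate: str) -> bool:
    """True if the candidate URN falls under the query URN's work and passage."""
    q_ns, q_work, q_psg = split_urn(query)
    c_ns, c_work, c_psg = split_urn(candidate)
    return (
        q_ns == c_ns
        and c_work[: len(q_work)] == q_work  # work hierarchy is a prefix
        and c_psg[: len(q_psg)] == q_psg  # passage hierarchy is a prefix
    )


base = "urn:cts:latinLit:stoa1263.stoa001.hc:"
print(urn_contains(base + "pr", base + "pr.1"))  # section contains passage
print(urn_contains(base, base + "1.1"))  # whole work contains passage
print(urn_contains(base + "pr.1", base + "pr.10"))  # compared by component
```

Comparing dot-separated components rather than raw string prefixes is what keeps `pr.1` from accidentally matching `pr.10`.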

Retrieve a range of passages

# Get passages from pr.1 through pr.5
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:pr.1-pr.5")
results = corpus.retrieve_range(ref)

# Or use retrieve(), which automatically detects ranges
results = corpus.retrieve(ref)
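
One way to think about range retrieval is as a slice over passages in document order. The sketch below works on bare passage references and uses only the standard library; `contains` and `select_range` are hypothetical names, and the behavior shown (first match of the range start through last match of the range end) is an assumption about how ranges could be resolved, not a description of the library's internals.

```python
def contains(ref: str, passage: str) -> bool:
    """True if the passage reference falls under ref, compared by component."""
    ref_parts = ref.split(".")
    return passage.split(".")[: len(ref_parts)] == ref_parts


def select_range(passages: list[str], range_ref: str) -> list[str]:
    """Return passages from the first match of the start through the last match of the end."""
    start, end = range_ref.split("-", 1)
    starts = [i for i, p in enumerate(passages) if contains(start, p)]
    ends = [i for i, p in enumerate(passages) if contains(end, p)]
    if not starts or not ends:
        return []  # one end of the range is not in the corpus
    return passages[starts[0] : ends[-1] + 1]


# Passage references in document order
refs = ["pr.1", "pr.2", "pr.3", "pr.4", "pr.5", "pr.6", "1.1"]
selected = select_range(refs, "pr.2-pr.4")
print(selected)
```

Because each end of the range is matched by containment, a range such as `pr-1` in this sketch would run from the first `pr` passage through the last passage of section `1`.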

API Reference

CitableCorpus

The main class for working with a corpus of citable texts.

Class Methods:

  • from_string(s: str, delimiter: str = "|") - Create from delimited text
  • from_cex_file(f: str, delimiter: str = "|") - Create from a CEX file
  • from_cex_url(url: str, delimiter: str = "|") - Create from a URL

Instance Methods:

  • retrieve(ref: CtsUrn) - Retrieve passages matching a URN reference
  • retrieve_range(ref: CtsUrn) - Retrieve passages in a URN range
  • len() - Get the number of passages in the corpus

Attributes:

  • passages: List[CitablePassage] - The list of passages in the corpus

CitablePassage

Represents a single citable passage of text.

Class Methods:

  • from_string(src: str, delimiter: str = "|") - Create from a delimited string

Attributes:

  • urn: CtsUrn - The CTS URN identifying this passage
  • text: str - The text content of the passage

Examples

Filtering and Processing

# Find all passages containing a specific word
matches = [p for p in corpus.passages if "Zeus" in p.text]

# Get URNs of all passages
urns = [p.urn for p in corpus.passages]

# Count passages by work
from collections import Counter
works = Counter(p.urn.work for p in corpus.passages)
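
If you only have URN strings rather than CtsUrn objects, the same kind of grouping can be sketched with the standard library. This assumes the usual CTS URN layout of five colon-separated fields, with a dot-separated work hierarchy of text group, work, version, and exemplar; `UrnParts` and `parse_cts_urn` are hypothetical names and are not part of citable_corpus or urn_citation.

```python
from collections import Counter
from typing import NamedTuple, Optional


class UrnParts(NamedTuple):
    """Hypothetical container for the components of a CTS URN string."""

    namespace: str
    text_group: str
    work: Optional[str]
    version: Optional[str]
    exemplar: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> UrnParts:
    """Split a CTS URN string into its components; absent levels become None."""
    fields = urn.split(":")
    if len(fields) != 5 or fields[:2] != ["urn", "cts"]:
        raise ValueError(f"Not a CTS URN: {urn}")
    hierarchy = fields[3].split(".")
    if len(hierarchy) > 4 or not hierarchy[0]:
        raise ValueError(f"Bad work hierarchy: {fields[3]}")
    padded = hierarchy + [None] * (4 - len(hierarchy))
    return UrnParts(fields[2], *padded, fields[4] or None)


urn_strings = [
    "urn:cts:latinLit:phi0959.phi006:1.1",
    "urn:cts:latinLit:phi0959.phi006:1.2",
    "urn:cts:latinLit:stoa1263.stoa001.hc:pr.1",
]
# Count passages per text group and work
by_work = Counter(
    f"{parts.text_group}.{parts.work}" for parts in map(parse_cts_urn, urn_strings)
)
print(by_work)
```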

Working with CEX Data

The library supports the CEX (CITE Exchange) format, commonly used in digital classics:

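from citable_corpus import CitableCorpus
from urn_citation import CtsUrn

# Load a CEX file with Hyginus fables
corpus = CitableCorpus.from_cex_file("hyginus.cex")

# Retrieve text content of a specific passage
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:1pr.1")
results = corpus.retrieve(ref)
if results:  # guard against a URN that matches nothing
    print(results[0].text)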
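
For readers new to CEX, the sketch below shows roughly what a loader has to do before it can build passages. It assumes the usual CEX conventions, namely that blocks are introduced by a `#!` label, that text passages live in a `#!ctsdata` block, and that lines starting with `//` are comments. `cts_data_lines` is a hypothetical helper written with the standard library only, and the sample passage text is a placeholder, not real Hyginus.

```python
def cts_data_lines(cex: str) -> list[str]:
    """Collect the data lines of every #!ctsdata block in a CEX string."""
    lines = []
    in_cts_block = False
    for raw in cex.splitlines():
        line = raw.strip()
        if line.startswith("#!"):
            # A block label starts a new block; only ctsdata is of interest here
            in_cts_block = line == "#!ctsdata"
            continue
        if in_cts_block and line and not line.startswith("//"):
            lines.append(line)
    return lines


sample_cex = """#!cexversion
3.0

#!ctsdata
// placeholder passages for illustration
urn:cts:latinLit:stoa1263.stoa001.hc:pr.1|First placeholder passage
urn:cts:latinLit:stoa1263.stoa001.hc:pr.2|Second placeholder passage
"""

data = cts_data_lines(sample_cex)
print(f"Found {len(data)} data lines")
```

Each collected line is then an ordinary delimited URN and text pair of the kind accepted by `CitableCorpus.from_string`.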

Requirements

  • Python >= 3.14
  • pydantic
  • urn-citation >= 0.4.1
  • cite-exchange

Development

Running Tests

python -m unittest discover tests

or with uv from the project root:

uv run pytest

License

See the LICENSE file for details.

Download files

Source Distribution

citable_corpus-0.1.0.tar.gz (116.6 kB view details)

Uploaded Source

Built Distribution

citable_corpus-0.1.0-py3-none-any.whl (17.3 kB view details)

Uploaded Python 3

File details

Details for the file citable_corpus-0.1.0.tar.gz.

File metadata

  • Download URL: citable_corpus-0.1.0.tar.gz
  • Size: 116.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for citable_corpus-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4c66cd85dc19218e3f271a68128d80e972ce89ea710dc21a5ad0247282fd1a1b
MD5 3d6b3b36ca7653c932d54fed4e4b827a
BLAKE2b-256 e2ed941a798f42d3ad1264ae28d45d185396a017b2606bb793da4250ab6d67b3

Provenance

The following attestation bundles were made for citable_corpus-0.1.0.tar.gz:

Publisher: publish.yml on neelsmith/citable_corpus

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file citable_corpus-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: citable_corpus-0.1.0-py3-none-any.whl
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for citable_corpus-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ad8b2a50dfa88fea1f53020c63c013b12670f6916e78b1641ed77c4c53bb89b7
MD5 ad736ac4914feeb47e9649e799f11f94
BLAKE2b-256 97dca0d3bb9094e65e50c1fafe0d3a95064641b639627701a769fa3d10a4a03e

Provenance

The following attestation bundles were made for citable_corpus-0.1.0-py3-none-any.whl:

Publisher: publish.yml on neelsmith/citable_corpus
