A Python library for working with a corpus of texts canonically citable by CtsUrn.
citable_corpus
A Python library for working with a corpus of texts canonically citable by CTS URN references.
Overview
citable_corpus lets you work with texts that are citable by CTS (Canonical Text Services) URNs.
Features
- Multiple input formats: Create corpora from delimited strings, or from files or URLs with data in CEX format.
- Retrieval based on URN logic: Querying passages by URN recognizes work and passage hierarchies, as well as passage ranges.
- CEX support: Native support for the CEX (CITE Exchange) format.
- Type-safe: Built with Pydantic for robust data validation.
Installation
pip install citable_corpus
Quick Start
Creating a Corpus
From a delimited string
from citable_corpus import CitableCorpus
text = """urn:cts:latinLit:phi0959.phi006:1.1|Lorem ipsum
urn:cts:latinLit:phi0959.phi006:1.2|Dolor sit amet."""
corpus = CitableCorpus.from_string(text)
print(f"Loaded {len(corpus.passages)} passages")
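To make the expected input concrete, here is a minimal standard-library sketch of how a two-column delimited string can be split into URN/text pairs. It does not use citable_corpus and is not the library's implementation; `split_passages` is a hypothetical helper name, and splitting on the first delimiter only is an assumption.

```python
def split_passages(src: str, delimiter: str = "|") -> list[tuple[str, str]]:
    """Split delimited text into (urn, text) pairs, one pair per non-empty line."""
    pairs = []
    for line in src.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first delimiter only, so the text may itself contain it.
        urn, sep, text = line.partition(delimiter)
        if not sep:
            raise ValueError(f"No delimiter {delimiter!r} in line: {line!r}")
        pairs.append((urn, text))
    return pairs


text = """urn:cts:latinLit:phi0959.phi006:1.1|Lorem ipsum
urn:cts:latinLit:phi0959.phi006:1.2|Dolor sit amet."""

pairs = split_passages(text)
for urn, passage_text in pairs:
    print(urn, "->", passage_text)
```

Each line carries exactly one passage: the URN on the left of the delimiter and the text on the right.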
From a CEX file
corpus = CitableCorpus.from_cex_file("path/to/file.cex")
From a URL
url = "https://example.com/corpus.cex"
corpus = CitableCorpus.from_cex_url(url)
Working with Passages
Each passage in a corpus is a CitablePassage object with a URN and text:
passage = corpus.passages[0]
print(passage.urn) # The CtsUrn object
print(passage.text) # The text content
print(str(passage)) # "urn:...: text"
Retrieving Passages
Retrieve a single passage by exact URN
from urn_citation import CtsUrn
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:pr.1")
results = corpus.retrieve(ref)
Retrieve all passages from a work section
# Get all passages from the preface (pr.) section
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:pr")
results = corpus.retrieve(ref)
Retrieve all passages from a work
# Get all passages from the work (note the trailing colon)
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:")
results = corpus.retrieve(ref)
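The three retrieval cases above (exact passage, section, whole work) all rest on hierarchical containment. The sketch below illustrates that idea on plain URN strings with the standard library only. It is an assumption about the general logic, not the code that citable_corpus or urn_citation actually runs, and `split_urn` and `contains` are hypothetical names.

```python
def split_urn(urn: str) -> tuple[str, list[str], list[str]]:
    """Split a CTS URN string into namespace, work parts, and passage parts."""
    fields = urn.split(":")
    if len(fields) != 5 or fields[:2] != ["urn", "cts"]:
        raise ValueError(f"Not a CTS URN: {urn!r}")
    work_parts = fields[3].split(".")
    # An empty passage component (trailing colon) yields no passage parts.
    passage_parts = fields[4].split(".") if fields[4] else []
    return fields[2], work_parts, passage_parts


def contains(ref: str, candidate: str) -> bool:
    """True if `candidate` is equal to, or nested inside, `ref`."""
    ref_ns, ref_work, ref_psg = split_urn(ref)
    cand_ns, cand_work, cand_psg = split_urn(candidate)
    # Compare whole dot-separated parts so that "pr.1" does not match "pr.10".
    return (
        ref_ns == cand_ns
        and cand_work[: len(ref_work)] == ref_work
        and cand_psg[: len(ref_psg)] == ref_psg
    )


base = "urn:cts:latinLit:stoa1263.stoa001.hc:"
urns = [base + p for p in ("pr.1", "pr.2", "pr.10", "1.1")]

exact = [u for u in urns if contains(base + "pr.1", u)]
section = [u for u in urns if contains(base + "pr", u)]
whole_work = [u for u in urns if contains(base, u)]
```

Comparing dot-separated parts rather than raw string prefixes is the design choice that keeps `pr.1` from accidentally matching `pr.10`.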
Retrieve a range of passages
# Get passages from pr.1 through pr.5
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:pr.1-pr.5")
results = corpus.retrieve_range(ref)
# Or use retrieve(), which automatically detects ranges
results = corpus.retrieve(ref)
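A range reference names a first and a last passage, and the result is everything between them in document order. The following standard-library sketch shows that idea on a list of URN strings. It assumes the list is already in document order and that both endpoints appear in it exactly; `select_range` is a hypothetical helper, not part of citable_corpus.

```python
def select_range(urns: list[str], range_ref: str) -> list[str]:
    """Return the URNs from the range start through the range end, inclusive."""
    # Everything before the last colon identifies the work; the rest is the range.
    work, _, passage = range_ref.rpartition(":")
    start, sep, end = passage.partition("-")
    if not sep:
        raise ValueError(f"Not a range reference: {range_ref!r}")
    # list.index raises ValueError if an endpoint is missing from the list.
    first = urns.index(f"{work}:{start}")
    last = urns.index(f"{work}:{end}")
    return urns[first : last + 1]


base = "urn:cts:latinLit:stoa1263.stoa001.hc:"
ordered = [f"{base}pr.{n}" for n in range(1, 7)]  # pr.1 through pr.6

selected = select_range(ordered, base + "pr.1-pr.5")
print(len(selected))
```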
API Reference
CitableCorpus
The main class for working with a corpus of citable texts.
Class Methods:
- from_string(s: str, delimiter: str = "|") - Create from delimited text
- from_cex_file(f: str, delimiter: str = "|") - Create from a CEX file
- from_cex_url(url: str, delimiter: str = "|") - Create from a URL
Instance Methods:
- retrieve(ref: CtsUrn) - Retrieve passages matching a URN reference
- retrieve_range(ref: CtsUrn) - Retrieve passages in a URN range
- len() - Get the number of passages in the corpus
Attributes:
- passages: List[CitablePassage] - The list of passages in the corpus
CitablePassage
Represents a single citable passage of text.
Class Methods:
- from_string(src: str, delimiter: str = "|") - Create from a delimited string
Attributes:
- urn: CtsUrn - The CTS URN identifying this passage
- text: str - The text content of the passage
Examples
Filtering and Processing
# Find all passages containing a specific word
matches = [p for p in corpus.passages if "Zeus" in p.text]
# Get URNs of all passages
urns = [p.urn for p in corpus.passages]
# Count passages by work
from collections import Counter
works = Counter(p.urn.work for p in corpus.passages)
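Counting by work depends on how much of the work hierarchy (text group, work, version) is used as the key. The sketch below shows that choice on plain URN strings, using only the standard library. `work_key` and its `depth` parameter are hypothetical and are not attributes of CtsUrn.

```python
from collections import Counter


def work_key(urn: str, depth: int = 2) -> str:
    """Return the first `depth` dot-separated parts of the URN's work component."""
    work_component = urn.split(":")[3]
    return ".".join(work_component.split(".")[:depth])


sample_urns = [
    "urn:cts:latinLit:phi0959.phi006:1.1",
    "urn:cts:latinLit:phi0959.phi006:1.2",
    "urn:cts:latinLit:stoa1263.stoa001.hc:pr.1",
]

# depth=2 groups by text group and work, ignoring any version identifier.
by_work = Counter(work_key(u) for u in sample_urns)
# depth=1 groups by text group only.
by_group = Counter(work_key(u, depth=1) for u in sample_urns)
```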
Working with CEX Data
The library supports the CEX (CITE Exchange) format, commonly used in digital classics:
# Load a CEX file with Hyginus fables
corpus = CitableCorpus.from_cex_file("hyginus.cex")
# Retrieve text content of a specific passage
ref = CtsUrn.from_string("urn:cts:latinLit:stoa1263.stoa001.hc:1pr.1")
psg = corpus.retrieve(ref)[0]
print(psg.text)
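For readers unfamiliar with CEX, the sketch below shows roughly what the text-bearing part of a CEX source looks like and how its passage lines could be pulled out by hand. It assumes the usual CEX conventions (blocks introduced by a `#!` label line, `//` comment lines, and a `ctsdata` block holding `urn|text` lines). The sample content and the `ctsdata_lines` helper are hypothetical; the cite-exchange package remains the authoritative parser.

```python
def ctsdata_lines(cex: str) -> list[str]:
    """Collect the data lines of every ctsdata block in a CEX string."""
    lines = []
    in_ctsdata = False
    for raw in cex.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        if line.startswith("#!"):
            # A label line opens a new block; only ctsdata blocks are kept.
            in_ctsdata = line[2:].strip() == "ctsdata"
            continue
        if in_ctsdata:
            lines.append(line)
    return lines


# Hypothetical sample; the passage texts are placeholders, not real data.
sample_cex = """#!cexversion
3.0

#!ctsdata
// Placeholder passages for illustration
urn:cts:latinLit:stoa1263.stoa001.hc:pr.1|First sample passage.
urn:cts:latinLit:stoa1263.stoa001.hc:pr.2|Second sample passage.
"""

data = ctsdata_lines(sample_cex)
print(len(data))
```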
Requirements
- Python >= 3.14
- pydantic
- urn-citation >= 0.4.1
- cite-exchange
Development
Running Tests
python -m unittest discover tests
or with uv from the project root:
uv run pytest
License
See the LICENSE file for details.
Related Projects
- urn-citation - Python implementation of CTS URNs
- cite-exchange - Python library for CEX format