Loci Similes
LociSimiles is a Python package for finding intertextual links in Latin literature using pre-trained language models.
Basic Usage
# Import paths are assumed from the package layout; adjust them if your installed version differs
from locisimiles.document import Document
from locisimiles.pipeline import ClassificationPipelineWithCandidategeneration, pretty_print

# Load example query and source documents
query_doc = Document("../data/hieronymus_samples.csv")
source_doc = Document("../data/vergil_samples.csv")

# Load the pipeline with pre-trained models
pipeline = ClassificationPipelineWithCandidategeneration(
    classification_name="...",
    embedding_model_name="...",
    device="cpu",
)

# Run the pipeline with the query and source documents
results = pipeline.run(
    query=query_doc,    # Query document
    source=source_doc,  # Source document
    top_k=3,            # Number of top similar candidates to classify
)
pretty_print(results)

# Save results to CSV or JSON
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")
Command-Line Interface
LociSimiles provides a command-line tool for running the pipeline directly from the terminal:
Basic Usage
locisimiles query.csv source.csv -o results.csv
Two-Stage Pipeline Example
locisimiles query.csv source.csv -o results.csv \
    --pipeline two-stage \
    --classification-model julian-schelb/xlm-roberta-large-class-lat-intertext-v1 \
    --embedding-model julian-schelb/multilingual-e5-large-emb-lat-intertext-v1 \
    --top-k 20 \
    --threshold 0.85 \
    --device cuda \
    --verbose
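The two-stage flow configured by these flags can be pictured as retrieval followed by classification. The sketch below is a minimal pure-Python illustration under stated assumptions, not the LociSimiles implementation: `cosine`, `two_stage`, and the `classify` callable are hypothetical names, and the toy vectors stand in for real embeddings. It assumes `--top-k` limits how many candidates are retrieved per query segment and that `--threshold` is applied to the classifier's probability.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage(
    query_vec: Sequence[float],
    source_vecs: Dict[str, Sequence[float]],
    classify: Callable[[str], float],  # hypothetical stand-in for the classifier
    top_k: int = 10,
    threshold: float = 0.85,
) -> List[Tuple[str, float, float, bool]]:
    """Return (source_id, similarity, probability, above_threshold) per candidate."""
    # Stage 1: rank all source segments by similarity and keep the best top_k.
    ranked = sorted(
        ((sid, cosine(query_vec, vec)) for sid, vec in source_vecs.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_k]
    # Stage 2: score only the retrieved candidates, then apply the threshold.
    results = []
    for sid, sim in ranked:
        prob = classify(sid)
        results.append((sid, sim, prob, prob >= threshold))
    return results


# Toy vectors and fixed probabilities, purely for illustration.
sources = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [1.0, 1.0]}
probs = {"s1": 0.9, "s2": 0.1, "s3": 0.5}
rows = two_stage([1.0, 0.0], sources, classify=probs.__getitem__, top_k=2)
for row in rows:
    print(row)
```

Only the retrieved candidates reach the second stage, which is why a larger top-k tends to trade speed for recall.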
Word2Vec Retrieval Example
locisimiles query.csv source.csv -o results.csv \
    --pipeline word2vec-retrieval \
    --word2vec-model-path ./models/latin_w2v_bamman_lemma300_100_1.model \
    --word2vec-interval 2 \
    --word2vec-order-free \
    --top-k 20 \
    --threshold 0.85
Latin BERT Retrieval Example (Gong-Style)
locisimiles query.csv source.csv -o results.csv \
    --pipeline latin-bert-retrieval \
    --latin-bert-model ashleygong03/bamman-burns-latin-bert \
    --top-k 20 \
    --threshold 0.85
If --word2vec-model-path is not provided, the CLI expects a local model at:
models/latin_w2v_bamman_lemma300_100_1.model
Word2Vec mode requires pre-lemmatized input in the same CSV format (seg_id, text).
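As a rough illustration of the expected input layout, the sketch below writes segments to CSV with the two columns named above. The helper `write_segments` is a hypothetical name, the presence of a header row is an assumption, and the lemmatized strings are placeholders rather than project data.

```python
import csv
import io

# Placeholder segments; the lemmatized text is illustrative, not real project data.
segments = [
    ("seg_1", "arma vir cano"),
    ("seg_2", "Troia qui primus ab ora"),
]


def write_segments(handle, rows):
    """Write rows as CSV with a seg_id,text header (the header row is an assumption)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["seg_id", "text"])
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_segments(buffer, segments)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open("query.csv", "w", newline="", encoding="utf-8")` instead of the buffer.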
Options
Input/Output:
- query: Path to the query document CSV file (columns: seg_id, text)
- source: Path to the source document CSV file (columns: seg_id, text)
- -o, --output: Path to the output CSV file for results (required)
Models:
- --classification-model: HuggingFace model for classification (default: xlm-roberta-large-class-lat-intertext-v1)
- --embedding-model: HuggingFace model for embeddings (default: multilingual-e5-large-emb-lat-intertext-v1)
- --word2vec-model-path: Local path to a gensim .model file (Word2Vec pipeline)
- --latin-bert-model: HuggingFace model for the Latin BERT retrieval pipeline (as passed in the Latin BERT example above)
Pipeline Parameters:
- --pipeline: Select two-stage or word2vec-retrieval (default: two-stage); the Latin BERT example above also passes latin-bert-retrieval
- -k, --top-k: Number of top candidates to retrieve per query segment (default: 10)
- -t, --threshold: Decision threshold for output filtering (default: 0.85)
- --word2vec-interval: Max token gap for Word2Vec bigrams (default: 0)
- --word2vec-order-free: Enable order-insensitive Word2Vec bigrams
Device:
- --device: Choose auto, cuda, mps, or cpu (default: auto-detect)
Other:
- -v, --verbose: Enable detailed progress output
- -h, --help: Show help message
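To make the two Word2Vec bigram options more concrete, here is a minimal sketch of how a token gap and order-insensitive matching could shape bigram generation. It is an illustration under assumptions, not the package's code: the function `bigrams` is hypothetical, and it assumes an interval of 0 means adjacent tokens only, with each extra unit allowing one more skipped token.

```python
from typing import List, Tuple


def bigrams(
    tokens: List[str], interval: int = 0, order_free: bool = False
) -> List[Tuple[str, ...]]:
    """Pair each token with later tokens separated by at most `interval` tokens."""
    pairs = []
    for i, left in enumerate(tokens):
        # j - i - 1 is the number of tokens skipped between the two members.
        for j in range(i + 1, min(i + interval + 2, len(tokens))):
            right = tokens[j]
            # Sorting the pair makes ("vir", "arma") and ("arma", "vir") identical.
            pair = tuple(sorted((left, right))) if order_free else (left, right)
            pairs.append(pair)
    return pairs


tokens = ["arma", "vir", "cano"]
print(bigrams(tokens))                            # adjacent pairs only
print(bigrams(tokens, interval=1))                # also pairs that skip one token
print(bigrams(["vir", "arma"], order_free=True))  # order no longer matters
```

Under these assumptions, a wider interval yields more candidate bigrams per segment, and order-free matching lets word-order variation between query and source still produce a shared bigram.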
Output Format
The CLI saves results to a CSV file with the following columns:
- query_id: Query segment identifier
- query_text: Query text content
- source_id: Source segment identifier
- source_text: Source text content
- similarity: Cosine similarity score (0-1)
- probability: Classification confidence (0-1)
- above_threshold: "Yes" if probability ≥ threshold, otherwise "No"
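A results file in this layout can be post-processed with the standard library. The sketch below is one possible approach: the helper `matches` is a hypothetical name, and the sample rows are made-up placeholders that only mirror the documented columns.

```python
import csv
import io

# Placeholder rows in the documented column layout; values are made up for illustration.
sample = """query_id,query_text,source_id,source_text,similarity,probability,above_threshold
q1,example query,s1,example source,0.91,0.97,Yes
q1,example query,s2,another source,0.88,0.40,No
"""


def matches(handle, min_probability=None):
    """Return rows flagged above_threshold, or rows meeting a custom probability cutoff."""
    rows = list(csv.DictReader(handle))
    if min_probability is None:
        return [r for r in rows if r["above_threshold"] == "Yes"]
    return [r for r in rows if float(r["probability"]) >= min_probability]


hits = matches(io.StringIO(sample))
print([(r["query_id"], r["source_id"]) for r in hits])
```

Passing `min_probability` re-filters the saved rows with a different cutoff without re-running the pipeline, provided the rows of interest were written to the file in the first place.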
Optional Gradio GUI
Install the optional GUI extra to experiment with a minimal Gradio front end:
pip install "locisimiles[gui]"
Launch the interface from the command line:
locisimiles-gui
In the GUI, choose Word2Vec Retrieval (Burns-Style) in Pipeline Configuration to enable Word2Vec controls:
- Word2Vec Model Path: local gensim .model file
- Bigram Interval: token gap for bigram generation
- Order-Free Bigrams: optional order-insensitive matching
If the model path is invalid or missing, processing fails with a clear error message.
File details
Details for the file locisimiles-1.6.0.tar.gz.
File metadata
- Download URL: locisimiles-1.6.0.tar.gz
- Upload date:
- Size: 64.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a955b8d9c4f8e020252bca55800e16d220e7ce522c9bb3181adc17936efbc4cc |
| MD5 | 34f83f32c1fc1ca7ffba30ac5cf3ae0e |
| BLAKE2b-256 | 95dc7f33f0b328b3e8fe4bf13e7fbb7551ae8e6d5aa5d6bfb2f5ea05fbfac69f |
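A downloaded archive can be checked against the published SHA256 digest with Python's `hashlib`. This is a minimal sketch: the helpers `sha256_of` and `verify` are hypothetical names, and the local file path is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Published SHA256 for locisimiles-1.6.0.tar.gz, copied from the hash listing above.
EXPECTED = "a955b8d9c4f8e020252bca55800e16d220e7ce522c9bb3181adc17936efbc4cc"
archive = Path("locisimiles-1.6.0.tar.gz")  # assumed local download location
if archive.exists():
    print("OK" if verify(archive, EXPECTED) else "MISMATCH")
```

The same check applies to the wheel by swapping in its file name and digest.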
Provenance
The following attestation bundles were made for locisimiles-1.6.0.tar.gz:
Publisher: release.yml on julianschelb/locisimiles
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: locisimiles-1.6.0.tar.gz
- Subject digest: a955b8d9c4f8e020252bca55800e16d220e7ce522c9bb3181adc17936efbc4cc
- Sigstore transparency entry: 1372645238
- Permalink: julianschelb/locisimiles@625e882560f92353f969e174259a50a7f13cb433
- Branch / Tag: refs/heads/main
- Owner: https://github.com/julianschelb
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@625e882560f92353f969e174259a50a7f13cb433
- Trigger Event: workflow_run
File details
Details for the file locisimiles-1.6.0-py3-none-any.whl.
File metadata
- Download URL: locisimiles-1.6.0-py3-none-any.whl
- Upload date:
- Size: 85.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed5f1f1b668d97a2125596faf744deca88d51eb98be346f0d89a8e7b60a1b3e2 |
| MD5 | e611312e9605e0dbfe86fb2887160fab |
| BLAKE2b-256 | 6aa30a82a493c0c23e6aabaf4550bebd82d4c6a85d10375d36f2133919c20b57 |
Provenance
The following attestation bundles were made for locisimiles-1.6.0-py3-none-any.whl:
Publisher: release.yml on julianschelb/locisimiles
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: locisimiles-1.6.0-py3-none-any.whl
- Subject digest: ed5f1f1b668d97a2125596faf744deca88d51eb98be346f0d89a8e7b60a1b3e2
- Sigstore transparency entry: 1372645343
- Permalink: julianschelb/locisimiles@625e882560f92353f969e174259a50a7f13cb433
- Branch / Tag: refs/heads/main
- Owner: https://github.com/julianschelb
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@625e882560f92353f969e174259a50a7f13cb433
- Trigger Event: workflow_run