Loci Similes

LociSimiles is a Python package for finding intertextual links in Latin literature using pre-trained language models.

Basic Usage

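# Import paths are assumed here; adjust them to match your installed version
from locisimiles.document import Document
from locisimiles.pipeline import ClassificationPipelineWithCandidategeneration, pretty_print

# Load example query and source documents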
query_doc = Document("../data/hieronymus_samples.csv")
source_doc = Document("../data/vergil_samples.csv")

# Load the pipeline with pre-trained models
pipeline = ClassificationPipelineWithCandidategeneration(
    classification_name="...",
    embedding_model_name="...",
    device="cpu",
)

# Run the pipeline with the query and source documents
results = pipeline.run(
    query=query_doc,    # Query document
    source=source_doc,  # Source document
    top_k=3             # Number of top similar candidates to classify
)

pretty_print(results)

# Save results to CSV or JSON
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")

Command-Line Interface

LociSimiles provides a command-line tool for running the pipeline directly from the terminal:

Basic Usage

locisimiles query.csv source.csv -o results.csv
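
The command above expects both inputs as CSV files with the columns seg_id and text. As a minimal sketch, the snippet below writes such a file with Python's standard csv module. The helper name write_segments, the segment IDs, and the temporary directory are hypothetical choices for illustration, and the two Latin lines are placeholders rather than project data.

```python
import csv
import tempfile
from pathlib import Path


def write_segments(path, segments):
    """Write (seg_id, text) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["seg_id", "text"])  # column names expected by the CLI
        writer.writerows(segments)
    return Path(path)


# Hypothetical segment IDs with placeholder texts
query_segments = [
    ("q_1", "arma virumque cano"),
    ("q_2", "forsan et haec olim meminisse iuvabit"),
]

# A temporary directory keeps the sketch from touching existing files
work_dir = Path(tempfile.mkdtemp())
query_path = write_segments(work_dir / "query.csv", query_segments)
print(query_path)
```

The source document can be written the same way and both paths passed to the locisimiles command.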

Two-Stage Pipeline Example

locisimiles query.csv source.csv -o results.csv \
  --pipeline two-stage \
  --classification-model julian-schelb/xlm-roberta-large-class-lat-intertext-v1 \
  --embedding-model julian-schelb/multilingual-e5-large-emb-lat-intertext-v1 \
  --top-k 20 \
  --threshold 0.85 \
  --device cuda \
  --verbose
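
To give a feel for what --top-k controls, here is a small illustrative sketch of ranking source segments by cosine similarity and keeping the best k. It does not reproduce the package's internals: the two-dimensional vectors are toy stand-ins for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(query_vec, source_vecs, k):
    """Return the k (seg_id, score) pairs with the highest similarity."""
    scored = [
        (seg_id, cosine_similarity(query_vec, vec))
        for seg_id, vec in source_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for segment embeddings
source_vecs = {
    "s_1": [1.0, 0.0],
    "s_2": [0.6, 0.8],
    "s_3": [0.0, 1.0],
}
candidates = top_k_candidates([1.0, 0.0], source_vecs, k=2)
print(candidates)
```

In a two-stage setup, only the candidates kept at this step would be passed on to the classification model, so a larger k trades speed for recall.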

Word2Vec Retrieval Example

locisimiles query.csv source.csv -o results.csv \
  --pipeline word2vec-retrieval \
  --word2vec-model-path ./models/latin_w2v_bamman_lemma300_100_1.model \
  --word2vec-interval 2 \
  --word2vec-order-free \
  --top-k 20 \
  --threshold 0.85

Latin BERT Retrieval Example (Gong-Style)

locisimiles query.csv source.csv -o results.csv \
  --pipeline latin-bert-retrieval \
  --latin-bert-model ashleygong03/bamman-burns-latin-bert \
  --top-k 20 \
  --threshold 0.85

If --word2vec-model-path is not provided, the CLI expects a local model at:

models/latin_w2v_bamman_lemma300_100_1.model

Word2Vec mode requires pre-lemmatized input in the same CSV format (seg_id, text).
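
When scripting around the CLI, it can help to check the model file before starting a long run. The sketch below mirrors the fallback described above using only the standard library; the helper name resolve_word2vec_model is hypothetical and is not part of the package.

```python
from pathlib import Path

# Default location named in the text above
DEFAULT_W2V_MODEL = Path("models/latin_w2v_bamman_lemma300_100_1.model")


def resolve_word2vec_model(model_path=None):
    """Return the model path to use, raising early if the file is missing."""
    path = Path(model_path) if model_path else DEFAULT_W2V_MODEL
    if not path.is_file():
        raise FileNotFoundError(f"Word2Vec model not found: {path}")
    return path


try:
    model = resolve_word2vec_model()
    print(f"Using model: {model}")
except FileNotFoundError as err:
    print(err)
```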

Options

  • Input/Output:

    • query: Path to query document CSV file (columns: seg_id, text)
    • source: Path to source document CSV file (columns: seg_id, text)
    • -o, --output: Path to output CSV file for results (required)
  • Models:

    • --classification-model: HuggingFace model for classification (default: xlm-roberta-large-class-lat-intertext-v1)
    • --embedding-model: HuggingFace model for embeddings (default: multilingual-e5-large-emb-lat-intertext-v1)
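    • --word2vec-model-path: Local path to a gensim .model file (Word2Vec pipeline)
    • --latin-bert-model: HuggingFace model for the Latin BERT pipeline (see the Latin BERT Retrieval example)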
  • Pipeline Parameters:

    • --pipeline: Select two-stage, word2vec-retrieval, or latin-bert-retrieval (default: two-stage)
    • -k, --top-k: Number of top candidates to retrieve per query segment (default: 10)
    • -t, --threshold: Decision threshold for output filtering (default: 0.85)
    • --word2vec-interval: Max token gap for Word2Vec bigrams (default: 0)
    • --word2vec-order-free: Enable order-insensitive Word2Vec bigrams
  • Device:

    • --device: Choose auto, cuda, mps, or cpu (default: auto-detect)
  • Other:

    • -v, --verbose: Enable detailed progress output
    • -h, --help: Show help message
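
For batch runs, the options above can be assembled from Python and handed to subprocess. This is a minimal sketch that uses only flags listed in this section; the build_command helper is hypothetical, and the actual call is left commented out because it requires locisimiles to be installed.

```python
import shlex
import subprocess


def build_command(query, source, output, pipeline="two-stage",
                  top_k=10, threshold=0.85, device=None, verbose=False):
    """Assemble the argument list for the locisimiles CLI."""
    cmd = [
        "locisimiles", query, source,
        "-o", output,
        "--pipeline", pipeline,
        "--top-k", str(top_k),
        "--threshold", str(threshold),
    ]
    if device:
        cmd += ["--device", device]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_command("query.csv", "source.csv", "results.csv", top_k=20, device="cpu")
print(shlex.join(cmd))

# Uncomment to run the pipeline (requires the locisimiles CLI on PATH):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.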

Output Format

The CLI saves results to a CSV file with the following columns:

  • query_id: Query segment identifier
  • query_text: Query text content
  • source_id: Source segment identifier
  • source_text: Source text content
  • similarity: Cosine similarity score (0-1)
  • probability: Classification confidence (0-1)
  • above_threshold: "Yes" if probability ≥ threshold, otherwise "No"
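
A results file with these columns can be post-processed with the standard csv module. The sketch below keeps only rows flagged "Yes" and converts the two scores to floats. The sample rows are invented for illustration, and load_matches is a hypothetical helper, not part of the package.

```python
import csv
import io

# Invented sample rows in the documented column layout
sample = io.StringIO(
    "query_id,query_text,source_id,source_text,similarity,probability,above_threshold\n"
    "q_1,placeholder query,s_7,placeholder source,0.91,0.97,Yes\n"
    "q_1,placeholder query,s_9,placeholder source,0.80,0.42,No\n"
)


def load_matches(handle):
    """Return rows marked above_threshold == 'Yes', with numeric scores."""
    matches = []
    for row in csv.DictReader(handle):
        if row["above_threshold"] == "Yes":
            row["similarity"] = float(row["similarity"])
            row["probability"] = float(row["probability"])
            matches.append(row)
    return matches


matches = load_matches(sample)
for match in matches:
    print(match["query_id"], match["source_id"], match["probability"])
```

For a real run, replace the in-memory sample with open("results.csv", newline="", encoding="utf-8").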

Optional Gradio GUI

Install the optional GUI extra to experiment with a minimal Gradio front end:

pip install "locisimiles[gui]"

Launch the interface from the command line:

locisimiles-gui

In the GUI, choose Word2Vec Retrieval (Burns-Style) in Pipeline Configuration to enable Word2Vec controls:

  • Word2Vec Model Path: local gensim .model file
  • Bigram Interval: token gap for bigram generation
  • Order-Free Bigrams: optional order-insensitive matching
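
The two bigram controls can be pictured with a small sketch. It assumes that the interval is the maximum number of tokens allowed between the two members of a bigram, and that order-free matching treats (a, b) and (b, a) as the same pair. Both readings are assumptions made for illustration; the package's own implementation may differ, and gap_bigrams is a hypothetical name.

```python
def gap_bigrams(tokens, interval=0, order_free=False):
    """Pair each token with the tokens that follow it within `interval` skipped tokens."""
    pairs = []
    for i, first in enumerate(tokens):
        # interval=0 pairs only adjacent tokens; each extra step allows one more skipped token
        for j in range(i + 1, min(i + 2 + interval, len(tokens))):
            pair = (first, tokens[j])
            if order_free:
                pair = tuple(sorted(pair))  # ignore word order within the pair
            pairs.append(pair)
    return pairs


tokens = ["arma", "vir", "cano"]  # placeholder lemmatized tokens
print(gap_bigrams(tokens))
print(gap_bigrams(tokens, interval=1))
print(gap_bigrams(tokens, interval=1, order_free=True))
```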

If the model path is invalid or missing, processing fails with a clear error message.
