GPU-accelerated semantic similarity and verse resonance explorer.

echoverse

Find hidden echoes across massive text corpora—with GPU power.



Overview

echoverse is a Python module and CLI tool for discovering semantically similar pairs (“echoes”) in large collections of text. Whether you're analyzing verse, literature, or academic works, echoverse uses GPU acceleration (CUDA) to make all-pairs comparison practical at scale: millions of pairs in under a minute and billions in minutes to about an hour (see Benchmarks).

Use it to uncover thematic resonance, detect plagiarism, power search engines, or build next-gen literary analysis tools.


Features

  • All-pairs semantic similarity: Find every matching text pair above a given threshold.
  • GPU acceleration: Built on CUDA (via PyCUDA) and NumPy for high-throughput comparison.
  • 💾 Flexible I/O: Accepts embeddings from any model, exports to clean CSV format.
  • 🚀 CLI & Library ready: Use as a standalone tool or integrate into your Python workflow.
  • 🔧 Batch-safe: Handles large-scale embeddings with chunking and memory control.
  • ⚖️ Configurable: Tune thresholds, verbosity, filtering, and more.
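
To make the all-pairs and chunking ideas above concrete, here is a minimal CPU-only sketch in plain NumPy. It is not echoverse's implementation: the function name all_pairs_above_threshold, the chunk_size parameter, and the choice of >= for the threshold test are all assumptions made for illustration. It expects rows that are already L2-normalized, so a dot product equals cosine similarity.

```python
import numpy as np

def all_pairs_above_threshold(embeddings, threshold, chunk_size=1024):
    """CPU reference (hypothetical helper, not part of echoverse).

    Returns (i, j, similarity) for every i < j whose cosine similarity
    is >= threshold. Rows of `embeddings` must already be unit length.
    """
    n = embeddings.shape[0]
    hits = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk, n) slab of the similarity matrix is held in memory.
        sims = embeddings[start:stop] @ embeddings.T
        rows, cols = np.nonzero(sims >= threshold)
        rows_global = rows + start
        # Keep i < j only: drops self-pairs and mirrored duplicates.
        keep = rows_global < cols
        for i, j in zip(rows_global[keep], cols[keep]):
            hits.append((int(i), int(j), float(sims[i - start, j])))
    return hits

# Tiny demo: rows 0 and 2 are identical unit vectors, row 1 is orthogonal.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
pairs = all_pairs_above_threshold(vectors, threshold=0.85, chunk_size=2)
print(pairs)
```

Processing one slab of rows at a time bounds peak memory by chunk_size × n similarity values instead of n × n, which is the general idea behind batch-safe processing.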

Installation

pip install echoverse
# Or clone from source:
# git clone https://github.com/buadofalbhain/echoverse.git
# cd echoverse
# pip install .

Requirements

  • Python 3.8+
  • PyCUDA
  • NumPy
  • (Optional) tqdm for progress bars

⚠️ Requires a CUDA-capable NVIDIA GPU (Compute Capability ≥ 6.1).


Use Cases

  • Plagiarism Detection & Proof of Ownership: Detect semantically similar passages, even when reworded. Prove authorship by tracing echoes of original work across other texts.
  • Literary Analysis & Intertextuality: Explore hidden connections between verses, books, or traditions. Build resonance maps between authors, genres, or historical periods.
  • Content Recommendation: Suggest similar articles, verses, or ideas based on deep meaning.
  • Dataset Deduplication & Clustering: Eliminate redundancy and group similar entries intelligently.
  • Semantic Search & Retrieval: Power AI-enhanced search engines for textual archives.
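
For the deduplication and clustering use case, one possible way to turn a list of echo pairs into groups is to treat each pair as an edge and collect connected components with union-find. The sketch below is an illustration only; cluster_pairs is a hypothetical helper, not an echoverse function, and it assumes pairs are given as (index1, index2, similarity) tuples.

```python
def cluster_pairs(num_items, pairs):
    """Group item indices that are linked by at least one echo pair.

    Hypothetical helper: `pairs` is an iterable of (i, j, similarity).
    Returns only groups with two or more members.
    """
    parent = list(range(num_items))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, _similarity in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root under the smaller one for stable output.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for item in range(num_items):
        groups.setdefault(find(item), []).append(item)
    return [members for members in groups.values() if len(members) > 1]

pairs = [(0, 2, 0.91), (2, 5, 0.88), (1, 4, 0.93)]
clusters = cluster_pairs(6, pairs)
print(clusters)
```

Note that connected components are transitive: items 0 and 5 end up in the same group even though they were never directly compared above the threshold. Whether that is desirable depends on the corpus.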

Quick Example

from echoverse import compute_all_pairs_batched_gpu, normalize_embeddings
import numpy as np

# Load and normalize your embeddings
embeddings = np.load("my_corpus_embeddings.npy")
embeddings = normalize_embeddings(embeddings)

# Find all pairs above 0.85 similarity
results = compute_all_pairs_batched_gpu(embeddings, threshold=0.85)

# results is a NumPy structured array: (index1, index2, similarity)
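
Why normalize first? With unit-length rows, a plain dot product equals cosine similarity, which keeps the pairwise comparison to a single matrix multiply. The sketch below shows standard L2 normalization in NumPy; l2_normalize is a hypothetical stand-in written for illustration, and it is only an assumption that normalize_embeddings behaves the same way.

```python
import numpy as np

def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit length (hypothetical stand-in, see note above)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero rows.
    return embeddings / np.maximum(norms, eps)

vectors = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize(vectors)

# After normalization, the dot product of two rows is their cosine similarity.
cosine = float(unit[0] @ unit[1])
print(unit, cosine)
```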

CLI Usage

python -m echoverse_cli \
  --input my_embeddings.json \
  --output echoes.csv \
  --threshold 0.85 \
  --mode allpairs

Output Format

The output CSV contains:

| Column | Description |
| --- | --- |
| ID1 | Index or ID of the first text/verse |
| ID2 | Index or ID of the second text/verse |
| Similarity | Cosine similarity score (float) |
| Text1 (opt.) | Text of the first item (if available) |
| Text2 (opt.) | Text of the second item (if available) |

Example:

ID1,ID2,Similarity,Text1,Text2
42,311,0.876,"In the beginning...","And so it was..."
...
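
A file in this layout can be read back with Python's standard csv module. The sketch below parses the example row above from an in-memory string; load_echoes and min_similarity are hypothetical names used for illustration. IDs are kept as strings because the column may hold either an index or an ID.

```python
import csv
import io

# The header and example row shown above, as an in-memory file.
sample = (
    "ID1,ID2,Similarity,Text1,Text2\n"
    '42,311,0.876,"In the beginning...","And so it was..."\n'
)

def load_echoes(handle, min_similarity=0.0):
    """Read echo rows from an open CSV handle (hypothetical helper)."""
    rows = []
    for record in csv.DictReader(handle):
        score = float(record["Similarity"])
        if score >= min_similarity:
            # Text1/Text2 are optional columns, so use .get() for them.
            rows.append(
                (record["ID1"], record["ID2"], score,
                 record.get("Text1"), record.get("Text2"))
            )
    return rows

echoes = load_echoes(io.StringIO(sample), min_similarity=0.85)
print(echoes)
```

For a file on disk, pass a handle opened with open("echoes.csv", newline="", encoding="utf-8") instead of the StringIO object.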

Benchmarks

| Dataset Size | Pairs Compared | Runtime (A100 GPU) | Notes |
| --- | --- | --- | --- |
| 10k | ~50M | ~45 seconds | Medium corpus |
| 100k | ~5B | ~12 minutes | Large corpus |
| 250k | ~31B | ~1 hour | Bible-scale |

🔄 A CPU-only version would be expected to take days to weeks for the same tasks.
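
The "Pairs Compared" figures follow from the number of unordered pairs among n texts, n(n − 1)/2. The short sketch below reproduces that arithmetic and also estimates how large a dense float32 similarity matrix would be; both helper names are hypothetical, and the 4-byte value size is an assumption.

```python
def unordered_pairs(n):
    """Number of distinct pairs (i, j) with i < j among n items."""
    return n * (n - 1) // 2

def dense_similarity_bytes(n, bytes_per_value=4):
    """Size of a full n x n similarity matrix, assuming float32 values."""
    return n * n * bytes_per_value

for n in (10_000, 100_000, 250_000):
    print(f"{n:>7,} texts -> {unordered_pairs(n):,} pairs, "
          f"dense matrix {dense_similarity_bytes(n) / 1e9:.1f} GB")
```

The exact counts (49,995,000; 4,999,950,000; 31,249,875,000) round to the ~50M, ~5B, and ~31B in the benchmark figures. At 100k texts a dense float32 matrix would already need 40 GB, which is why all-pairs tools generally process the matrix in chunks and keep only the pairs above the threshold.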


API Reference

  • normalize_embeddings(np.ndarray) -> np.ndarray
  • compute_all_pairs_batched_gpu(np.ndarray, threshold: float) -> np.ndarray
  • compute_similarity_cuda_filtered(np.ndarray, threshold: float) -> np.ndarray

See docs/ for detailed parameters, modes, and customization options.


Roadmap

  • CPU fallback mode
  • Sparse matrix mode for memory-constrained environments
  • LangChain / HuggingFace integration
  • Interactive web-based visualization of echo networks

Contributing

We welcome contributions of all kinds:

  • New features, bug fixes, and optimization ideas
  • Documentation, tutorials, and example datasets
  • Integrations with external tools or frameworks

See CONTRIBUTING.md to get started.


License

MIT License — do whatever you want, just credit the work.


Acknowledgments

Built with love by the open-source community—powered by CUDA, NumPy, and the spirit of discovery.


Ready to dive in? Get started here →
