GPU-accelerated semantic similarity and verse resonance explorer.
echoverse
Find hidden echoes across massive text corpora—with GPU power.
Table of Contents
- Overview
- Features
- Installation
- Requirements
- Use Cases
- Quick Example
- CLI Usage
- Output Format
- Benchmarks
- API Reference
- Roadmap
- Contributing
- License
- Acknowledgments
Overview
echoverse is a Python module and CLI tool for discovering semantically similar pairs (“echoes”) in large collections of text. Whether you're analyzing verse, literature, or academic works, echoverse uses GPU acceleration (CUDA) to compare millions to billions of text pairs in minutes to about an hour, depending on corpus size (see Benchmarks).
Use it to uncover thematic resonance, detect plagiarism, power search engines, or build next-gen literary analysis tools.
Features
- ✨ All-pairs semantic similarity: Find every matching text pair above a given threshold.
- ⚡ GPU acceleration: Built with CUDA and NumPy for extreme performance.
- 💾 Flexible I/O: Accepts embeddings from any model and exports results as clean CSV.
- 🚀 CLI & Library ready: Use as a standalone tool or integrate into your Python workflow.
- 🔧 Batch-safe: Handles large-scale embeddings with chunking and memory control.
- ⚖️ Configurable: Tune thresholds, verbosity, filtering, and more.
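To make the all-pairs and chunking ideas above concrete, here is a minimal CPU reference sketch in plain NumPy. It is not echoverse's CUDA implementation; the function name `all_pairs_above_threshold`, the `chunk_size` parameter, and the choice to treat the threshold as inclusive are assumptions made for illustration only.

```python
import numpy as np

def all_pairs_above_threshold(embeddings, threshold, chunk_size=1024):
    """CPU reference: return (i, j, similarity) for every pair i < j at or above threshold."""
    # Normalize rows to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    n = unit.shape[0]
    hits = []
    # Process rows in chunks so only a (chunk_size x n) block is held in memory at once.
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = unit[start:stop] @ unit.T
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i = start + int(r)
            j = int(c)
            if i < j:  # keep each unordered pair once and skip self-matches
                hits.append((i, j, float(sims[r, c])))
    return hits

# Tiny demo: the first two vectors point in nearly the same direction.
demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = all_pairs_above_threshold(demo, threshold=0.85, chunk_size=2)
print(pairs)
```

A reference like this is only practical for small inputs, but it can be handy for spot-checking GPU output on a sample of a corpus.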
Installation
pip install echoverse
# Or clone from source:
# git clone https://github.com/buadofalbhain/echoverse.git
# cd echoverse
# pip install .
Requirements
- Python 3.8+
- PyCUDA
- NumPy
- (Optional) tqdm for progress bars
⚠️ Requires a CUDA-capable NVIDIA GPU (Compute Capability ≥ 6.1).
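One way to check the GPU requirement before installing is to query the device through PyCUDA. The sketch below is not part of echoverse; the helper names `meets_minimum` and `detect_capability` are hypothetical, and the script simply reports that no device was found if PyCUDA or a GPU is missing.

```python
MIN_CAPABILITY = (6, 1)  # minimum stated in the requirements above

def meets_minimum(capability, minimum=MIN_CAPABILITY):
    """Compare (major, minor) tuples, e.g. (7, 0) >= (6, 1)."""
    return tuple(capability) >= tuple(minimum)

def detect_capability(device_index=0):
    """Return (major, minor) for a device, or None if PyCUDA or a GPU is unavailable."""
    try:
        import pycuda.driver as cuda
        cuda.init()
        return cuda.Device(device_index).compute_capability()
    except Exception:
        return None

capability = detect_capability()
if capability is None:
    print("No CUDA device detected (or PyCUDA is not installed).")
elif meets_minimum(capability):
    print(f"Compute capability {capability[0]}.{capability[1]} is sufficient.")
else:
    print(f"Compute capability {capability[0]}.{capability[1]} is below 6.1.")
```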
Use Cases
- Plagiarism Detection & Proof of Ownership: Detect semantically similar passages, even when reworded. Prove authorship by tracing echoes of original work across other texts.
- Literary Analysis & Intertextuality: Explore hidden connections between verses, books, or traditions. Build resonance maps between authors, genres, or historical periods.
- Content Recommendation: Suggest similar articles, verses, or ideas based on deep meaning.
- Dataset Deduplication & Clustering: Eliminate redundancy and group similar entries intelligently.
- Semantic Search & Retrieval: Power AI-enhanced search engines for textual archives.
Quick Example
from echoverse import compute_all_pairs_batched_gpu, normalize_embeddings
import numpy as np
# Load and normalize your embeddings
embeddings = np.load("my_corpus_embeddings.npy")
embeddings = normalize_embeddings(embeddings)
# Find all pairs above 0.85 similarity
results = compute_all_pairs_batched_gpu(embeddings, threshold=0.85)
# results is a NumPy structured array: (index1, index2, similarity)
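Once you have the results array, ordinary NumPy indexing is enough to rank or filter the echoes. The sketch below uses a stand-in array rather than real output; the field names `index1`, `index2`, and `similarity` and their dtypes are assumptions taken from the comment above, so check `results.dtype.names` on real output before relying on them.

```python
import numpy as np

# Stand-in for the array returned by compute_all_pairs_batched_gpu.
# Field names and dtypes are assumptions, not confirmed by the library.
pair_dtype = np.dtype([
    ("index1", np.int64),
    ("index2", np.int64),
    ("similarity", np.float32),
])
results = np.array([(0, 5, 0.91), (2, 7, 0.87), (1, 3, 0.95)], dtype=pair_dtype)

# Sort from strongest to weakest echo.
ranked = results[np.argsort(results["similarity"])[::-1]]

# Keep only the pairs that involve one particular text (index 2 here).
involving_2 = results[(results["index1"] == 2) | (results["index2"] == 2)]

for row in ranked:
    print(int(row["index1"]), int(row["index2"]), round(float(row["similarity"]), 3))
```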
CLI Usage
python -m echoverse_cli \
--input my_embeddings.json \
--output echoes.csv \
--threshold 0.85 \
--mode allpairs
Output Format
The output CSV contains:
| Column | Description |
|---|---|
| ID1 | Index or ID of the first text/verse |
| ID2 | Index or ID of the second text/verse |
| Similarity | Cosine similarity score (float) |
| Text1 (opt.) | Text of the first item (if available) |
| Text2 (opt.) | Text of the second item (if available) |
Example:
ID1,ID2,Similarity,Text1,Text2
42,311,0.876,"In the beginning...","And so it was..."
...
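The CSV can be read back with Python's standard `csv` module. This is a small sketch, not a helper shipped with echoverse: `load_echoes` is a hypothetical name, and the sample row is copied from the example above so the snippet runs without a real output file.

```python
import csv
import io

# Sample taken from the example above; in practice, open "echoes.csv" instead.
sample = (
    "ID1,ID2,Similarity,Text1,Text2\n"
    '42,311,0.876,"In the beginning...","And so it was..."\n'
)

def load_echoes(handle, min_similarity=0.0):
    """Parse echo rows, converting Similarity to float and filtering by score."""
    rows = []
    for row in csv.DictReader(handle):
        score = float(row["Similarity"])
        if score >= min_similarity:
            rows.append({
                "id1": row["ID1"],  # kept as text, since IDs may not be integers
                "id2": row["ID2"],
                "similarity": score,
                # Text columns are optional, so fall back to None when absent.
                "text1": row.get("Text1"),
                "text2": row.get("Text2"),
            })
    return rows

echoes = load_echoes(io.StringIO(sample), min_similarity=0.85)
print(echoes)
```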
Benchmarks
| Dataset Size | Pairs Compared | Runtime (A100 GPU) | Notes |
|---|---|---|---|
| 10k | ~50M | ~45 seconds | Medium corpus |
| 100k | ~5B | ~12 minutes | Large corpus |
| 250k | ~31B | ~1 hour | Bible-scale |
🔄 A CPU-only version would take days to weeks for the same tasks.
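The "Pairs Compared" column follows from counting unordered pairs: n texts give n(n - 1)/2 distinct pairs, which is roughly n²/2 for large n. A short sketch to reproduce the figures (the function name `unordered_pairs` is just for illustration):

```python
def unordered_pairs(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (10_000, 100_000, 250_000):
    print(f"{n:>7,} texts -> {unordered_pairs(n):,} pairs")
```

Because the pair count grows quadratically, doubling the corpus size roughly quadruples the number of comparisons.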
API Reference
- normalize_embeddings(np.ndarray) -> np.ndarray
- compute_all_pairs_batched_gpu(np.ndarray, threshold: float) -> np.ndarray
- compute_similarity_cuda_filtered(np.ndarray, threshold: float) -> np.ndarray
See docs/ for detailed parameters, modes, and customization options.
Roadmap
- CPU fallback mode
- Sparse matrix mode for memory-constrained environments
- LangChain / HuggingFace integration
- Interactive web-based visualization of echo networks
Contributing
We welcome contributions of all kinds:
- New features, bug fixes, and optimization ideas
- Documentation, tutorials, and example datasets
- Integrations with external tools or frameworks
See CONTRIBUTING.md to get started.
License
MIT License — do whatever you want, just credit the work.
Acknowledgments
Built with love by the open-source community—powered by CUDA, NumPy, and the spirit of discovery.
Ready to dive in? Get started here →