A Python package to compress text embeddings using various dimensionality reduction and quantization techniques, and evaluate their quality with a comprehensive intrinsic framework including the EOSk metric.

TextEmbedCompress

TextEmbedCompress is a Python toolkit designed for compressing text embeddings and rigorously evaluating their quality using a comprehensive suite of intrinsic, task-agnostic metrics. It features several dimensionality reduction techniques, int8 quantization, and a novel spectral fidelity measure, $EOS_k$, to assess the preservation of semantic structure beyond dominant variance components.

This framework allows researchers and practitioners to make informed decisions about embedding compression strategies, balancing computational efficiency with the preservation of meaningful information.

Key Features

Embedding Model Support:
- Easily load and use models from the Sentence Transformers library.
- Support for static word embedding models like GloVe, Word2Vec, and FastText (sentence embeddings derived via averaging).
Compression Techniques:
- Dimensionality Reduction (DR):
  - Principal Component Analysis (PCA)
  - Independent Component Analysis (ICA)
  - Gaussian Random Projections (RP)
  - Factor Analysis (FA)
  - Uniform Manifold Approximation and Projection (UMAP)
  - Pairwise Controlled Manifold Approximation (PaCMAP)
- Quantization:
  - Symmetric int8 per-tensor quantization.
Comprehensive Intrinsic Evaluation Framework:
- Local Neighborhood Fidelity:
  - Trustworthiness ($T_k$)
  - Continuity ($C_k$)
  - Mean Relative Rank Error ($MRRE_k$)
  - Neighborhood Precision at k ($NP_k$)
  - Local Average Procrustes Measure (LPro)
- Global Geometry Fidelity:
  - Kruskal's Stress (KS)
  - Spearman Distance Correlation (SDC)
  - Pearson Distance Correlation (PDC)
  - Global Procrustes Measure (GPro)
- Spectral Retention / Information Fidelity:
  - Explained Variance Ratio (EVR)
  - Pairwise Inner-Product (PIP) Loss
  - Eigenspace Overlap Score (EOS)
  - Novel $EOS_k$ (Residual Eigenspace Overlap Score): Our proposed metric to evaluate semantic preservation after removing top-k principal components, offering a more nuanced view of information retention.
Easy-to-Use Pipeline: A streamlined EmbeddingPipeline class to manage the workflow from embedding generation to compression and evaluation.
Reproducibility: Control over random states for stochastic DR methods.
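
To make the "symmetric int8 per-tensor quantization" entry above concrete, here is a minimal NumPy sketch of the general technique. It is not the package's own code: the function names `quantize_int8_symmetric` and `dequantize_int8` are hypothetical, and the choice of mapping the largest absolute value to 127 is an assumption about how the single shared scale is picked.

```python
import numpy as np


def quantize_int8_symmetric(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a float array to int8 using one scale for the whole tensor."""
    max_abs = float(np.max(np.abs(x)))
    # Guard against an all-zero input, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Symmetric: zero maps to zero and the range is [-127, 127].
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale


def dequantize_int8(codes: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float32 values."""
    return codes.astype(np.float32) * scale


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 32)).astype(np.float32)

codes, scale = quantize_int8_symmetric(embeddings)
restored = dequantize_int8(codes, scale)

# The rounding error of each value is at most half a quantization step.
max_error = float(np.max(np.abs(embeddings - restored)))
print(f"scale={scale:.5f}, max abs error={max_error:.5f}")
```

Because one scale covers the whole tensor, a single outlier widens the quantization step for every value; that is the usual trade-off of per-tensor schemes compared with per-row or per-dimension ones.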

Installation

You can install TextEmbedCompress directly from PyPI:

pip install TextEmbedCompress

Quick-Start

Here's a simple example of how to use TextEmbedCompress:

from textembedcompress import EmbeddingPipeline 
import json

# 1. Initialize the pipeline with a sentence-transformer model
# This will automatically use CUDA if available, otherwise CPU.
pipeline = EmbeddingPipeline(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2")

# 2. Define your dataset (can be Hugging Face dataset ID, path to local file, or list of texts)
# For this example, we'll use a dummy list of texts.
sample_texts = [
    "This is the first sentence for testing.",
    "Here is another sentence, quite different from the first.",
    "Embeddings are numerical representations of text.",
    "Compressing embeddings can save space and speed up computations.",
    "Evaluation helps to understand the quality of compressed embeddings."
] * 20  # Repeated to reach 100 samples so DR can run; duplicates add no new information, so use a real corpus for meaningful metrics

# 3. Run the full compression and evaluation pipeline
results = pipeline.run(
    dataset_identifier=sample_texts,
    dr_method="pca",
    target_dim=32, # Target dimension after DR
    quantization_method="int8", # Apply int8 quantization
    output_dir="output/all-MiniLM-L6-v2_pca32_int8", # Directory to save results
    k_val_for_local_metrics=5,  # k for T_k, C_k, NP_k, MRRE_k, LPro
    k_for_eos_k=2,              # Number of top components to remove for EOS_k
    n_sub_for_eos_k=10,         # Subspace dimension for EOS_k overlap calculation
    distance_metric_for_eval='cosine', # Distance metric for nn-based evaluations
    random_state_for_dr=42
)

# 4. Access results
print("\n--- Compression Info ---")
# Ensure results["compression_info"] is serializable for json.dumps
serializable_compression_info = {k: str(v) for k, v in results["compression_info"].items()}
print(json.dumps(serializable_compression_info, indent=2))

print("\n--- Evaluation Metrics ---")
serializable_evaluation_metrics = {
    k: float(v) if isinstance(v, (int, float)) else str(v)
    for k, v in results["evaluation_metrics"].items()
}
print(json.dumps(serializable_evaluation_metrics, indent=2))

print(f"\nCompressed embeddings and metrics saved to: {results['output_location']}")
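
As a rough check on what the settings in this example buy in storage, the arithmetic can be worked through directly. This sketch assumes the original embeddings are stored as float32 and uses the 384-dimensional output size of all-MiniLM-L6-v2; it ignores the small overhead of storing the quantization scale and the fitted PCA projection.

```python
# Assumptions: float32 originals (4 bytes per value), 384-dim model output,
# and int8 storage (1 byte per value) after PCA to 32 dimensions.
n_texts = 100
original_dim, original_bytes_per_value = 384, 4
compressed_dim, compressed_bytes_per_value = 32, 1

original_size = n_texts * original_dim * original_bytes_per_value
compressed_size = n_texts * compressed_dim * compressed_bytes_per_value
ratio = original_size / compressed_size

print(f"original:   {original_size} bytes")
print(f"compressed: {compressed_size} bytes")
print(f"ratio:      {ratio:.0f}x")
```

Of the 48x reduction, a factor of 12 comes from dimensionality reduction (384 to 32) and a factor of 4 from quantization (4 bytes to 1 byte per value).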

Usage Details

The core of the package is the EmbeddingPipeline class.

`EmbeddingPipeline(model_name_or_path, device=None, trust_remote_code=True)`

model_name_or_path: Name of a Sentence Transformers model (e.g., "sentence-transformers/all-MiniLM-L6-v2") or path to a static embedding file (e.g., "path/to/glove.840B.300d.txt").
device: Optional; "cpu" or "cuda". Defaults to "cuda" if available, else "cpu".
trust_remote_code: Whether to allow loading Hugging Face models that ship their own custom code. Defaults to True.

Main Methods

.embed(dataset_identifier, text_column_names=None, dataset_split=None, batch_size=32, show_progress_bar=True):
- Loads the specified dataset and generates original embeddings.
- dataset_identifier: Hugging Face dataset name (e.g., "imdb"), a tuple for HF datasets with configs (e.g., ("glue", "mrpc")), path to a local file, or a list of Python strings.
- text_column_names: List of column names containing text in an HF dataset.
- dataset_split: E.g., "train", "test", "train[:10%]".
.compress(dr_method='pca', target_dim=128, quantization_method='int8', ...):
- Applies dimensionality reduction and/or quantization.
- dr_method: One of 'pca', 'ica', 'rp', 'fa', 'umap', 'pacmap', or 'none'.
- target_dim: Desired output dimension after DR.
- quantization_method: 'int8' or 'none'.
.evaluate(distance_metric_for_eval='cosine', k_val_for_local_metrics=10, k_for_eos_k=5, n_sub_for_eos_k=10, ...):
- Computes all evaluation metrics comparing original and compressed embeddings.
- distance_metric_for_eval: Distance used for neighbor-based metrics ('cosine' or 'euclidean').
- k_val_for_local_metrics: Neighborhood size for the local metrics $T_k$, $C_k$, $MRRE_k$, $NP_k$, and LPro.
- k_for_eos_k: Number of dominant components to remove before $EOS_k$ calculation.
- n_sub_for_eos_k: Subspace dimension for overlap calculation in $EOS_k$.
.run(...): A convenience method that calls embed, compress, evaluate, and save_results sequentially. Takes parameters for all these steps.
.save_results(output_dir):
- Saves compressed embeddings (as .npy), compression info (as .json), and evaluation metrics (as .json) to the specified directory.
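
To illustrate how `k_val_for_local_metrics` and `distance_metric_for_eval` enter a neighbor-based metric, here is a minimal NumPy sketch of neighborhood precision at k: the average fraction of each point's k nearest neighbors in the original space that are still among its k nearest neighbors after compression. This is a sketch of the general idea, not the package's implementation; the helper names (`pairwise_distances`, `knn_indices`, `neighborhood_precision`) are hypothetical, and the exact definition used by `.evaluate` may differ.

```python
import numpy as np


def pairwise_distances(x: np.ndarray, metric: str = "cosine") -> np.ndarray:
    """Dense pairwise distance matrix for 'cosine' or 'euclidean'."""
    if metric == "cosine":
        # Assumes no all-zero rows, which would divide by zero here.
        unit = x / np.linalg.norm(x, axis=1, keepdims=True)
        return 1.0 - unit @ unit.T
    if metric == "euclidean":
        sq = np.sum(x**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
        return np.sqrt(np.maximum(d2, 0.0))
    raise ValueError(f"unsupported metric: {metric}")


def knn_indices(dist: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row, excluding the row itself."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]


def neighborhood_precision(original, compressed, k=10, metric="cosine") -> float:
    """Mean overlap between k-NN sets in the original and compressed spaces."""
    nn_orig = knn_indices(pairwise_distances(original, metric), k)
    nn_comp = knn_indices(pairwise_distances(compressed, metric), k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_orig, nn_comp)]
    return float(np.mean(overlaps))


rng = np.random.default_rng(0)
original = rng.normal(size=(200, 64))
# Stand-in for a compressed version: a Gaussian random projection to 16 dims.
projection = rng.normal(size=(64, 16)) / np.sqrt(16)
compressed = original @ projection

np_identity = neighborhood_precision(original, original, k=10)
np_projected = neighborhood_precision(original, compressed, k=10)
print(f"identity: {np_identity:.3f}, random projection: {np_projected:.3f}")
```

A score of 1.0 means every neighborhood is preserved exactly; smaller k makes the metric stricter about the very closest neighbors, while larger k measures agreement over broader neighborhoods.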

The $EOS_k$ Metric

The $EOS_k$ (Residual Eigenspace Overlap Score) is a novel metric designed to assess how well semantic structure is preserved in compressed embeddings after accounting for and removing the influence of the top-$k$ most dominant principal components (directions of highest variance).

The rationale is that these top components often capture broad, sometimes task-agnostic or even noisy, variance that can overshadow more subtle, task-relevant semantic features. $EOS_k$ works by:

- Calculating residual embeddings for the original ($\mathbf{X}'$) and compressed ($\mathbf{Z}'$) spaces by subtracting each matrix's projection onto its own top-$k$ right singular vectors.
- Performing SVD on the residuals $\mathbf{X}'$ and $\mathbf{Z}'$.
- Comparing the alignment of the top $N_{sub}$ principal directions (derived from the left singular vectors) of these "cleaned" residual subspaces.

A high $EOS_k$ score indicates that structure beyond the dominant components, where subtler task-relevant features are expected to lie, is well preserved.
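
The three steps above can be sketched in NumPy as follows. This is an illustrative reading of the description, not the package's implementation: the function names are hypothetical, mean-centering before the SVD is an assumption, and so is the normalization (squared Frobenius norm of the overlap divided by $N_{sub}$, which keeps the score in $[0, 1]$).

```python
import numpy as np


def residual_after_top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Subtract the projection of x onto its top-k right singular vectors."""
    if k == 0:
        return x
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    top = vt[:k]  # shape (k, d): dominant directions in feature space
    return x - (x @ top.T) @ top


def eos_k(original: np.ndarray, compressed: np.ndarray, k: int = 5, n_sub: int = 10) -> float:
    """Overlap of the top n_sub left singular subspaces of the two residuals."""
    # Assumption: both matrices are mean-centered first, as in PCA.
    x = original - original.mean(axis=0)
    z = compressed - compressed.mean(axis=0)

    # After centering and removing k directions, only this many remain.
    max_rank = min(x.shape[0] - 1, x.shape[1], z.shape[1]) - k
    if n_sub > max_rank:
        raise ValueError(f"n_sub={n_sub} exceeds the residual rank {max_rank}")

    ux, _, _ = np.linalg.svd(residual_after_top_k(x, k), full_matrices=False)
    uz, _, _ = np.linalg.svd(residual_after_top_k(z, k), full_matrices=False)
    ux, uz = ux[:, :n_sub], uz[:, :n_sub]

    # Left singular vectors live in sample space, so the two subspaces are
    # comparable even though x and z have different numbers of columns.
    return float(np.linalg.norm(ux.T @ uz, "fro") ** 2 / n_sub)


rng = np.random.default_rng(0)
original = rng.normal(size=(200, 64))
projection = rng.normal(size=(64, 16)) / np.sqrt(16)
compressed = original @ projection  # stand-in for a compressed embedding matrix

score_identity = eos_k(original, original, k=2, n_sub=10)
score_projected = eos_k(original, compressed, k=2, n_sub=10)
print(f"identity: {score_identity:.3f}, random projection: {score_projected:.3f}")
```

Comparing left rather than right singular vectors is what makes the score usable across spaces of different dimensionality, since both residual matrices share the same number of rows (one per text).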

Contributing

Contributions are welcome! Whether it's bug reports, feature suggestions, or pull requests, please feel free to engage.

Reporting Issues: Please use the GitHub issue tracker.
Feature Requests: Submit an issue detailing the feature and its use case.
Pull Requests:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Ensure your code adheres to quality standards (e.g., run linters, add tests).
- Submit a pull request with a clear description of your changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Release history

- 0.3.0 (this version): Jun 5, 2025
- 0.1.0: Jun 5, 2025

Download files

- Source Distribution: textembedcompress-0.3.0.tar.gz (22.3 kB), uploaded Jun 5, 2025
- Built Distribution: textembedcompress-0.3.0-py3-none-any.whl (24.4 kB), uploaded Jun 5, 2025, Python 3

File details

Details for the file textembedcompress-0.3.0.tar.gz.

File metadata

Download URL: textembedcompress-0.3.0.tar.gz
Upload date: Jun 5, 2025
Size: 22.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.9

File hashes

Hashes for textembedcompress-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`64eed01577bd6ba3ead2308e1ff49ee178dc91a68706cacb702e3ab5475a5065`
MD5	`aa8b3dccb739e5ab0c3e755fcbd13695`
BLAKE2b-256	`efe3783c788b8f63431261f418bedd20c6b90696fa4736a6881b7d24c102022b`

File details

Details for the file textembedcompress-0.3.0-py3-none-any.whl.

File metadata

Download URL: textembedcompress-0.3.0-py3-none-any.whl
Upload date: Jun 5, 2025
Size: 24.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.9

File hashes

Hashes for textembedcompress-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e93725d07320ba44d5b103b3b3dbd0efe1a125e393269343883cbac3633e8c43`
MD5	`1ceed8c2aefaa7344414c9035c4bbfd7`
BLAKE2b-256	`dfc0584c898feb3941c3dbce435f2f5b9c3ac1bb897fbfd40a29629adaebeed9`
