A module for generating embeddings for batches of texts using a SentenceTransformer model.
EmbeddingGenerator Documentation
Overview
The EmbeddingGenerator class is designed to efficiently generate embeddings for a list of input texts using a model such as SentenceTransformer. It manages the process of splitting texts into manageable chunks, embedding them while considering token limits, and saving the results incrementally to avoid memory issues. The class is optimized to handle large datasets by processing texts in batches and managing resources effectively.
Usage Example
Here's how you might use the EmbeddingGenerator:
```python
from sentence_transformers import SentenceTransformer

from embedding_generator import EmbeddingGenerator  # adjust the import path to your install

# Initialize your model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Replace with your model

# Define model settings
model_settings = {
    'convert_to_tensor': True,
    'device': 'cuda',  # or 'cpu'
    'show_progress_bar': False,
}

# Create an instance of EmbeddingGenerator
embedding_generator = EmbeddingGenerator(model, model_settings, save_path='data')

# Prepare your texts as an iterator
texts = iter([
    "This is the first text.",
    "Here's the second text.",
    # ... add more texts
])

# Generate embeddings
embeddings = embedding_generator(texts)

# Output embeddings
print(embeddings)
```
What Happens During Execution
- Initialization: The `EmbeddingGenerator` is initialized with a model, model settings, and a save path. It sets up internal structures for managing texts, embeddings, and progress tracking.
- Text Loading and Memory Management: Texts are loaded from the provided iterator via the `fill_texts` method, which loads dynamically while monitoring memory usage to stay below `max_memory_usage`.
- Text Chunking: Texts are split into chunks using `RecursiveCharacterTextSplitter`, sized according to the model's `max_seq_length` so each chunk can be processed efficiently.
- Token Counting: The `TokenCounter` estimates the number of tokens in each chunk; these counts are used to manage batch sizes and keep them within token limits.
- Batch Selection: The `find_best_chunks` method selects the chunks for the next batch, sorting them by token count to maximize batch size without exceeding the current limit.
- Embedding Generation: The `embed` method runs the model on the selected chunks, and the resulting embeddings are associated with their respective chunks.
- Error Handling and Token Limit Adjustment: If a `RuntimeError` occurs (e.g. an out-of-memory error), the `fail` method lowers the token limit to prevent repeat failures; successful batches call the `succeed` method, which nudges the token limit estimator upward.
- Saving Progress: Embeddings and metadata are saved incrementally via the `save_data` method. Data is written per text to individual files, so large JSON files never have to be loaded whole.
- Resource Cleanup: Completed texts are removed from memory with the `remove_completed_texts` method, keeping memory usage low throughout the run.
- Final Output Generation: Upon completion, `load_average_embeddings_with_fallback` compiles the average embedding for each text. The output is a dictionary mapping each text to its average embedding, or `None` if unavailable.
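The fail/succeed adjustment described above can be pictured as an additive-increase/multiplicative-decrease loop. The sketch below is illustrative only: `SimpleTokenLimitEstimator` is a hypothetical stand-in, and the package's actual `TokenLimitEstimator` may use a different update rule.

```python
class SimpleTokenLimitEstimator:
    """Illustrative stand-in for a token limit estimator.

    A failed batch (e.g. a CUDA out-of-memory RuntimeError) cuts the
    token limit multiplicatively; a successful batch raises it by a
    small additive step, so batch sizes grow back cautiously.
    """

    def __init__(self, initial_limit=1024, decrease_factor=0.5, increase_step=64):
        self.limit = initial_limit
        self.decrease_factor = decrease_factor
        self.increase_step = increase_step

    def fail(self):
        # Shrink quickly after an out-of-memory failure.
        self.limit = max(1, int(self.limit * self.decrease_factor))

    def succeed(self):
        # Grow slowly while batches keep succeeding.
        self.limit += self.increase_step
```

With these defaults, one failure at a limit of 1024 drops it to 512, and each subsequent success adds 64 back.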
Output
The output of the `EmbeddingGenerator` is a dictionary where each key is an input text and the value is one of the following:
- List of floats: the average embedding for the text.
- `None`: the embedding for the text could not be generated or is missing.
Example Output
```python
{
    "This is the first text.": [0.234, -0.987, 0.123, ...],  # Embedding vector
    "Here's the second text.": [0.456, -0.654, 0.789, ...]   # Embedding vector
}
```
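Each per-text vector in the example above is the average of that text's chunk embeddings. A minimal sketch of the averaging, with the documented `None` fallback for texts that produced no embeddings (`average_embedding` is an illustrative helper name, not the package's API):

```python
def average_embedding(chunk_embeddings):
    """Average per-chunk vectors into one text-level embedding.

    Returns None when there are no chunk embeddings, mirroring the
    documented fallback for missing data.
    """
    if not chunk_embeddings:
        return None
    n = len(chunk_embeddings)
    dim = len(chunk_embeddings[0])
    # Element-wise mean across all chunk vectors.
    return [sum(vec[i] for vec in chunk_embeddings) / n for i in range(dim)]
```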
Notes for Users
- File Structure: The `EmbeddingGenerator` saves data in a structured directory:

  ```
  data/
  ├── embeddings_index.json
  └── embeddings_data/
      ├── <text_id1>.json
      └── <text_id2>.json
  ```

  Each text's data is saved in a separate JSON file, so large files never need to be loaded into memory at once.
- Memory Efficiency: Designed to handle large datasets by managing memory usage and saving progress incrementally; texts are removed from memory once processed.
- Resumable Processing: If the process is interrupted, it can be resumed; the class continues from where it left off, avoiding recomputation.
- GPU Utilization: Maximizes GPU utilization by processing batches as large as memory allows, adjusting batch sizes dynamically based on successful and failed attempts.
- Error Handling: Handles out-of-memory errors gracefully by lowering token limits and retrying with smaller batches.
- Missing Data: If any embeddings are missing, the output dictionary contains `None` for those texts.
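Given the directory layout above, saved results can be read back with a few lines of standard-library code. This is a sketch under stated assumptions: `load_saved_embeddings` is a hypothetical helper, and the assumption that `embeddings_index.json` maps each text to its `text_id` (with each per-text file holding that text's data) is mine, not documented by the package.

```python
import json
from pathlib import Path


def load_saved_embeddings(save_path="data"):
    """Read back incrementally saved results from the documented layout.

    Assumes embeddings_index.json maps text -> text_id; texts whose
    per-text file is missing map to None.
    """
    root = Path(save_path)
    index = json.loads((root / "embeddings_index.json").read_text())
    results = {}
    for text, text_id in index.items():
        per_text = root / "embeddings_data" / f"{text_id}.json"
        # Missing files become None, matching the documented fallback.
        results[text] = json.loads(per_text.read_text()) if per_text.exists() else None
    return results
```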
Advanced Usage and Customization
Adjusting Chunk Size
- By default, the chunk size is set based on the model's `max_seq_length`.
- You can customize the chunk size if needed:

```python
embedding_generator.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,  # Desired chunk size
    chunk_overlap=20,
    length_function=embedding_generator.token_counter,
)
```
Handling Large Batches
- Increase the initial token limit to allow larger batches:

```python
embedding_generator.limit_estimator = TokenLimitEstimator(initial_limit=2048)
```

- Adjust the model settings to change the batch size:

```python
model_settings = {
    'convert_to_tensor': True,
    'device': 'cuda',
    'show_progress_bar': False,
    'batch_size': 64,  # Adjust as per your GPU capacity
}
```
Monitoring Progress
- The class uses `tqdm` to display a progress bar during processing.
- You can access or customize it via `embedding_generator.progress_bar`.
Running Embedding Generation from an Input File
The `run_embedding.py` script accepts an input file containing texts in several formats.
Supported Input File Formats
- Plain text (`.txt`): Each line is treated as a separate text.
- JSON (`.json`): The file can contain a list or a dictionary of texts.
- CSV (`.csv`): Each row's first column is treated as a text.
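The per-extension rules above can be sketched with standard-library parsing. `load_texts` is a hypothetical helper for illustration; `run_embedding.py`'s actual loading code may differ in naming and details.

```python
import csv
import json
from pathlib import Path


def load_texts(path):
    """Load input texts according to the documented per-extension rules."""
    path = Path(path)
    if path.suffix == ".txt":
        # One text per non-empty line.
        return [line.strip() for line in path.read_text().splitlines() if line.strip()]
    if path.suffix == ".json":
        # Either a list of texts or a dictionary of texts.
        data = json.loads(path.read_text())
        return list(data.values()) if isinstance(data, dict) else list(data)
    if path.suffix == ".csv":
        # First column of each row is the text.
        with path.open(newline="") as f:
            return [row[0] for row in csv.reader(f) if row]
    raise ValueError(f"Unsupported input format: {path.suffix}")
```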
Example Usage
```shell
python scripts/run_embedding.py --input-file "path/to/texts.json" --model-path "path/to/model" --save-path "data" --device "cuda"
```
Command-Line Options
- `--input-file` or `-i`: Path to the input file.
- `--model-path` or `-m`: Path to the SentenceTransformer model.
- `--save-path` or `-o`: Directory where embeddings will be saved. Defaults to `data`.
- `--device` or `-d`: Device to use (`'cpu'` or `'cuda'`). Defaults to `'cpu'`.
Notes
- Ensure the input file is properly formatted according to its extension.
- The embeddings are saved incrementally in the specified save path.
- The script handles large datasets efficiently, but ensure sufficient disk space is available.
Conclusion
The EmbeddingGenerator is a robust tool for generating embeddings for large datasets, designed with efficiency and scalability in mind. By managing resources effectively, handling errors gracefully, and providing mechanisms for customization, it ensures that embedding generation tasks can be performed reliably, even with extensive datasets.