A module for generating embeddings for batches of texts using a SentenceTransformer model.
EmbeddingGenerator
Documentation
Overview
The EmbeddingGenerator class is designed to efficiently generate embeddings for a list of input texts using a model such as SentenceTransformer. It manages the process of splitting texts into manageable chunks, embedding them while respecting token limits, and saving the results incrementally to avoid memory issues. The class is optimized to handle large datasets by processing texts in batches and managing resources effectively.
Usage Example
Here's how you might use the EmbeddingGenerator:
from sentence_transformers import SentenceTransformer
from embedding_generator import EmbeddingGenerator  # import path assumed to match the package name
# Initialize your model
model = SentenceTransformer('all-MiniLM-L6-v2') # Replace with your model
# Define model settings
model_settings = {
'convert_to_tensor': True,
'device': 'cuda', # or 'cpu'
'show_progress_bar': False
}
# Create an instance of EmbeddingGenerator
embedding_generator = EmbeddingGenerator(model, model_settings, save_path='data')
# Prepare your texts as an iterator
texts = iter([
"This is the first text.",
"Here's the second text.",
# ... add more texts
])
# Generate embeddings
embeddings = embedding_generator(texts)
# Output embeddings
print(embeddings)
What Happens During Execution
- Initialization:
  - The EmbeddingGenerator is initialized with a model, model settings, and a save path.
  - It sets up internal structures for managing texts, embeddings, and progress tracking.
- Text Loading and Memory Management:
  - Texts are loaded from the provided iterator using the fill_texts method.
  - The class dynamically loads texts while monitoring memory usage to prevent exceeding max_memory_usage.
- Text Chunking:
  - Texts are split into chunks using RecursiveCharacterTextSplitter based on the model's max_seq_length.
  - The splitter ensures chunks are appropriately sized for the model to process efficiently.
- Token Counting:
  - The TokenCounter estimates the number of tokens in each chunk.
  - This information is used to manage batch sizes and ensure they fit within token limits.
- Batch Selection:
  - The find_best_chunks method selects chunks to process in the next batch, maximizing batch sizes without exceeding limits.
  - Chunks are sorted and selected based on their token counts.
- Embedding Generation:
  - The embed method processes the selected chunks using the model.
  - Embeddings are generated and associated with their respective chunks.
- Error Handling and Token Limit Adjustment:
  - If a RuntimeError occurs (e.g., an out-of-memory error), the fail method adjusts the token limit to prevent future errors.
  - Successful batches inform the succeed method, which updates the token limit estimator positively (see the sketch after this list).
- Saving Progress:
  - Embeddings and metadata are saved incrementally using the save_data method.
  - Data is saved per text to individual files to avoid loading large JSON files entirely.
- Resource Cleanup:
  - Completed texts are removed from memory using the remove_completed_texts method.
  - This ensures efficient memory usage throughout the process.
- Final Output Generation:
  - Upon completion, load_average_embeddings_with_fallback is called to compile the average embeddings for each text.
  - The output is a dictionary mapping each text to its average embedding, or None if unavailable.
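To make the fail/succeed mechanism above concrete, here is a minimal sketch of an adaptive token-limit estimator. The class name, constructor arguments, and update rules are illustrative assumptions, not the package's actual TokenLimitEstimator implementation:

class SimpleTokenLimitEstimator:
    # Toy stand-in for the token limit estimator described above (illustrative only).
    def __init__(self, initial_limit=1024, min_limit=64):
        self.limit = initial_limit
        self.min_limit = min_limit

    def succeed(self, batch_tokens):
        # A batch of this size embedded successfully, so let the limit creep upward.
        self.limit = max(self.limit, int(batch_tokens * 1.1))

    def fail(self, batch_tokens):
        # The attempted batch raised a RuntimeError (e.g., CUDA out of memory); back off.
        self.limit = max(self.min_limit, int(batch_tokens * 0.5))

def embed_with_feedback(model, chunks, model_settings, limiter, batch_tokens):
    # Illustrative wrapper: embed one batch and report the outcome to the limiter.
    try:
        embeddings = model.encode(chunks, **model_settings)
    except RuntimeError:
        limiter.fail(batch_tokens)
        return None
    limiter.succeed(batch_tokens)
    return embeddings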
Output
The output of the EmbeddingGenerator is a dictionary where each key is an input text and the value is one of the following:
- List of floats: The average embedding for the text, represented as a list of floats.
- None: Indicates that the embedding for the text could not be generated or is missing.
Example Output
{
"This is the first text.": [0.234, -0.987, 0.123, ...], # Embedding vector
"Here's the second text.": [0.456, -0.654, 0.789, ...] # Embedding vector
}
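Because some values may be None, callers typically separate usable embeddings from missing ones before further processing. A minimal example, assuming embeddings is the dictionary returned above:

# Split usable embeddings from texts whose embedding could not be generated.
vectors = {text: emb for text, emb in embeddings.items() if emb is not None}
missing = [text for text, emb in embeddings.items() if emb is None]

print(f"Embedded {len(vectors)} texts; {len(missing)} missing.")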
Notes for Users
- File Structure:
  - The EmbeddingGenerator saves data in a structured directory:
    data/
    ├── embeddings_index.json
    └── embeddings_data/
        ├── <text_id1>.json
        └── <text_id2>.json
  - Each text's data is saved in a separate JSON file, preventing the need to load large files into memory (a small inspection sketch follows this list).
- Memory Efficiency:
  - Designed to handle large datasets by managing memory usage and saving progress incrementally.
  - Texts are removed from memory once processed to conserve resources.
- Resumable Processing:
  - If the process is interrupted, it can be resumed, and the class will continue from where it left off, avoiding recomputation.
- GPU Utilization:
  - Attempts to maximize GPU utilization by processing large batches without exceeding memory limits.
  - Adjusts batch sizes dynamically based on successful and failed attempts.
- Error Handling:
  - Handles out-of-memory errors gracefully by adjusting token limits and retrying with smaller batches.
- Missing Data:
  - If any embeddings are missing, the output dictionary will contain None for those texts.
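To sanity-check partial progress, the save directory can be inspected directly. The snippet below assumes only the layout shown under File Structure; the internal schema of the JSON files is not documented here, so their contents are loaded generically:

import json
from pathlib import Path

save_path = Path("data")

# The index file tracks processed texts (its exact schema is not assumed here).
index = json.loads((save_path / "embeddings_index.json").read_text())
print(f"Index entries: {len(index)}")

# Each text's data lives in its own file under embeddings_data/.
for per_text_file in sorted((save_path / "embeddings_data").glob("*.json"))[:3]:
    data = json.loads(per_text_file.read_text())
    print(per_text_file.name, type(data).__name__)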
Advanced Usage and Customization
Adjusting Chunk Size
- By default, the chunk size is set based on the model's max_seq_length.
- You can customize the chunk size if needed:
embedding_generator.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,  # Desired chunk size
    chunk_overlap=20,
    length_function=embedding_generator.token_counter,
)
Handling Large Batches
- Increase the initial token limit to allow larger batches:
embedding_generator.limit_estimator = TokenLimitEstimator(initial_limit=2048)
- Adjust the model settings to change the batch size:
model_settings = {
    'convert_to_tensor': True,
    'device': 'cuda',
    'show_progress_bar': False,
    'batch_size': 64  # Adjust as per your GPU capacity
}
Monitoring Progress
- The class uses tqdm to display a progress bar during processing.
- You can access or customize it via embedding_generator.progress_bar, as shown below.
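For instance, a custom bar could be supplied before running the generator. The attribute name comes from the note above; the total and update cadence are handled internally, so this is only a cosmetic tweak and should be treated as a sketch:

from tqdm import tqdm

# Hypothetical customization: replace the default bar with one carrying a custom label.
embedding_generator.progress_bar = tqdm(desc="Embedding texts", unit="chunk")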
Running Embedding Generation from an Input File
The run_embedding.py script accepts an input file containing texts in various formats.
Supported Input File Formats
- Plain Text (.txt): Each line is treated as a separate text.
- JSON (.json): The file can contain a list or dictionary of texts.
- CSV (.csv): Each row's first column is treated as a text.
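For orientation, the formats above correspond to parsing logic roughly like the following. This is an illustrative sketch of the expected behavior, not the actual code in run_embedding.py:

import csv
import json
from pathlib import Path

def load_texts(path):
    # Illustrative loader matching the formats listed above.
    path = Path(path)
    if path.suffix == ".txt":
        # Each non-empty line is a separate text.
        return [line for line in path.read_text().splitlines() if line.strip()]
    if path.suffix == ".json":
        data = json.loads(path.read_text())
        # A list of texts, or a dictionary whose values are the texts.
        return list(data.values()) if isinstance(data, dict) else list(data)
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            # The first column of each row is treated as a text.
            return [row[0] for row in csv.reader(f) if row]
    raise ValueError(f"Unsupported file extension: {path.suffix}")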
Example Usage
python scripts/run_embedding.py --input-file "path/to/texts.json" --model-path "path/to/model" --save-path "data" --device "cuda"
Command-Line Options
- --input-file or -i: Path to the input file.
- --model-path or -m: Path to the SentenceTransformer model.
- --save-path or -o: Directory where embeddings will be saved. Defaults to data.
- --device or -d: Device to use ('cpu' or 'cuda'). Defaults to 'cpu'.
Notes
- Ensure the input file is properly formatted according to its extension.
- The embeddings are saved incrementally in the specified save path.
- The script handles large datasets efficiently, but ensure sufficient disk space is available.
Conclusion
The EmbeddingGenerator is a robust tool for generating embeddings for large datasets, designed with efficiency and scalability in mind. By managing resources effectively, handling errors gracefully, and providing mechanisms for customization, it ensures that embedding generation tasks can be performed reliably, even with extensive datasets.