Strategies for Efficient Data Embedding
Two approaches to generating optimized embeddings.
- **Creating Embeddings Optimized for Accuracy**: if you’re optimizing for accuracy, a good practice is to first summarize the entire document and store the summary text together with its embedding. Then split the rest of the document into overlapping chunks and store each chunk together with its embedding.
- **Creating Embeddings Optimized for Storage**: if you’re optimizing for storage space, chunk the document, summarize each chunk, concatenate the summaries, and create a single embedding for the combined summary.
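The two strategies above can be sketched in plain Python. This is a minimal, hypothetical illustration: `summarize` and `embed` are stand-in stubs (a real implementation would call an LLM and an embedding model), and the function names are not part of this package's API.

```python
def summarize(text: str, max_words: int = 10) -> str:
    # Stub summarizer: keeps the first few words. A real version would call an LLM.
    return " ".join(text.split()[:max_words])

def embed(text: str) -> list[float]:
    # Stub embedding: a toy 2-dimensional vector. A real version would call an embedding API.
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Split text into fixed-size chunks, optionally overlapping.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def optimize_for_accuracy(document: str) -> list[dict]:
    # Accuracy strategy: one record for the whole-document summary,
    # plus one record per overlapping chunk.
    summary = summarize(document)
    records = [{"text": summary, "embedding": embed(summary)}]
    for piece in chunk(document):
        records.append({"text": piece, "embedding": embed(piece)})
    return records

def optimize_for_storage(document: str) -> list[dict]:
    # Storage strategy: summarize each chunk, concatenate the summaries,
    # and embed the combined summary once.
    combined = " ".join(summarize(piece) for piece in chunk(document, overlap=0))
    return [{"text": combined, "embedding": embed(combined)}]
```

The accuracy strategy produces many records (better recall at query time), while the storage strategy collapses the whole document into a single record (far less to store and search).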
Example
import os
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from embedding_optimizer.optimizer import EmbeddingOptimizer
# Set your OpenAI API Key
os.environ['OPENAI_API_KEY'] = ''
# Load your document
raw_document = TextLoader('test_data.txt').load()
# If your document is long, you might want to split it into chunks
text_splitter = CharacterTextSplitter(separator=".", chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_document)
embedding_optimizer = EmbeddingOptimizer(openai_api_key='')
# Choose one strategy: optimized_documents_for_storage or optimized_documents_for_accuracy
# documents_optimizer = embedding_optimizer.optimized_documents_for_storage(raw_document[0].page_content, documents)
documents_optimizer = embedding_optimizer.optimized_documents_for_accuracy(raw_document[0].page_content, documents)
# Embed the document chunks and the summary
embedding_model = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
db = FAISS.from_documents(documents_optimizer, embedding_model)
# query it
query = "What motivated Alex to create the Function of Everything (FoE)?"
docs = db.similarity_search(query)
print(docs[0].page_content)
Issues
Feel free to submit issues and enhancement requests.
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for embedding_optimizer-0.0.1.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 540a6642d2eaa78d5781b1534501b6c5ae352ae74fc309798eba444337055cc8 |
| MD5 | 32cee2fcced3055d2360d6facaccdc6e |
| BLAKE2b-256 | 374d68f23345ac1df03c8c570b131635d58a5b3c75ab2a8e49a9f0c433f80f05 |
Hashes for embedding_optimizer-0.0.1-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 964a6a2b01a17186c24eb8b8fa1ebbec346ebd0cc677ea6001029f1b592370c0 |
| MD5 | 2b8da74b856db7f8c7a2dcdb3ccc6603 |
| BLAKE2b-256 | c2359b02426a7a27bcce5e040f7117274b40ee0c0ccdb30e7d6772bfb2770313 |