A library for safe document storage and vectorization.
safe_store
safe_store Library - TextVectorizer Class
The `TextVectorizer` class is part of the `safe_store` library, which is available on PyPI and released under the Apache 2.0 license. The class vectorizes text using methods such as TF-IDF vectorization or model-based embedding. It also offers features for document decomposition, visualization, and querying.
Installation
You can install the `safe_store` library using pip:
pip install safe_store
Usage
To use the `TextVectorizer` class, import it from the library and create an instance. For example:
from safe_store import TextVectorizer, VectorizationMethod
# Create an instance of TextVectorizer
vectorizer = TextVectorizer(
    vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,
    database_path="database.json",
    save_db=True,
    visualize_data_at_startup=True,
    visualize_data_at_add_file=True,
    visualize_data_at_generate=True,
)
# Add a document for vectorization
document_name = "example.txt"
text = "This is an example document for vectorization."
vectorizer.add_document(document_name, text, chunk_size=100, overlap_size=20, force_vectorize=False, add_as_a_bloc=False)
# Index the documents (perform vectorization)
vectorizer.index()
# Embed a query and retrieve similar documents
query_text = "vectorization"
query_embedding = vectorizer.embed_query(query_text)
similar_texts, _ = vectorizer.recover_text(query_embedding, top_k=3)
print("Similar Documents:")
for i, similar_text in enumerate(similar_texts, start=1):
    print(f"{i}: {similar_text}")
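The `chunk_size` and `overlap_size` arguments passed to `add_document` suggest that a document is split into overlapping chunks before vectorization. The sketch below illustrates that general idea on whitespace-separated tokens. It is not the library's implementation: `chunk_tokens` is a hypothetical helper, and whether `safe_store` counts tokens, words, or characters is an assumption made here for illustration.

```python
def chunk_tokens(text, chunk_size, overlap_size):
    """Split text into chunks of at most chunk_size whitespace tokens.

    Consecutive chunks share overlap_size tokens. Hypothetical helper,
    not part of safe_store.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks


chunks = chunk_tokens("one two three four five six seven", chunk_size=4, overlap_size=2)
for chunk in chunks:
    print(chunk)
```

A larger overlap keeps more context shared between neighbouring chunks, at the cost of storing and vectorizing more of them.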
Constructor Parameters
- `vectorization_method`: The vectorization method to use. Options are `VectorizationMethod.MODEL_EMBEDDING` or `VectorizationMethod.TFIDF_VECTORIZER`.
- `model`: A model instance to use for model-based embedding (required if `vectorization_method` is `VectorizationMethod.MODEL_EMBEDDING`).
- `database_path`: Path to the JSON database file where vectorized data is stored.
- `save_db`: Boolean that determines whether vectorized data is saved to the database file.
- `visualize_data_at_startup`: Boolean to enable visualization of data at startup.
- `visualize_data_at_add_file`: Boolean to enable visualization of data when adding a file.
- `visualize_data_at_generate`: Boolean to enable visualization of data when generating embeddings.
- `data_visualization_method`: The visualization method for data. Options are "PCA" or "t-SNE".
- `database_dict`: Optional dictionary used to initialize the `TextVectorizer` state from a previous session.
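The `database_path`, `save_db`, and `database_dict` parameters describe state that is persisted to, and restored from, a JSON file. As a rough illustration of that round trip, the sketch below uses only the standard `json` module. The `save_state` and `load_state` helpers and the layout of `state` are hypothetical, and do not reflect the on-disk schema `safe_store` actually uses.

```python
import json
import os
import tempfile


def save_state(state, database_path):
    """Write a JSON-serializable state dictionary to database_path."""
    with open(database_path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def load_state(database_path):
    """Read the state back, or return an empty dict if no file exists yet."""
    if not os.path.exists(database_path):
        return {}
    with open(database_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical state layout, chosen only for this example.
state = {
    "chunks": ["first chunk", "second chunk"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "database.json")
    save_state(state, path)
    restored = load_state(path)
    missing = load_state(os.path.join(tmp_dir, "absent.json"))

print(restored == state)
```

Embeddings held as NumPy arrays would need converting to plain lists before `json.dump` can serialize them.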
Methods
- `add_document`: Add a document for vectorization.
- `index`: Index the documents to perform vectorization.
- `embed_query`: Embed a query text for similarity search.
- `recover_text`: Retrieve similar documents based on a query embedding.
- `show_document`: Visualize the data and embeddings.
- `file_exists`: Check if a document file already exists in the database.
- `remove_document`: Remove a document from the database.
- `toJson`: Serialize the current state of the `TextVectorizer` to JSON.
- `setVectorizer`: Set the vectorizer using a dictionary representation.
- `save_to_json`: Save the current state to a JSON file.
- `load_from_json`: Load vectorized documents and state from a JSON file.
- `clear_database`: Clear the database and reset the `TextVectorizer` instance.
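Taken together, `index`, `embed_query`, and `recover_text` describe a vectorize-then-rank pipeline. The sketch below shows one common way such a pipeline can work, using TF-IDF weights and cosine similarity in pure Python. It is an illustration under stated assumptions rather than the library's code: every function name here is hypothetical, and the smoothed IDF formula is one of several conventional choices.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms unknown to idf are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def build_tfidf_index(chunks):
    """Compute IDF weights and one TF-IDF vector per chunk."""
    tokenized = [chunk.lower().split() for chunk in chunks]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms present in every chunk keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, idf, vectors, k=3):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = tfidf_vector(query.lower().split(), idf)
    scored = sorted(
        ((cosine(query_vec, vec), chunk) for vec, chunk in zip(vectors, chunks)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [chunk for _, chunk in scored[:k]]


chunks = [
    "vectorization turns text into numeric vectors",
    "the database is stored as a json file",
    "cosine similarity compares two vectors",
]
idf, vectors = build_tfidf_index(chunks)
results = top_k("text vectorization", chunks, idf, vectors, k=2)
print(results[0])
```

With model-based embedding, the TF-IDF step would be replaced by a call to the embedding model, while the ranking by similarity would stay conceptually the same.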
License
This library is released under the Apache 2.0 license. See the LICENSE file for more details.
Contributing
We welcome contributions! If you find issues or have suggestions for improvements, please open an issue or create a pull request on the GitHub repository.
Make sure to replace `"example.txt"` and `"vectorization"` with your own document and query text for testing.
This README.md provides an overview of how to use the `TextVectorizer` class from the `safe_store` library and includes important information about installation, constructor parameters, methods, and licensing. Users can refer to this documentation to understand and utilize the functionality provided by the `TextVectorizer` class.
Hashes for safe_store-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7063f4103941999595e02fa3ec7e4b09650438b6a488d40b754e2653be809195
MD5 | d72116bb030a43cb9589458d9edf4f30
BLAKE2b-256 | a3aeec240aebc4dcae7b433d7dd282ea83dd7ce26507023afc19f110e5ee8c15