A library for safe document storage and vectorization.

safe_store

safe_store Library - TextVectorizer Class

The TextVectorizer class is part of the safe_store library, which is available on PyPI and released under the Apache 2.0 license. It vectorizes text using either TF-IDF or model-based embedding, and it also supports document decomposition into chunks, visualization, and querying.

Installation

You can install the safe_store library using pip:

pip install safe_store

Usage

To use the TextVectorizer class, import it from the library and create an instance. The following example adds a document, indexes it, and runs a query:

from safe_store import TextVectorizer, VectorizationMethod

# Create an instance of TextVectorizer
vectorizer = TextVectorizer(
    vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,
    database_path="database.json",
    save_db=True,
    visualize_data_at_startup=True,
    visualize_data_at_add_file=True,
    visualize_data_at_generate=True
)

# Add a document for vectorization
document_name = "example.txt"
text = "This is an example document for vectorization."
vectorizer.add_document(
    document_name,
    text,
    chunk_size=100,
    overlap_size=20,
    force_vectorize=False,
    add_as_a_bloc=False,
)

# Index the documents (perform vectorization)
vectorizer.index()

# Embed a query and retrieve similar documents
query_text = "vectorization"
query_embedding = vectorizer.embed_query(query_text)
similar_texts, _ = vectorizer.recover_text(query_embedding, top_k=3)
print("Similar Documents:")
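# Use a separate loop variable so the document text defined above is not overwritten
for i, similar_text in enumerate(similar_texts, start=1):
    print(f"{i}: {similar_text}")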
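The chunk_size and overlap_size arguments passed to add_document suggest that documents are split into overlapping chunks before vectorization. The sketch below illustrates that general idea only; it is not the library's code. The helper name chunk_words is hypothetical, and counting in whitespace-separated words is an assumption (the library may count tokens or characters instead).

```python
def chunk_words(text, chunk_size=100, overlap_size=20):
    """Split text into word chunks; consecutive chunks share overlap_size words.

    Hypothetical helper for illustration, not part of safe_store.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Ten placeholder words, split into windows of 4 words that overlap by 1
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap_size=1)
for chunk in chunks:
    print(chunk)
```

A larger overlap keeps more shared context between neighbouring chunks, at the cost of storing and vectorizing more chunks.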

Constructor Parameters

  • vectorization_method: Specify the vectorization method to use. Options are VectorizationMethod.MODEL_EMBEDDING or VectorizationMethod.TFIDF_VECTORIZER.
  • model: Provide a model instance when using model-based embedding (required if vectorization_method is VectorizationMethod.MODEL_EMBEDDING).
  • database_path: Path to the JSON database file where vectorized data is stored.
  • save_db: Boolean to determine whether to save vectorized data to the database file.
  • visualize_data_at_startup: Boolean to enable visualization of data at startup.
  • visualize_data_at_add_file: Boolean to enable visualization of data when adding a file.
  • visualize_data_at_generate: Boolean to enable visualization of data when generating embeddings.
  • data_visualization_method: Specify the visualization method for data. Options are "PCA" or "t-SNE".
  • database_dict: Optional dictionary to initialize the TextVectorizer state from a previous session.
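For readers unfamiliar with the technique behind the TFIDF_VECTORIZER option, here is a minimal pure-Python sketch of TF-IDF weighting. It is a conceptual illustration, not safe_store's implementation: the function name tfidf_vectors, the whitespace tokenization, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Hypothetical helper for illustration, not part of safe_store.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (an assumed, commonly used variant)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        vectors.append([counts[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["the cat sat", "the dog sat", "the cat ran"])
print(vocabulary)
print(vectors[1])
```

Terms that occur in every document, such as "the" here, receive the lowest weight, while terms unique to one document, such as "dog", receive the highest.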

Methods

  • add_document: Add a document for vectorization.
  • index: Index the documents to perform vectorization.
  • embed_query: Embed a query text for similarity search.
  • recover_text: Retrieve similar documents based on a query embedding.
  • show_document: Visualize the data and embeddings.
  • file_exists: Check if a document file already exists in the database.
  • remove_document: Remove a document from the database.
  • toJson: Serialize the current state of the TextVectorizer to JSON.
  • setVectorizer: Set the vectorizer using a dictionary representation.
  • save_to_json: Save the current state to a JSON file.
  • load_from_json: Load vectorized documents and state from a JSON file.
  • clear_database: Clear the database and reset the TextVectorizer instance.
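The embed_query and recover_text methods together perform a similarity search. A common way to do this is to rank stored chunk vectors by cosine similarity to the query vector and keep the top_k best matches. The sketch below shows that approach under stated assumptions: the names cosine_similarity and top_k_similar are hypothetical, the use of cosine similarity is an assumption about the metric, and returning the scores as the second value is a guess (the usage example discards the second return value of recover_text, so its contents are not documented here).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vector, chunk_vectors, chunk_texts, top_k=3):
    """Return the top_k chunk texts most similar to the query, with their scores.

    Hypothetical helper for illustration, not part of safe_store.
    """
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for vector, text in zip(chunk_vectors, chunk_texts)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = scored[:top_k]
    return [text for _, text in best], [score for score, _ in best]


# Toy two-dimensional vectors standing in for real embeddings
chunk_texts = ["about cats", "about dogs", "about birds"]
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts, scores = top_k_similar([1.0, 0.0], chunk_vectors, chunk_texts, top_k=2)
print(texts)
```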

License

This library is released under the Apache 2.0 license. See the LICENSE file for more details.

Contributing

We welcome contributions! If you find issues or have suggestions for improvements, please open an issue or create a pull request on the GitHub repository.


When testing the usage example above, replace `"example.txt"` and `"vectorization"` with your own document name and query text.
