A library for safe document storage and vectorization.
Text Vectorizer Library
Text Vectorizer is a Python library for indexing and vectorizing text using either TF-IDF Vectorization or Model Embeddings. It vectorizes text documents so they can be searched and compared, which makes it suitable for applications such as text search, content recommendation, and text similarity analysis.
Features
- Supports two primary vectorization methods: TF-IDF Vectorization and Model Embeddings.
- Visualize text embeddings in 2D using PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding).
- Search for similar text documents or chunks within your corpus.
- Load and save vectorized documents for future use.
Installation
You can install the Text Vectorizer library using pip:
pip install safe_store
Getting Started
Initializing the Text Vectorizer
To start using the Text Vectorizer, you'll need to initialize it with the desired vectorization method, provide an optional model (in case of using embeddings), and specify a path for the database to save your vectorized documents.
from text_vectorizer import TextVectorizer, VectorizationMethod

# Initialize the Text Vectorizer
vectorizer = TextVectorizer(
    vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,  # Choose your preferred method
    model=None,  # Provide your model (if using model embeddings)
    database_path="text_db.json",  # Specify the path for the database
    save_db=False,  # Set to True to save the database
    data_visualization_method="PCA"  # Choose your visualization method (PCA or t-SNE)
)
Adding Documents
You can add documents to the Text Vectorizer using the add_document method. Specify the document name, the text content, the chunk size, and the overlap size.
# Add a document
vectorizer.add_document(
    document_name="example.txt",
    text="This is an example document. It can be longer and contain multiple paragraphs.",
    chunk_size=100,  # Set the chunk size for text decomposition
    overlap_size=20  # Set the overlap size between chunks
)
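To illustrate what chunk_size and overlap_size control, here is a minimal sketch of overlapping chunking. It is not the library's implementation: the helper chunk_text is hypothetical, and it assumes sizes are measured in characters, which the library may or may not do (it could count tokens or words instead).

```python
def chunk_text(text, chunk_size, overlap_size):
    """Split text into windows of chunk_size characters that overlap by overlap_size.

    Hypothetical helper for illustration only; units are assumed to be characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


example_chunks = chunk_text("abcdefghij", chunk_size=4, overlap_size=2)
```

With a chunk size of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one, so content that straddles a chunk boundary still appears whole in at least one chunk.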
Indexing Documents
To enable searching and analysis, you need to index your documents using the index method.
# Index the documents
vectorizer.index()
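For readers unfamiliar with TF-IDF, the sketch below shows the textbook computation on a list of chunks using only the standard library. It is an illustration under stated assumptions, not what index does internally: the function tfidf_vectors is hypothetical, tokenization is a plain lowercase whitespace split, and the weighting is the unsmoothed tf * log(N / df) form, which differs from the smoothed variants many libraries use.

```python
import math
from collections import Counter


def tfidf_vectors(chunks):
    """Return (vocabulary, vectors) with one TF-IDF vector per chunk.

    Hypothetical helper: unsmoothed tf * log(N / df), whitespace tokenization.
    """
    tokenized = [chunk.lower().split() for chunk in chunks]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_chunks = len(tokenized)

    # Document frequency: the number of chunks that contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty chunks
        vectors.append([
            (counts[term] / total) * math.log(n_chunks / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["cat sat", "cat ran"])
```

A term that occurs in every chunk (here "cat") receives weight zero, while terms unique to one chunk receive a positive weight, which is what lets TF-IDF separate chunks by their distinctive vocabulary.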
Visualizing Text Embeddings
You can visualize the text embeddings in 2D using PCA or t-SNE with the show_document method. Pass a query text (optional), specify a path to save the visualization (optional), and set show_interactive_form to True if you want to display an interactive plot.
# Visualize the embeddings
vectorizer.show_document(
    query_text="Query text (optional)",
    save_fig_path="scatter_plot.png",  # Specify the path to save the visualization
    show_interactive_form=True  # Set to True to display an interactive plot
)
Searching for Similar Text
You can retrieve the text chunks most similar to a query using the embed_query and recover_text methods. Provide a query text, and the library will return similar text chunks, together with their similarity scores, based on embeddings.
# Embed the query text
query_embedding = vectorizer.embed_query("Query text")
# Retrieve similar text documents (top_k specifies the number of similar documents to retrieve)
similar_texts, similarities = vectorizer.recover_text(query_embedding, top_k=3)
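The retrieval step can be pictured as ranking stored chunk vectors by their similarity to the query vector. The sketch below assumes cosine similarity as the metric, which is a common choice but is not confirmed by this README; the functions cosine_similarity and top_k_similar are hypothetical and only mirror the shape of the values returned by recover_text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, chunk_vecs, chunk_texts, top_k=3):
    """Return (texts, similarities) for the top_k chunks closest to query_vec.

    Hypothetical helper; assumes cosine similarity as the ranking metric.
    """
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(chunk_vecs, chunk_texts)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = scored[:top_k]
    return [text for _, text in best], [score for score, _ in best]


texts, scores = top_k_similar(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    ["first chunk", "second chunk", "third chunk"],
    top_k=2,
)
```

Here the first chunk points in the same direction as the query and ranks first, the third chunk shares one component and ranks second, and the orthogonal second chunk is left out because top_k is 2.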
Clearing the Database
If needed, you can clear the database using the clear_database method. This removes all indexed documents and resets the Text Vectorizer.
# Clear the database
vectorizer.clear_database()
Author
- ParisNeo
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
For more detailed usage and options, refer to the documentation.
Hashes for safe_store-0.1.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 8627daa19a1ed0753f6e85563623e3863100e3814ff7cb8a3052bbfdc78a1068
MD5 | 8509a396eb527860a1a5b010cefac284
BLAKE2b-256 | 6686c6b836df488bd203c182b0a8ee4e47cb5b55a5a6348ab2fea6455c38d4a0