A library for safe document storage and vectorization.

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Programming Language

Project description

Text Vectorizer Library

Text Vectorizer is a Python library for indexing and vectorizing text with either of two methods: TF-IDF vectorization or model embeddings. It is suited to applications such as text search, content recommendation, and text similarity analysis.

Features

- Supports two primary vectorization methods: TF-IDF Vectorization and Model Embeddings.
- Visualizes text embeddings in 2D using PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding).
- Searches for similar text documents or chunks within your corpus.
- Loads and saves vectorized documents for future use.

Installation

Install the library with pip. The package is published on PyPI as safe_store:

pip install safe_store

Getting Started

Initializing the Text Vectorizer

To start using the Text Vectorizer, initialize it with the desired vectorization method, an optional model (needed only when using model embeddings), and a path for the database that stores your vectorized documents.

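# Assumption: the module ships inside the safe_store package installed above;
# adjust the import path if your installed version exposes it differently.
from safe_store.text_vectorizer import TextVectorizer, VectorizationMethod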

# Initialize the Text Vectorizer
vectorizer = TextVectorizer(
    vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,  # Choose your preferred method
    model=None,  # Provide your model (if using model embeddings)
    database_path="text_db.json",  # Specify the path for the database
    save_db=False,  # Set to True to save the database
    data_visualization_method="PCA"  # Choose your visualization method (PCA or t-SNE)
)

Adding Documents

You can add documents to the Text Vectorizer using the add_document method. Specify the document name, the text content, chunk size, and overlap size.

# Add a document
vectorizer.add_document(
    document_name="example.txt",
    text="This is an example document. It can be longer and contain multiple paragraphs.",
    chunk_size=100,  # Set the chunk size for text decomposition
    overlap_size=20  # Set the overlap size between chunks
)
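
To make the interaction between chunk_size and overlap_size concrete, here is a minimal sketch of overlapping chunking. It is not the library's implementation: this README does not say whether sizes are counted in characters, words, or tokens, so the sketch assumes whitespace-separated words, and chunk_words is a hypothetical helper name.

```python
def chunk_words(text: str, chunk_size: int, overlap_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share overlap_size words. Counting in words is an
    assumption made for illustration; the library may count differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    words = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_words("one two three four five six seven", chunk_size=4, overlap_size=2)
for chunk in chunks:
    print(chunk)
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same text.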

Indexing Documents

To enable searching and analysis, you need to index your documents using the index method.

# Index the documents
vectorizer.index()
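
For intuition about what TF-IDF indexing produces, the following standard-library sketch builds one weighted vector per chunk. It is an illustration under stated assumptions, not the library's code: tfidf_index is a hypothetical helper, tokenization is a plain lowercase whitespace split, the inverse document frequency uses the common smoothed form ln((1 + n) / (1 + df)) + 1, and no vector normalization is applied.

```python
import math
from collections import Counter


def tfidf_index(chunks: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return the vocabulary and one TF-IDF vector per chunk."""
    tokenized = [chunk.lower().split() for chunk in chunks]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_chunks = len(tokenized)

    # Document frequency: in how many chunks each term appears at least once.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    # Smoothed inverse document frequency: rare terms receive larger weights.
    idf = {
        term: math.log((1 + n_chunks) / (1 + doc_freq[term])) + 1
        for term in vocabulary
    }

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this chunk
        vectors.append([counts[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf_index(["the cat sat", "the dog sat", "a bird flew"])
print(vocabulary)
print(vectors[0])
```

In the first chunk, "cat" receives a higher weight than "the" because it appears in fewer chunks of the corpus.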

Visualizing Text Embeddings

You can visualize the text embeddings in 2D using PCA or t-SNE with the show_document method. Pass a query text (optional), specify a path to save the visualization (optional), and set show_interactive_form to True if you want to display an interactive plot.

# Visualize the embeddings
vectorizer.show_document(
    query_text="Query text (optional)",
    save_fig_path="scatter_plot.png",  # Specify the path to save the visualization
    show_interactive_form=True  # Set to True to display an interactive plot
)

Searching for Similar Text

To retrieve the text chunks most similar to a query, embed the query with embed_query and pass the resulting embedding to recover_text. The call returns the closest chunks together with their similarity values.

# Embed the query text
query_embedding = vectorizer.embed_query("Query text")

# Retrieve similar text documents (top_k specifies the number of similar documents to retrieve)
similar_texts, similarities = vectorizer.recover_text(query_embedding, top_k=3)
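
The ranking step behind a top_k search can be sketched as follows. This README does not state which similarity measure the library uses, so cosine similarity is an assumption here, and cosine_similarity and top_k_similar are hypothetical helpers rather than library functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, vectors, texts, top_k=3):
    """Return the top_k texts closest to the query and their similarities."""
    scored = [(cosine_similarity(query, vec), text) for vec, text in zip(vectors, texts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    best = scored[:top_k]
    return [text for _, text in best], [score for score, _ in best]


texts = ["chunk a", "chunk b", "chunk c"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
found, scores = top_k_similar([1.0, 0.0], vectors, texts, top_k=2)
print(found, scores)
```

The query and the stored vectors must come from the same vectorization method, otherwise their dimensions and weights are not comparable.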

Clearing the Database

If needed, you can clear the database using the clear_database method. This removes all indexed documents and resets the Text Vectorizer.

# Clear the database
vectorizer.clear_database()

Author

ParisNeo

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

For more detailed usage and options, refer to the documentation.


Release history

- 0.7.2 (May 4, 2024)
- 0.7.1 (Apr 20, 2024)
- 0.7.0 (Apr 20, 2024)
- 0.6.6 (Apr 19, 2024)
- 0.6.5 (Apr 14, 2024)
- 0.6.0 (Apr 14, 2024)
- 0.4.0 (Dec 14, 2023)
- 0.3.5 (Dec 5, 2023)
- 0.3.2 (Nov 28, 2023)
- 0.3.1 (Nov 27, 2023)
- 0.3.0 (Nov 26, 2023)
- 0.2.6 (Oct 30, 2023)
- 0.2.4 (Oct 29, 2023)
- 0.2.3 (Oct 22, 2023)
- 0.2.2 (Oct 15, 2023)
- 0.2.1 (Oct 10, 2023)
- 0.2.0 (Oct 9, 2023)
- 0.1.9 (Oct 8, 2023)
- 0.1.8 (Oct 8, 2023)
- 0.1.7 (Oct 7, 2023)
- 0.1.6 (Oct 7, 2023)
- 0.1.5 (Oct 7, 2023)
- 0.1.4 (Oct 7, 2023)
- 0.1.3 (Oct 7, 2023)
- 0.1.2 (Oct 7, 2023)
- 0.1.1 (Oct 7, 2023), this version
- 0.1.0 (Oct 7, 2023)

Download files

Source Distribution

safe_store-0.1.1.tar.gz (15.2 kB)

Uploaded Oct 7, 2023 Source

Built Distribution

safe_store-0.1.1-py3-none-any.whl (14.7 kB)

Uploaded Oct 7, 2023 Python 3

Hashes for safe_store-0.1.1.tar.gz

Algorithm	Hash digest
SHA256	`171502f95f127a32c1cc5476bb16a92c97d12563bbc0a44706d5bfe565126203`
MD5	`d878021db598433925d71173b6562ed9`
BLAKE2b-256	`2250fe41357aacd5741512b951e01da9d18c71446ab9ec5cee0d7af39abffb2c`

Hashes for safe_store-0.1.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`ddbbc8fb76b29653d8d88356fb3f8ab93cf2df41aaf728fab244302987e914ec`
MD5	`325e46cf324be033f62757f4a13576e9`
BLAKE2b-256	`a725353ddc8fe8e24d228c50c68f54f091e6f4ee4ba1929a6a06d712952a8f9e`