A library for safe document storage and vectorization.

safe_store

safe_store is an open-source Python library that provides tools for text data management, vectorization, and document retrieval. It combines a text vectorizer for indexing and searching documents with a generic data loader for reading common file formats.

Key Features:

1. Text Vectorizer

  • Versatile Vectorization: Convert text documents into numerical representations with TF-IDF vectorization or model-based embeddings, or use BM25 ranking for text retrieval.
  • Document Similarity: Find documents similar to a given query text, making it ideal for document retrieval tasks.
  • Interactive Visualization: Visualize document embeddings in a scatter plot to gain insights into document relationships.
  • No Authentication Required: Use the library without the need for API keys or authentication, making it accessible for everyone.
  • Commercially Usable: safe_store is 100% open-source and free to use, even for commercial purposes, under the Apache 2.0 License.
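
For readers unfamiliar with BM25, the following is a minimal, standard-library sketch of Okapi BM25 scoring. It is an illustration of the ranking idea only, not safe_store's implementation; the function name bm25_scores, the whitespace tokenization, and the k1 and b defaults are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative only)."""
    if not documents:
        return []
    # Naive whitespace tokenization; a real library may tokenize differently.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "rockets carry satellites into space",
    "submarines travel deep under the ocean",
    "new york is a large city",
]
scores = bm25_scores("satellites in space", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Documents that share no term with the query score 0.0, so in this toy corpus only the first document receives a positive score.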

2. Generic Data Loader

  • Multi-format Support: Read various file formats, including PDF, DOCX, JSON, HTML, and more.
  • Simplified Text Extraction: Convert file content to plain text or data structures with ease.
  • Efficient and Time-Saving: Streamline data loading and processing tasks, reducing the need for manual extraction.

What Can You Use safe_store For?

  • Text Document Analysis: Analyze and understand the content of text documents quickly and efficiently.
  • Document Retrieval: Retrieve documents similar to a given query text, facilitating content recommendation and search tasks.
  • Text Data Preprocessing: Prepare text data for natural language processing (NLP) tasks, such as sentiment analysis and text classification.
  • Data Loading: Streamline the process of reading and extracting content from various file formats.

safe_store is designed to be accessible, versatile, and free for all users. It's an ideal choice for developers, data scientists, and researchers who want a user-friendly and open-source solution for working with text data.


Explore the world of text data management and analysis with safe_store today!

Text Vectorizer

Features

  • Vectorize and index text documents.
  • Retrieve similar documents based on a query.
  • Supports both TF-IDF vectorization and model-based embeddings.
  • Interactive visualization of document embeddings.
  • No authentication or API keys required.
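
To make the idea of TF-IDF vectorization and similarity-based retrieval concrete, here is a small standard-library sketch. It is not how safe_store works internally; the helper names (tfidf_vectors, cosine, top_k), the whitespace tokenization, and the smoothed IDF formula are assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (term -> weight) per text, plus the IDF table."""
    tokenized = [text.lower().split() for text in texts]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n = len(texts)
    # Smoothed IDF so terms present in every document keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, texts, k=3):
    """Return the k texts most similar to the query, best first."""
    vectors, idf = tfidf_vectors(texts)
    query_vec = {
        term: count * idf[term]
        for term, count in Counter(query.lower().split()).items()
        if term in idf  # ignore words never seen in the corpus
    }
    ranked = sorted(
        zip(texts, vectors),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


chunks = [
    "space is a vacuum beyond the atmosphere",
    "submarines operate under water",
    "new york is a city",
]
print(top_k("what is space", chunks, k=1))
```

The query is vectorized with the same IDF table as the corpus, which is why words that never occur in the indexed texts are simply dropped.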

Installation

To install safe_store, you can use pip:

pip install safe_store

Getting Started

Initializing the Text Vectorizer

from safe_store import TextVectorizer, VectorizationMethod
from pathlib import Path

# Create an instance of TextVectorizer
vectorizer = TextVectorizer(
    vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,
    database_path="database.json",
    save_db=False
)

Adding and Indexing Documents

# Add documents for vectorization
documents = ["llm", "space", "submarines", "new york"]
for doc in documents:
    document_name = Path(__file__).parent / f"{doc}.txt"
    with open(document_name, 'r', encoding='utf-8') as file:
        text = file.read()
    vectorizer.add_document(
        document_name,
        text,
        chunk_size=100,
        overlap_size=20,
        force_vectorize=False,
        add_as_a_bloc=False,
    )

# Index the documents (perform vectorization)
vectorizer.index()
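
The chunk_size and overlap_size arguments above suggest that each document is split into overlapping chunks before vectorization. The sketch below shows one common way to do this. It assumes whitespace-separated tokens, and the chunk_tokens helper is hypothetical; safe_store's own tokenization and chunking may differ.

```python
def chunk_tokens(text, chunk_size=100, overlap_size=20):
    """Split text into chunks of at most chunk_size tokens.

    Each chunk repeats the last overlap_size tokens of the previous chunk.
    """
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks


words = " ".join(f"w{i}" for i in range(250))
chunks = chunk_tokens(words, chunk_size=100, overlap_size=20)
print(len(chunks))  # prints 3
```

With 250 tokens, the windows start at tokens 0, 80, and 160, so neighbouring chunks share 20 tokens and the last chunk is shorter than the others.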

Embedding a Query and Retrieving Similar Documents

# Embed a query and retrieve similar documents
query_text = "what is space"
similar_texts, _, _ = vectorizer.recover_text(query_text, top_k=3)

# Show the interactive document visualization
vectorizer.show_document(show_interactive_form=True)

print("Similar Documents:")
for i, text in enumerate(similar_texts):
    print(f"{i + 1}: {text}")

Calling vectorizer.show_document(show_interactive_form=True) should display a scatter plot in which each dot is a chunk of text; hover over a dot to read its text. Chunks that come from the same document tend to form a cluster.


Generic Data Loader

Features

  • Read various file formats including PDF, DOCX, JSON, HTML, and more.
  • Convert file content to text or data structures.

Usage

To read a file using GenericDataLoader, you can use the read_file method and provide the file path:

from safe_store import GenericDataLoader
from pathlib import Path

file_path = Path("example.pdf")
file_content = GenericDataLoader.read_file(file_path)

Supported File Types

  • PDF
  • DOCX
  • JSON
  • HTML
  • PPTX
  • TXT
  • RTF
  • MD
  • LOG
  • CPP
  • Java
  • JS
  • Python
  • Ruby
  • Shell Script
  • SQL
  • CSS
  • PHP
  • XML
  • YAML
  • INI
  • INF
  • MAP
  • BAT

In the usage example above, replace "example.pdf" with the path to your own file.
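
As a rough illustration of how a loader can pick a reader from the file extension, here is a standard-library sketch covering only JSON, HTML, and plain-text files. It is not GenericDataLoader's implementation; read_any, READERS, and the helper functions are hypothetical names introduced for this example.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags.

    Note: text inside <script> and <style> elements is collected too.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def read_json(path):
    """Parse a JSON file into Python data structures."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def read_html(path):
    """Return the visible text of an HTML file as a single string."""
    parser = _TextExtractor()
    parser.feed(Path(path).read_text(encoding="utf-8"))
    return " ".join(parser.parts)


def read_text(path):
    """Return the raw content of a plain-text file."""
    return Path(path).read_text(encoding="utf-8")


# Map lowercase file extensions to reader functions.
READERS = {
    ".json": read_json,
    ".html": read_html,
    ".txt": read_text,
    ".md": read_text,
}


def read_any(path):
    """Dispatch to the reader registered for the file's extension."""
    path = Path(path)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return reader(path)
```

Keeping the readers in a dictionary means a new format can be supported by registering one more function, without touching read_any.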


Author

  • ParisNeo

License

This project is licensed under the Apache 2.0 License.


Source Distribution

safe_store-0.7.2.tar.gz (20.9 kB)

Uploaded Source

Built Distribution

safe_store-0.7.2-py3-none-any.whl (20.2 kB)

Uploaded Python 3
