A library for safe document storage and vectorization.
Project description
safe_store
safe_store
is an open-source Python library that provides essential tools for text data management, vectorization, and document retrieval. It empowers users to work with text documents efficiently and effortlessly.
- safe_store
Key Features:
1. Text Vectorizer
- Versatile Vectorization: Choose between TF-IDF vectorization, model-based embeddings to convert text documents into numerical representations or use BM25 ranking for text retreival.
- Document Similarity: Find documents similar to a given query text, making it ideal for document retrieval tasks.
- Interactive Visualization: Visualize document embeddings in a scatter plot to gain insights into document relationships.
- No Authentication Required: Use the library without the need for API keys or authentication, making it accessible for everyone.
- Commercially Usable:
safe_store
is 100% open-source and free to use, even for commercial purposes, under the Apache 2.0 License.
2. Generic Data Loader
- Multi-format Support: Read various file formats, including PDF, DOCX, JSON, HTML, and more.
- Simplified Text Extraction: Convert file content to plain text or data structures with ease.
- Efficient and Time-Saving: Streamline data loading and processing tasks, reducing the need for manual extraction.
What Can You Use safe_store
For?
- Text Document Analysis: Analyze and understand the content of text documents quickly and efficiently.
- Document Retrieval: Retrieve documents similar to a given query text, facilitating content recommendation and search tasks.
- Text Data Preprocessing: Prepare text data for natural language processing (NLP) tasks, such as sentiment analysis and text classification.
- Data Loading: Streamline the process of reading and extracting content from various file formats.
safe_store
is designed to be accessible, versatile, and free for all users. It's an ideal choice for developers, data scientists, and researchers who want a user-friendly and open-source solution for working with text data.
Explore the world of text data management and analysis with safe_store
today!
Text Vectorizer
Features
- Vectorize and index text documents.
- Retrieve similar documents based on a query.
- Supports both TF-IDF vectorization and model-based embeddings.
- Interactive visualization of document embeddings.
- No authentication or API keys required.
Installation
To install safe_store
, you can use pip
:
pip install safe_store
Getting Started
Initializing the Text Vectorizer
from safe_store import TextVectorizer, VectorizationMethod
from pathlib import Path
# Create an instance of TextVectorizer
vectorizer = TextVectorizer(
vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,
database_path="database.json",
save_db=False
)
Adding and Indexing Documents
# Add documents for vectorization
documents = ["llm", "space", "submarines", "new york"]
for doc in documents:
document_name = Path(__file__).parent / f"{doc}.txt"
with open(document_name, 'r', encoding='utf-8') as file:
text = file.read()
vectorizer.add_document(document_name, text, chunk_size=100, overlap_size=20, force_vectorize=False, add_as_a_bloc=False)
# Index the documents (perform vectorization)
vectorizer.index()
Embedding a Query and Retrieving Similar Documents
# Embed a query and retrieve similar documents
query_text = "what is space"
similar_texts, _, _ = vectorizer.recover_text(query_text, top_k=3)
# Show the interactive document visualization
vectorizer.show_document(show_interactive_form=True)
print("Similar Documents:")
for i, text in enumerate(similar_texts):
print(f"{i + 1}: {text}")
The vectorizer.show_document(show_interactive_form=True)
should yield a plot like this where you can read the text by pointing on the dots. Each dot is a chunk of the text. We can clearly see that chunks that come from the same document tend to form a cluster.
Generic Data Loader
Features
- Read various file formats including PDF, DOCX, JSON, HTML, and more.
- Convert file content to text or data structures.
Usage
To read a file using GenericDataLoader
, you can use the read_file
method and provide the file path:
from safe_store import GenericDataLoader
from pathlib import Path
file_path = Path("example.pdf")
file_content = GenericDataLoader.read_file(file_path)
Supported File Types
- DOCX
- JSON
- HTML
- PPTX
- TXT
- RTF
- MD
- LOG
- CPP
- Java
- JS
- Python
- Ruby
- Shell Script
- SQL
- CSS
- PHP
- XML
- YAML
- INI
- INF
- MAP
- BAT
Feel free to replace "example.pdf"
with the path to your specific file.
Author
- ParisNeo
License
This project is licensed under the Apache 2.0 License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for safe_store-0.6.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3b6f38b018bea9a34d14429df581d62774205c0f79c05a7e20d71350aec585a9 |
|
MD5 | 5c09720042d97a4398c18aa890a286bd |
|
BLAKE2b-256 | 1dc797a093bc0684588833d01af4f5c7e4a74408e494d41cb9ff243fdbcf5376 |