A high-performance, asynchronous, and extensible Python package for processing files, generating embeddings, and storing them in various vector databases with optional cloud storage integration.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

🚀 EmbeddingFramework

Modular • Extensible • Production-Ready
A Python framework for embeddings, vector databases, and cloud storage providers.

📚 Documentation

EmbeddingFramework is designed for AI, NLP, and semantic search applications. It provides a unified API to process, store, and query embeddings across multiple vector databases and cloud storage backends.

✨ Features

🔹 Multi-Vector Database Support

ChromaDB – Local and persistent vector storage.
Milvus – High-performance distributed vector database.
Pinecone – Fully managed vector database service.
Weaviate – Open-source vector search engine.

🔹 Cloud Storage Integrations

AWS S3 – Store and retrieve embeddings or documents.
Google Cloud Storage (GCS) – Scalable object storage.
Azure Blob Storage – Enterprise-grade cloud storage.

🔹 Embedding Providers

OpenAI Embeddings – Embedding generation through the OpenAI API.
Easily extendable to other providers.

🔹 File Processing & Preprocessing

Automatic file type detection.
Text extraction from multiple formats including .txt, .pdf, .docx, .csv, .xls, .xlsx.
Preprocessing utilities for cleaning and normalizing text.
Intelligent text splitting for optimal embedding performance.
Large dataset handling for Excel files with efficient chunking to preserve embedding context.
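
To make the text-splitting feature concrete, here is a minimal sketch of one common approach: fixed-size character windows that prefer to break on whitespace and share a small overlap. The function name `split_text` and its parameters are hypothetical and chosen for illustration; this is not EmbeddingFramework's own splitter, whose strategy is not described here.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. Chunks prefer to end on a
    space, and consecutive chunks share roughly `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut at the last space inside the window, but only
            # if that still moves the next window forward.
            cut = text.rfind(" ", start, end)
            if cut > start + overlap:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by `overlap` so neighbouring chunks share some context.
        start = end - overlap
    return chunks
```

The overlap trades some duplicated storage for context carried across chunk boundaries; setting `overlap=0` gives plain non-overlapping windows.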

🔹 Utilities

Retry logic for robust API calls.
File utilities for safe and efficient I/O.
Modular architecture for easy extension.

📦 Installation & Setup

# Basic installation
pip install embeddingframework

# With development dependencies
pip install "embeddingframework[dev]"

⚡ Quick Start Example

from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
from embeddingframework.adapters.vector_dbs import ChromaDBAdapter

# Initialize embedding provider
embedding_provider = OpenAIEmbeddingAdapter(api_key="YOUR_OPENAI_API_KEY")

# Initialize vector database
vector_db = ChromaDBAdapter(persist_directory="./chroma_store")

# Texts to embed and store
texts = ["Hello world", "EmbeddingFramework is awesome!"]

# Generate embeddings
embeddings = embedding_provider.embed_texts(texts)

# Store the texts alongside their embeddings
vector_db.add_texts(texts, embeddings)

📂 Project Structure

embeddingframework/
│
├── adapters/                # Vector DB & storage adapters
│   ├── base.py
│   ├── chromadb_adapter.py
│   ├── milvus_adapter.py
│   ├── pinecone_adapter.py
│   ├── weaviate_adapter.py
│   └── storage/             # Cloud storage adapters
│
├── processors/              # File processing logic
├── utils/                   # Helper utilities
└── tests/                   # Test suite

🧪 Testing

pytest --maxfail=1 --disable-warnings -q

With coverage:

pytest --cov=embeddingframework --cov-report=term-missing

🔄 CI/CD

This project includes a GitHub Actions workflow (.github/workflows/python-package.yml) for:

Automated testing with coverage.
Version bumping & changelog generation.
PyPI publishing.
GitHub release creation.

📜 License

MIT License

This project is licensed under the MIT License – see the LICENSE file for details.

🤝 Contributing

Contributions, issues, and feature requests are welcome!
Feel free to check the issues page.

Fork the repository.
Create a new branch (feature/my-feature).
Commit your changes.
Push to your branch.
Open a Pull Request.

🌟 Why EmbeddingFramework?

Unified API – Work with multiple vector DBs and storage providers seamlessly.
Extensible – Add new adapters with minimal effort.
Production-Ready – Built with scalability and reliability in mind.
Developer-Friendly – Clean, modular, and well-documented codebase.

📖 Full Documentation Overview

The sections below walk through the main features, usage patterns, and configuration options of EmbeddingFramework.

1️⃣ Introduction

EmbeddingFramework is designed to simplify the integration of embeddings, vector databases, and cloud storage into AI-powered applications. It provides:

A unified API for multiple backends.
Extensible architecture for adding new providers.
Production-ready reliability with retries, error handling, and modular design.

2️⃣ Installation

pip install embeddingframework
pip install "embeddingframework[dev]"  # For development

3️⃣ Supported Vector Databases

Database	Type	Key Features
ChromaDB	Local	Persistent storage, lightweight
Milvus	Distributed	High-performance, scalable
Pinecone	Managed	Fully hosted, easy to scale
Weaviate	Open-source	Semantic search, hybrid queries

4️⃣ Cloud Storage Integrations

EmbeddingFramework supports:

AWS S3
Google Cloud Storage
Azure Blob Storage

Example:

from embeddingframework.adapters.storage.s3_storage_adapter import S3StorageAdapter

# Upload local.txt to the bucket as remote.txt
storage = S3StorageAdapter(bucket_name="my-bucket")
storage.upload_file("local.txt", "remote.txt")

5️⃣ Embedding Providers

Currently supported:

OpenAI Embeddings
Easily extendable to HuggingFace, Cohere, etc.

Example:

from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
provider = OpenAIEmbeddingAdapter(api_key="YOUR_KEY")
embeddings = provider.embed_texts(["Hello", "World"])

6️⃣ File Processing

EmbeddingFramework provides a robust and extensible file processing pipeline that can handle a wide variety of file formats and sizes. This includes:

Automatic File Type Detection – The framework automatically determines the file type and routes it to the appropriate parser.
Text Extraction – Supports extracting text from:
- .txt – Plain text files
- .pdf – PDF documents
- .docx – Microsoft Word documents
- .csv – Comma-separated values
- .xls / .xlsx – Microsoft Excel spreadsheets (including multi-sheet workbooks)
Preprocessing Utilities – Cleans and normalizes extracted text for better embedding quality (e.g., removing stopwords, normalizing whitespace).
Intelligent Text Splitting – Splits large documents into smaller, context-friendly chunks for optimal embedding performance.
Large Dataset Handling for Excel – Efficiently processes large Excel files by:
- Reading all sheets in the workbook.
- Converting each row into a string representation.
- Chunking rows into manageable segments to avoid exceeding embedding context limits.
- Applying quality filters to remove empty or low-value chunks.

This design is intended to let large datasets be processed without exhausting memory or losing semantic context.
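
The row-oriented steps listed above (row to string, chunking, quality filtering) can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's implementation: `row_to_text`, `chunk_rows`, `max_chars`, and `min_chars` are hypothetical names, and sizes are measured in characters here, which may differ from how the library counts.

```python
from typing import Iterable, Iterator, Sequence


def row_to_text(row: Sequence[object]) -> str:
    """Join the non-empty cells of one row into a single line (hypothetical helper)."""
    cells = (str(cell).strip() for cell in row if cell is not None)
    return " | ".join(cell for cell in cells if cell)


def chunk_rows(
    rows: Iterable[Sequence[object]],
    max_chars: int = 2000,
    min_chars: int = 50,
) -> Iterator[str]:
    """Group row strings into chunks of at most max_chars characters.

    Chunks shorter than min_chars are dropped as low value. A single row
    longer than max_chars is still emitted on its own rather than split.
    """
    buffer: list[str] = []
    size = 0  # length of "\n".join(buffer) plus one trailing separator

    for row in rows:
        line = row_to_text(row)
        if not line:
            continue  # skip rows with no usable cells
        # Flush first if adding this line would push the chunk past the limit.
        if buffer and size + len(line) > max_chars:
            chunk = "\n".join(buffer)
            if len(chunk) >= min_chars:
                yield chunk
            buffer, size = [], 0
        buffer.append(line)
        size += len(line) + 1

    if buffer:
        chunk = "\n".join(buffer)
        if len(chunk) >= min_chars:
            yield chunk
```

Because `chunk_rows` is a generator over any iterable of rows, it never needs the whole sheet in memory. One possible row source is openpyxl in read-only mode, iterating each worksheet with `iter_rows(values_only=True)`.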

Example:

from embeddingframework.processors.file_processor import FileProcessor

processor = FileProcessor()

# Process a PDF
pdf_text = processor.process_file("document.pdf")

# Process a large Excel file with multiple sheets
excel_text = processor.process_file("large_dataset.xlsx")

# Process a CSV file
csv_text = processor.process_file("data.csv")

# Process a DOCX file
docx_text = processor.process_file("report.docx")

Advanced Usage:

# Asynchronous processing with custom chunk sizes and quality filters
import asyncio

from embeddingframework.processors.file_processor import FileProcessor

processor = FileProcessor()

async def process_files():
    await processor.process_file_async(
        "large_dataset.xlsx",
        chunk_size=2000,
        text_chunk_size=1000,
        merge_target_size=3000,
        parallel=True,
        min_quality_length=50,
    )

asyncio.run(process_files())
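
When several files need processing at once, a bounded-concurrency pattern keeps resource use predictable. The sketch below uses only the standard library; `extract_text` is a hypothetical stand-in for any blocking extraction step, and `process_many` is not part of EmbeddingFramework.

```python
import asyncio
from pathlib import Path


def extract_text(path: str) -> str:
    """Hypothetical blocking extractor; here it simply reads a UTF-8 text file."""
    return Path(path).read_text(encoding="utf-8")


async def process_many(paths: list[str], max_concurrency: int = 4) -> dict[str, str]:
    """Run extract_text for every path, with at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> tuple[str, str]:
        async with semaphore:
            # Blocking I/O runs in a worker thread so the event loop stays free.
            text = await asyncio.to_thread(extract_text, path)
            return path, text

    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(results)
```

Calling `asyncio.run(process_many(["a.txt", "b.txt"]))` returns a mapping from each path to its extracted text. `asyncio.to_thread` requires Python 3.9 or later.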

7️⃣ Utilities

Retry logic
File utilities
Preprocessing helpers
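
As an illustration of what retry logic around an API call typically looks like, here is a minimal decorator with exponential backoff and a little jitter. The name `retry` and all of its parameters are assumptions made for this sketch; the framework's own retry utility may have a different name, signature, and policy.

```python
import random
import time
from functools import wraps


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function on the given exceptions (hypothetical helper).

    The delay doubles after each failed attempt. `sleep` is injectable so
    that tests can avoid real waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = base_delay * 2 ** (attempt - 1)
                    # Small random jitter spreads out simultaneous retries.
                    sleep(delay + random.uniform(0, delay / 10))

        return wrapper

    return decorator
```

Restricting `exceptions` to transient errors (for example `ConnectionError` or `TimeoutError`) avoids retrying failures that will never succeed, such as invalid credentials.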

8️⃣ CLI Usage

EmbeddingFramework includes a CLI:

embeddingframework --help

9️⃣ Advanced Configurations

Custom vector DB adapters
Custom embedding providers
Batch processing
Async support
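
A custom embedding provider combined with batch processing might look like the sketch below. Only the method name `embed_texts` is taken from the examples in this document; the `EmbeddingProvider` protocol, `HashEmbeddingProvider`, and `embed_in_batches` are hypothetical, and the real base class in the package may require something different. The toy provider produces deterministic but semantically meaningless vectors, which is useful for tests that should not call a paid API.

```python
import hashlib
from typing import Iterator, Protocol, Sequence


class EmbeddingProvider(Protocol):
    """Assumed minimal interface: one method that maps texts to vectors."""

    def embed_texts(self, texts: Sequence[str]) -> list[list[float]]: ...


class HashEmbeddingProvider:
    """Toy provider that derives a small vector from each text's SHA-256 digest."""

    def __init__(self, dimensions: int = 8) -> None:
        if not 1 <= dimensions <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dimensions must be between 1 and 32")
        self.dimensions = dimensions

    def embed_texts(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale each byte into the range [0.0, 1.0].
            vectors.append([byte / 255.0 for byte in digest[: self.dimensions]])
        return vectors


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    provider: EmbeddingProvider, texts: Sequence[str], batch_size: int = 2
) -> list[list[float]]:
    """Embed texts batch by batch, preserving the input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(provider.embed_texts(batch))
    return vectors
```

Batching keeps each request under a provider's per-call limits while the caller still receives one vector per input text, in order.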

🔟 End-to-End Example

from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
from embeddingframework.adapters.vector_dbs import ChromaDBAdapter

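# Set up the embedding provider and a persistent local vector store
provider = OpenAIEmbeddingAdapter(api_key="KEY")
db = ChromaDBAdapter(persist_directory="./store")

# Embed the texts, then store each text with its embedding
texts = ["AI is amazing", "EmbeddingFramework is powerful"]
embeddings = provider.embed_texts(texts)
db.add_texts(texts, embeddings)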
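
The example above stops at storage. To show what a similarity query over stored embeddings computes, here is a brute-force cosine-similarity search in plain Python. It is a conceptual sketch only: `cosine_similarity` and `top_k` are hypothetical helpers, not the query API of EmbeddingFramework or of any vector database, which use indexes rather than a full scan.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    texts: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    k: int = 1,
) -> list[tuple[str, float]]:
    """Return the k stored texts whose embeddings are most similar to the query."""
    scored = [(text, cosine_similarity(query, emb)) for text, emb in zip(texts, embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In practice the query vector would come from the same embedding provider that produced the stored vectors, since similarity scores are only comparable within one embedding model.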

📊 Feature Matrix

Feature	Supported
Multi-DB Support	✅
Cloud Storage	✅
File Processing	✅
Retry Logic	✅
CLI	✅
Async	✅

📚 Learn More

For the full documentation, visit:
👉 EmbeddingFramework Docs

Release history

This version

1.0.8

Sep 10, 2025

1.0.7

Aug 29, 2025

1.0.6

Aug 29, 2025

1.0.5

Aug 28, 2025

1.0.4

Aug 28, 2025

1.0.3

Aug 28, 2025

1.0.2

Aug 28, 2025

1.0.1

Aug 28, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

embeddingframework-1.0.8.tar.gz (24.7 kB view details)

Uploaded Sep 10, 2025 Source

Built Distribution

embeddingframework-1.0.8-py3-none-any.whl (28.2 kB view details)

Uploaded Sep 10, 2025 Python 3

File details

Details for the file embeddingframework-1.0.8.tar.gz.

File metadata

Download URL: embeddingframework-1.0.8.tar.gz
Upload date: Sep 10, 2025
Size: 24.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for embeddingframework-1.0.8.tar.gz
Algorithm	Hash digest
SHA256	`645a399a23b47e4b18bfe856bac612954e56ddb7b9503b376cbe66ac0ca24bc7`
MD5	`5282f6133e75e6701cba7199b75c7a35`
BLAKE2b-256	`30f02d3903e22c3caba46c673e087eb7b7dd595a7f2080fc1737a9c83ed33589`

File details

Details for the file embeddingframework-1.0.8-py3-none-any.whl.

File metadata

Download URL: embeddingframework-1.0.8-py3-none-any.whl
Upload date: Sep 10, 2025
Size: 28.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for embeddingframework-1.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1a601a5e9367d5e6c2099ce8fb2bcc810afa4a2c7837c425cd16b5c8bbdbb371`
MD5	`b91eb994dcbf8787c444a93d6bcdcf85`
BLAKE2b-256	`98dda12f347df3c44497dacc02258c64bd971e5ed7b413f86274e22eb2d72666`
