DATA 533 RAG Engine project with ingestion, indexing, retrieval, and CI
Project description
DataSage
PyPI: https://pypi.org/project/datasage-mds/
A lightweight, modular Python package for building Retrieval-Augmented Generation (RAG) systems. DataSage enables you to query your documents using natural language by combining semantic search with large language models (LLMs).
Features
- Document Ingestion: Support for multiple file formats (CSV, XLSX, PDF, TXT).
- Efficient Chunking: Configurable text splitting with overlap for context preservation.
- Vector Storage: ChromaDB-backed vector database for efficient similarity search.
- Semantic Search: HuggingFace embeddings for accurate document retrieval.
- LLM Integration: Local LLM support via Ollama for answer generation.
- Modular Architecture: Easy to extend and customize components.
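To make the "chunking with overlap" feature concrete, here is a minimal sketch in plain Python. It is not DataSage's `chunker.py`; the function name `chunk_text` and its parameters are hypothetical, and it splits on characters rather than tokens or sentences.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters.

    Hypothetical helper for illustration only; not part of the DataSage API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

Each chunk repeats the tail of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk when the overlap is large enough.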
Architecture
DataSage
├── Ingestion Layer → Load and chunk documents
├── Indexing Layer → Embed and store in vector database
├── Query Layer → Retrieve relevant context and generate answers
└── RAG Pipeline → End-to-end question answering system
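The flow through these layers can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not DataSage's implementation: word counts stand in for HuggingFace embeddings, an in-memory list stands in for ChromaDB, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Query layer: rank stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM for answer generation."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    chunks = [
        "ollama runs local models",
        "chromadb stores vectors",
        "pdf files are loaded page by page",
    ]
    question = "how are pdf files loaded"
    print(build_prompt(question, retrieve(question, chunks, top_k=1)))
```

In a real pipeline the prompt would be passed to a model served by Ollama; that call is left out here so the sketch runs without any external service.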
Prerequisites
- Python 3.10 or higher
- Ollama (for local LLM inference)
Installation
### 1. Install from PyPI (recommended)
The package is published on PyPI: https://pypi.org/project/datasage-mds/
```bash
pip install datasage-mds
```
### 2. Install Ollama
Download and install Ollama from [ollama.com](https://ollama.com/download).
Once installed, run the following in a separate terminal.
Pull a model:
```bash
ollama pull llama3.1
```
Verify the installation:
```bash
ollama run llama3.1
```
Supported File Formats
- CSV: Loaded with metadata for each row
- PDF: Extracted page by page
- TXT: Loaded as a single document
- XLSX: Extracted sheet by sheet
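The per-format behaviour above can be illustrated with a small extension-based dispatcher. This is a sketch, not the contents of DataSage's `loaders.py`: the function `load_documents` and the `{"text", "metadata"}` dictionary shape are assumptions, and PDF and XLSX are left out because they need third-party parsers.

```python
import csv
from pathlib import Path


def load_documents(path: str) -> list[dict]:
    """Load a file into a list of {"text", "metadata"} records.

    Hypothetical helper: TXT becomes one document, CSV becomes one
    document per row with the row index kept as metadata.
    """
    p = Path(path)
    suffix = p.suffix.lower()

    if suffix == ".txt":
        return [{"text": p.read_text(encoding="utf-8"),
                 "metadata": {"source": p.name}}]

    if suffix == ".csv":
        with p.open(newline="", encoding="utf-8") as f:
            return [
                {
                    # Flatten the row into "column: value" pairs for embedding.
                    "text": ", ".join(f"{k}: {v}" for k, v in row.items()),
                    "metadata": {"source": p.name, "row": i},
                }
                for i, row in enumerate(csv.DictReader(f))
            ]

    raise ValueError(f"unsupported file type: {suffix!r}")
```

Keeping the source file name and row index in the metadata makes it possible to point back to where a retrieved chunk came from.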
Use Cases
- Document Q&A: Query large documents using natural language
- Knowledge Base Search: Build searchable knowledge bases
- Customer Support: Answer questions from documentation
- Research Assistant: Extract information from academic papers
- Code Documentation: Query codebases and technical docs
Contributors
Yihang Wang
- Sub-package: ingestion
- Modules: loaders.py, chunker.py
Aaron Sukare
- Sub-package: indexing
- Modules: embedder.py, vector_store.py, index_engine.py
Zaed Khan
- Sub-package: retrieval
- Modules: rag_engine/__init__.py, generator.py, retriever.py, data_models.py
Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Acknowledgments
- Built with LangChain
- Embeddings powered by HuggingFace
- Vector storage by ChromaDB
- Local LLM inference via Ollama
Contact
For questions or support, please open an issue on GitHub.
Made with ❤️ by the DataSage Team
datasage_data533_step_3
├─ .DS_Store
├─ coverage.json
├─ datasage_store
│  └─ chroma.sqlite3
├─ main.py
├─ project_description.pdf
├─ rag_engine
│  ├─ .DS_Store
│  ├─ indexing
│  │  ├─ embedder.py
│  │  ├─ indexing_documentation_updated.md
│  │  ├─ index_engine.py
│  │  ├─ testing_readme.md
│  │  └─ vector_store.py
│  ├─ ingestion
│  │  ├─ chunker.py
│  │  ├─ coverage_ingestion
│  │  │  ├─ coveragehtml_ingestion.png
│  │  │  └─ coverage_ingestion.png
│  │  ├─ documentation.md
│  │  ├─ loaders.py
│  │  ├─ README.md
│  │  └─ __init__.py
│  ├─ retrieval
│  │  ├─ data_models.py
│  │  ├─ documentation.md
│  │  ├─ generator.py
│  │  ├─ README.md
│  │  ├─ retriever.py
│  │  └─ __init__.py
│  ├─ tests
│  │  ├─ coverage_report.png
│  │  ├─ test_csv_loader.py
│  │  ├─ test_data_models.py
│  │  ├─ test_embedder.py
│  │  ├─ test_generator.py
│  │  ├─ test_index_engine.py
│  │  ├─ test_pdf_loader.py
│  │  ├─ test_retriever.py
│  │  ├─ test_text_chunker.py
│  │  ├─ test_txt_loader.py
│  │  ├─ test_vector_store.py
│  │  └─ __init__.py
│  ├─ rag_engine.py
│  └─ __init__.py
├─ README.md
├─ pyproject.toml
├─ requirements.txt
├─ search_test.txt
├─ test_data.csv
└─ utils_test.txt
Download files
File details
Details for the file datasage_mds-0.0.2.tar.gz.
File metadata
- Download URL: datasage_mds-0.0.2.tar.gz
- Upload date:
- Size: 22.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7767c9f57565624230d74c5d8bcafc44baed220b359b7617c265445c3ec74fc7 |
| MD5 | 913318028ffbe10f4f92dfc94ade7a7e |
| BLAKE2b-256 | 38164066492b7f7c5fa8e9851b88bc6476081c4b4af3fd8220b730afcc52e528 |
File details
Details for the file datasage_mds-0.0.2-py3-none-any.whl.
File metadata
- Download URL: datasage_mds-0.0.2-py3-none-any.whl
- Upload date:
- Size: 18.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 153adffc0aff8fa8d75718a89eb79dcfe10f0f53d7497989d7b0ef1d4a2b6554 |
| MD5 | 0eded7674b300406da4336e2dc0911c3 |
| BLAKE2b-256 | c762b76fc8f84f2e4c643f7c810c86823d4183f0be672ba5f34b5d378094bf0c |
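A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `sha256_of` and `matches_digest` are hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib


def sha256_of(path: str, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("datasage_mds-0.0.2.tar.gz", "7767c9f57565624230d74c5d8bcafc44baed220b359b7617c265445c3ec74fc7")` should return `True` for an intact copy of the source distribution.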