Skip to main content

SAGE RAG Framework - Document loaders, chunkers, retrievers for RAG pipelines

Project description

SAGE RAG

Document Loading, Chunking, Retrieval, and RAG Pipeline Orchestration

PyPI version Python 3.10+ License

SAGE RAG 是从 SAGE 框架中提取的独立 RAG 组件,提供完整的文档加载、文本分块、检索、重排序和 RAG 管道编排能力。

✨ 核心特性

📄 Document Loading

  • DocumentLoader 抽象接口 - 统一的文档加载接口
  • 支持多种格式:
    • 纯文本 (TextLoader)
    • PDF (PDFLoader)
    • Word 文档 (DocxLoader)
    • Markdown (MarkdownLoader)
    • HTML (HTMLLoader)
  • Document 数据类型 - 标准化文档表示

✂️ Text Chunking

  • TextChunker 抽象接口 - 文本分块策略
  • 多种分块算法:
    • 字符级分块 (CharacterSplitter)
    • 句子级分块 (SentenceSplitter)
    • Token 级分块 (TokenTextSplitter, SentenceTransformersTokenTextSplitter)
    • 语义分块 (SemanticChunker)
  • Chunk 数据类型 - 包含元数据的文本块

🔍 Retrieval

  • Retriever 抽象接口 - 检索策略
  • 支持多种后端:
    • SageVDB (isage-vdb)
    • ChromaDB
    • Milvus
    • FAISS
  • RetrievalResult 数据类型 - 检索结果表示

🎯 Reranking

  • Reranker 抽象接口 - 重排序策略
  • 算法支持:
    • Cross-encoder reranking
    • LLM-based reranking
    • Hybrid reranking

🔄 RAG Pipeline

  • RAGPipeline - 端到端 RAG 流程编排
  • 支持自定义组件组合
  • 内置评估指标

�� 安装

基础安装

pip install isage-rag

完整安装(包含所有可选功能)

pip install isage-rag[all]

按需安装

# 文档加载
pip install isage-rag[loaders]

# 文本分块
pip install isage-rag[chunking]

# 向量检索
pip install isage-rag[retrieval]

# 重排序
pip install isage-rag[reranking]

# LLM 生成
pip install isage-rag[generation]

# 评估指标
pip install isage-rag[evaluation]

🚀 快速开始

1. 文档加载

from sage.libs.rag.interface import create_document_loader

# 加载 PDF 文档
loader = create_document_loader("pdf")
documents = loader.load("path/to/document.pdf")

for doc in documents:
    print(f"Page {doc.metadata['page']}: {doc.content[:100]}...")

2. 文本分块

from sage.libs.rag.interface import create_text_chunker

# 创建分块器
chunker = create_text_chunker("sentence_transformers")

# 分块文档
chunks = chunker.chunk(documents, chunk_size=512, overlap=50)
print(f"Total chunks: {len(chunks)}")

3. 构建 RAG 管道

from sage.libs.rag.interface import create_rag_pipeline

# 创建完整 RAG 管道
pipeline = create_rag_pipeline("default")

# 配置组件
pipeline.configure(
    loader="pdf",
    chunker="sentence_transformers",
    retriever="sagedb",
    reranker="cross_encoder",
    generator="openai"
)

# 执行 RAG 查询
response = pipeline.query(
    "What are the main findings?",
    documents=["path/to/doc1.pdf", "path/to/doc2.pdf"]
)
print(response.answer)

4. 使用接口抽象

from sage.libs.rag.interface import DocumentLoader, TextChunker, Retriever

# 实现自定义文档加载器
class CustomLoader(DocumentLoader):
    def load(self, source: str) -> list[Document]:
        # 自定义加载逻辑
        pass

# 注册到工厂
from sage.libs.rag.interface import register_document_loader
register_document_loader("custom", CustomLoader)

# 使用工厂创建
loader = create_document_loader("custom")

🏗️ 架构

sage.libs.rag/
├── interface/              # 公共接口(从 sage-libs 导入)
│   ├── base.py            # DocumentLoader, TextChunker, Retriever, Reranker, RAGPipeline
│   ├── factory.py         # 工厂函数和注册表
│   └── __init__.py        # 公共 API
├── document_loaders/      # 文档加载器实现
│   ├── text_loader.py
│   ├── pdf_loader.py
│   ├── docx_loader.py
│   └── markdown_loader.py
├── chunk/                 # 文本分块实现
│   ├── character_splitter.py
│   ├── sentence_splitter.py
│   └── token_splitter.py
├── retrieval/            # 检索实现
│   ├── sagedb_retriever.py
│   ├── chroma_retriever.py
│   └── milvus_retriever.py
├── reranking/            # 重排序实现
│   ├── cross_encoder.py
│   └── llm_reranker.py
├── pipeline/             # RAG 管道编排
│   ├── default_pipeline.py
│   └── custom_pipeline.py
└── types/                # 数据类型定义
    ├── document.py
    ├── chunk.py
    └── retrieval_result.py

🔌 与 SAGE 集成

虽然 isage-rag 可以独立使用,但它与 SAGE 框架深度集成:

# 在 SAGE 项目中使用(通过 sage-libs)
from sage.libs.rag.interface import create_rag_pipeline

# 接口层在 sage-libs,实现在 isage-rag
pipeline = create_rag_pipeline("default")

在 SAGE 项目的 pyproject.toml 中:

[project.optional-dependencies]
rag = ["isage-rag>=0.1.0"]

安装 SAGE 时自动包含 RAG:

pip install sage-libs[rag]
# 或
pip install sage-libs[all]

📚 文档

🤝 贡献

欢迎贡献!请查看 CONTRIBUTING.md 了解详情。

📄 许可证

Apache License 2.0 - 详见 LICENSE

🔗 相关项目

📧 联系

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

isage_rag-0.1.0.0-cp311-none-any.whl (18.1 kB view details)

Uploaded CPython 3.11

File details

Details for the file isage_rag-0.1.0.0-cp311-none-any.whl.

File metadata

  • Download URL: isage_rag-0.1.0.0-cp311-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: CPython 3.11
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for isage_rag-0.1.0.0-cp311-none-any.whl
Algorithm Hash digest
SHA256 66e8aa14d6a32490bd0205db9e93eda8f03ac6b13479ed8e7bea19dac72708fb
MD5 88efcb37baa17e1836d13fe722a7ea6b
BLAKE2b-256 a98d07806418542e8761ba1f97b4e0af803152ddb5268cee990ba15bfe58dcc8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page