RAG Knowledge Preparation in Python

These details have not been verified by PyPI

Project links

Homepage

Intended Audience
- Developers
License
- Other/Proprietary License
Operating System
- OS Independent
Programming Language
- Python
- Python :: 3.12
Topic
- Software Development :: Libraries :: Python Modules

Project description

RAG Knowledge Preparation Python

A Python library for preparing knowledge bases for Retrieval-Augmented Generation (RAG) systems. It converts documents to Markdown, with optional OCR and table extraction, and analyzes codebases to extract structure, dependencies, and metadata.

Features

Document Processing Features

Multi-format Support: Convert PDF, DOCX, HTML, CSV, and other formats to Markdown
OCR Integration: Extract text from scanned documents using EasyOCR or Tesseract
Advanced Table Processing: Intelligent table detection and conversion using TableFormer
Batch Processing: Process multiple documents or entire folders efficiently
Configurable Quality: Multiple processing presets for different use cases

Codebase Analysis Features

Comprehensive Analysis: Extract structure, dependencies, and metadata from codebases
Multi-language Support: Python, JavaScript, TypeScript, and more
AI-Powered Summaries: Generate intelligent code summaries using Google Gemini
Dependency Analysis: Identify and categorize internal, external, and standard library dependencies
Structure Extraction: Parse classes, functions, imports, and code organization
Token Estimation: Estimated token counts for sizing content in RAG pipelines

Configuration & Customization

Flexible Configuration: Extensive configuration options for both document and codebase processing
Preset Configurations: Pre-built configurations for common use cases
Custom Metadata: Configurable metadata fields for different analysis needs
Performance Optimization: Built-in performance modes for large-scale processing

Installation

pip install rag-knowledge-preparation

Development Installation

git clone 
cd rag-knowledge-preparation-python
pip install -e ".[dev]"

Quick Start

Document Processing

from rag_knowledge_preparation import (
    convert_document_to_markdown,
    convert_scanned_document_to_markdown,
    convert_documents_batch
)

# Convert a single document
markdown_content = convert_document_to_markdown("document.pdf")

# Convert a scanned document with OCR
scanned_content = convert_scanned_document_to_markdown("scanned_document.pdf")

# Process multiple documents
results = convert_documents_batch(["doc1.pdf", "doc2.docx", "doc3.html"])

Codebase Analysis

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview
)

# Export entire codebase to Markdown
output_file = export_codebase_to_markdown("./my_project", "codebase_export.md")

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

# Get high-level overview
overview = get_codebase_overview("./my_project")

Document Processing Details

Supported Formats

PDF: Native PDF processing with OCR support
Microsoft Office: DOCX, DOC, PPTX, PPT
Web Formats: HTML, XML
Data Formats: CSV, TSV, JSON
Text Formats: TXT, MD, RST

Processing Presets

Basic Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Basic processing without OCR
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="basic"
)

Standard Document Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Standard processing with OCR and advanced tables
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="standard"
)

OCR-Heavy Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Heavy OCR processing for scanned documents
content = convert_document_to_markdown(
    "scanned_document.pdf", 
    processing_preset="ocr_heavy"
)

Table-Focused Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Optimized for documents with complex tables
content = convert_document_to_markdown(
    "data_heavy_document.pdf", 
    processing_preset="table_focused"
)

High-Quality Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Maximum quality with all features enabled
content = convert_document_to_markdown(
    "important_document.pdf", 
    processing_preset="high_quality"
)

Custom Configuration

from rag_knowledge_preparation import convert_document_to_markdown

# Custom configuration
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="standard",
    ocr_engine="tesseract",
    ocr_language="en",
    table_confidence_threshold=0.9,
    enable_cell_matching=True
)
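The call above combines a named preset with keyword overrides. One way such a merge could work is sketched below. It does not use the library: `DEFAULTS` copies the default values listed in the Document Processing Configuration table of this README, while `PRESETS`, `resolve_options`, and the contents of the "basic" preset are hypothetical names and values chosen for illustration.

```python
# Defaults copied from the Document Processing Configuration table.
DEFAULTS = {
    "enable_ocr": True,
    "table_processing": "advanced",
    "ocr_engine": "easyocr",
    "ocr_language": "en",
    "table_confidence_threshold": 0.8,
    "enable_cell_matching": True,
    "enable_table_structure": True,
}

# Hypothetical preset contents; the library's real presets may differ.
PRESETS = {
    "standard": {},
    "basic": {"enable_ocr": False, "table_processing": "basic"},
}


def resolve_options(processing_preset: str = "standard", **overrides) -> dict:
    """Merge defaults, then the preset, then explicit keyword overrides."""
    if processing_preset not in PRESETS:
        raise ValueError(f"Unknown preset: {processing_preset!r}")
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise TypeError(f"Unknown options: {unknown}")
    # Later dictionaries win, so explicit overrides take precedence.
    return {**DEFAULTS, **PRESETS[processing_preset], **overrides}


options = resolve_options(
    "standard",
    ocr_engine="tesseract",
    table_confidence_threshold=0.9,
)
print(options["ocr_engine"], options["table_confidence_threshold"])
```

Applying overrides last means a caller can start from a preset and change only the options that matter for a given document.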

Batch Processing

from rag_knowledge_preparation import convert_documents_batch, convert_folder_to_markdown

# Process multiple files
results = convert_documents_batch([
    "document1.pdf",
    "document2.docx", 
    "document3.html"
])

# Process entire folder
folder_results = convert_folder_to_markdown("./documents/")

Codebase Analysis Usage

Basic Analysis

from rag_knowledge_preparation import analyze_codebase_structure

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

print(f"Total files: {structure['total_files']}")
print(f"Total lines: {structure['total_lines']}")
print(f"Languages: {structure['languages']}")

Export to Markdown

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with default settings
output_file = export_codebase_to_markdown("./my_project")

# Export with custom output file
output_file = export_codebase_to_markdown(
    "./my_project", 
    output_file="my_codebase.md"
)

AI-Powered Analysis

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with AI summaries (requires Gemini API key)
output_file = export_codebase_to_markdown(
    "./my_project",
    gemini_api_key="your-gemini-api-key",
    gemini_model="gemini-pro"
)

Codebase Processing Presets

Minimal Processing

from rag_knowledge_preparation import export_codebase_to_markdown

# Minimal processing - basic analysis only
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="minimal"
)

Standard Processing

from rag_knowledge_preparation import export_codebase_to_markdown

# Standard processing with full analysis
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="standard"
)

Comprehensive Processing

from rag_knowledge_preparation import export_codebase_to_markdown

# Comprehensive processing with all features
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="comprehensive"
)

Configuration Options

from rag_knowledge_preparation import (
    CodebaseProcessingConfig,
    MetadataConfig,
    export_codebase_to_markdown
)

# Custom configuration
config = CodebaseProcessingConfig(
    max_file_size_mb=2.0,
    include_test_files=False,
    include_documentation=True,
    enable_ai_summary=True,
    gemini_api_key="your-api-key",
    custom_ignore_patterns=["*.log", "temp/*"]
)

# Custom metadata configuration
metadata_config = MetadataConfig(
    include_file_path=True,
    include_language=True,
    include_purpose=True,
    include_dependencies=True,
    include_structure=True,
    include_summary=True
)

config.metadata_config = metadata_config

# Use custom configuration
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="custom",
    **config.model_dump()
)

Advanced Features

Language Detection and Classification

The library automatically detects programming languages and classifies files by purpose:

from rag_knowledge_preparation.codebase_processing.analysis import (
    get_language_from_extension,
    classify_file_by_purpose
)

# Detect language from file extension
language = get_language_from_extension("script.py")  # Returns "python"

# Classify file by purpose
purpose = classify_file_by_purpose("test_utils.py")  # Returns "Tests"
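For readers who want a feel for what extension-based detection and purpose classification involve, here is a minimal standalone sketch. It is not the library's implementation: `EXTENSION_TO_LANGUAGE`, `guess_language`, `guess_purpose`, and every category label other than "Tests" are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical lookup table; the library's real mapping may differ.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
}


def guess_language(filename: str) -> str:
    """Map a file extension to a language name, or "unknown"."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix, "unknown")


def guess_purpose(filename: str) -> str:
    """Classify a file with simple name-based heuristics (illustrative only)."""
    name = PurePosixPath(filename).name.lower()
    if name.startswith("test_") or name.endswith("_test.py"):
        return "Tests"
    if name.endswith((".md", ".rst", ".txt")):
        return "Documentation"
    if name.endswith((".toml", ".yaml", ".yml", ".json", ".ini")):
        return "Configuration"
    return "Source"


print(guess_language("script.py"))
print(guess_purpose("test_utils.py"))
```

Name-based heuristics are cheap but imprecise, so a real classifier would likely also look at directory names or file content.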

Dependency Analysis

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import analyze_file_dependencies

# Analyze dependencies in a Python file
with open("main.py", "r") as f:
    content = f.read()
dependencies = analyze_file_dependencies(content, Path("main.py"), "python")

print("External packages:", dependencies["external_packages"])
print("Standard library:", dependencies["standard_library"])
print("Internal modules:", dependencies["internal_modules"])

Code Structure Extraction

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import extract_code_structure

# Extract structure from code file
code_content = """
class MyClass:
    def __init__(self):
        pass
    
    def method(self):
        pass
"""
structure = extract_code_structure(Path("example.py"), "python", code_content)

print("Classes:", structure["classes"])
print("Functions:", structure["functions"])
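As a rough picture of what structure extraction does for Python sources, the sketch below walks a module with the standard library `ast` module. This is a stand-in written for illustration, not the library's parser; `extract_python_structure` and the `Class.method` naming of methods are assumptions.

```python
import ast


def extract_python_structure(source: str) -> dict[str, list[str]]:
    """List top-level classes, their methods, and top-level functions."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    classes, functions = [], []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            classes.append(node.name)
            # Record methods with a qualified name so they stay distinguishable.
            for item in node.body:
                if isinstance(item, function_types):
                    functions.append(f"{node.name}.{item.name}")
        elif isinstance(node, function_types):
            functions.append(node.name)
    return {"classes": classes, "functions": functions}


sample = '''
class MyClass:
    def __init__(self):
        pass

    def method(self):
        pass

def helper():
    return 1
'''
sample_structure = extract_python_structure(sample)
print("Classes:", sample_structure["classes"])
print("Functions:", sample_structure["functions"])
```

The `ast` module only understands Python, so a multi-language tool needs a different parser for other languages.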

Token Estimation

from rag_knowledge_preparation.codebase_processing.analysis import estimate_token_count

# Estimate tokens in text
token_count = estimate_token_count("Hello, world!")
print(f"Estimated tokens: {token_count}")

# Estimate tokens in code
code_tokens = estimate_token_count("""
def hello():
    print("Hello, world!")
""")
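The method behind `estimate_token_count` is not documented here. For comparison, a common rule of thumb for English text is about four characters per token, which the standalone sketch below applies. The function `rough_token_estimate` and its `chars_per_token` parameter are hypothetical, and real tokenizers will give different counts, especially for code and non-English text.

```python
import math


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate a token count from character length (heuristic only)."""
    if not text:
        return 0
    # Round up so that any non-empty text counts as at least one token.
    return math.ceil(len(text) / chars_per_token)


print(rough_token_estimate("Hello, world!"))
```

A heuristic like this is useful for budgeting chunk sizes, but an exact count requires the tokenizer of the target model.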

Configuration Reference

Document Processing Configuration

Parameter	Type	Default	Description
`enable_ocr`	bool	True	Enable OCR processing
`table_processing`	str	"advanced"	Table processing mode (basic, advanced, tableformer)
`ocr_engine`	str	"easyocr"	OCR engine (easyocr, tesseract)
`ocr_language`	str	"en"	OCR language (en, fr, de, es)
`table_confidence_threshold`	float	0.8	Table detection confidence threshold
`enable_cell_matching`	bool	True	Enable cell matching in tables
`enable_table_structure`	bool	True	Enable table structure analysis

Codebase Processing Configuration

Parameter	Type	Default	Description
`max_file_size_mb`	float	1.0	Maximum file size to process
`include_hidden_files`	bool	False	Include hidden files
`include_test_files`	bool	True	Include test files
`include_documentation`	bool	True	Include documentation files
`include_config_files`	bool	True	Include configuration files
`enable_structure_analysis`	bool	True	Enable code structure analysis
`enable_ai_summary`	bool	True	Enable AI-powered summaries
`gemini_api_key`	str	None	Google Gemini API key
`gemini_model`	str	"gemini-pro"	Gemini model to use
`custom_ignore_patterns`	List[str]	None	Custom ignore patterns
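The exact matching rules for `custom_ignore_patterns` and `max_file_size_mb` are not specified in this table. The sketch below shows one plausible interpretation using shell-style globs from the standard library `fnmatch` module. The helper `should_skip` is hypothetical, and the assumption that patterns are matched against relative POSIX-style paths is mine.

```python
from fnmatch import fnmatchcase


def should_skip(
    relative_path: str,
    size_bytes: int,
    ignore_patterns: list[str],
    max_file_size_mb: float = 1.0,
) -> bool:
    """Return True if a file is too large or matches any ignore pattern."""
    if size_bytes > max_file_size_mb * 1024 * 1024:
        return True
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
    return any(fnmatchcase(relative_path, pattern) for pattern in ignore_patterns)


patterns = ["*.log", "temp/*"]
print(should_skip("debug.log", 10, patterns))
print(should_skip("src/main.py", 10, patterns))
```

Note that in `fnmatch`-style globs `*` also matches path separators, so `*.log` would match `logs/app.log` as well as `debug.log`.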

Error Handling

The library raises custom exceptions, so each failure mode can be handled separately:

from rag_knowledge_preparation import (
    convert_document_to_markdown,
    RAGKnowledgePreparationError,
    DocumentNotFoundError,
    ConfigurationError,
    ConversionError,
    UnsupportedFormatError
)

try:
    content = convert_document_to_markdown("nonexistent.pdf")
except DocumentNotFoundError as e:
    print(f"Document not found: {e}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except ConfigurationError as e:
    print(f"Configuration error: {e}")

Performance Considerations

Large File Processing

The library includes built-in optimizations for large files:

File Size Limits: Configurable maximum file size limits
Memory Efficiency: Streaming processing for large documents
Batch Processing: Efficient processing of multiple files
Parallel Processing: Concurrent processing where possible
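How the library parallelizes work internally is not described here. As a generic illustration of concurrent batch processing, the sketch below runs a converter over several paths with `concurrent.futures.ThreadPoolExecutor` and collects failures per file instead of aborting the batch. `fake_convert` is a stand-in for a real conversion function, and `convert_concurrently` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_convert(path: str) -> str:
    """Stand-in for a real converter; returns a one-line Markdown heading."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot convert {path}")
    return f"# {path}\n"


def convert_concurrently(paths, convert, max_workers: int = 4):
    """Convert paths in a thread pool; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(convert, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:
                # Keep going so one bad file does not stop the batch.
                errors[path] = exc
    return results, errors


results, errors = convert_concurrently(["a.pdf", "b.docx", "c.bad"], fake_convert)
print(sorted(results), sorted(errors))
```

Threads suit I/O-bound work. CPU-heavy steps such as OCR may benefit more from a process pool, at the cost of extra memory per worker.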

Performance Modes

from rag_knowledge_preparation import CodebaseProcessingConfig

# Use performance-optimized settings
config = CodebaseProcessingConfig(
    max_file_size_mb=0.5,  # Smaller file limit
    enable_ai_summary=False,  # Disable AI for speed
    enable_structure_analysis=False  # Disable structure analysis
)

Examples

Complete Document Processing Pipeline

from pathlib import Path

from rag_knowledge_preparation import (
    convert_folder_to_markdown,
    list_document_configs
)

# List available configurations
configs = list_document_configs()
print("Available configurations:", list(configs.keys()))

# Process entire document folder
results = convert_folder_to_markdown(
    "./documents/",
    processing_preset="high_quality"
)

# Save each result in the current directory as processed_<original file name>.md
for file_path, content in results.items():
    # Path(...).name drops the directory part on any operating system
    output_path = f"processed_{Path(file_path).name}.md"
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)

Complete Codebase Analysis Pipeline

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview,
    list_available_codebase_configs
)

# List available configurations
configs = list_available_codebase_configs()
print("Available configurations:", list(configs.keys()))

# Get overview
overview = get_codebase_overview("./my_project")
print(f"Project: {overview['name']}")
print(f"Files: {overview['total_files']}")
print(f"Languages: {overview['languages']}")

# Analyze structure
structure = analyze_codebase_structure("./my_project")
print(f"Structure analysis complete: {structure['total_files']} files processed")

# Export to Markdown
output_file = export_codebase_to_markdown(
    "./my_project",
    output_file="project_analysis.md",
    gemini_api_key="your-api-key"
)
print(f"Exported to: {output_file}")

Acknowledgments

Docling for document processing capabilities
Tree-sitter for code parsing
Google Gemini for AI-powered summarization
Pygments for syntax highlighting and language detection

Changelog

Version 1.0.0

Initial release
Document processing with OCR support
Codebase analysis and export
AI-powered summarization
Comprehensive configuration options
Multi-language support

Release history

1.0.3: Feb 23, 2026
1.0.2: Dec 1, 2025
1.0.1: Nov 18, 2025
1.0.0: Oct 21, 2025 (this version)

Download files

Source Distribution

rag_knowledge_preparation-1.0.0.tar.gz (33.2 kB)

Uploaded Oct 21, 2025 Source

Built Distribution

rag_knowledge_preparation-1.0.0-py3-none-any.whl (28.9 kB)

Uploaded Oct 21, 2025 Python 3

File details

Details for the file rag_knowledge_preparation-1.0.0.tar.gz.

File metadata

Download URL: rag_knowledge_preparation-1.0.0.tar.gz
Upload date: Oct 21, 2025
Size: 33.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for rag_knowledge_preparation-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`d71efa6d0ffd5e09b722eb84b6ec0b95859b6b955328e26a797ab48354df5198`
MD5	`3d7d3f5e4c873ddf37da96e4383cf786`
BLAKE2b-256	`d18c0a97d106e7d2d7572f7c42ba562d96323bf0c51d386ad7cd67e65dd9b4e9`

File details

Details for the file rag_knowledge_preparation-1.0.0-py3-none-any.whl.

File metadata

Download URL: rag_knowledge_preparation-1.0.0-py3-none-any.whl
Upload date: Oct 21, 2025
Size: 28.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for rag_knowledge_preparation-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`13aaf77282b4423bcbe916063c8579f3fc284ebce784651f5403b0c296b3e171`
MD5	`e792165bb0726fb351f25a5fd967254b`
BLAKE2b-256	`5bac356ef669d00dbdb9f284d9ea007b514ccf63359c5a6f15d72530b8e1e0c8`
