RAG Knowledge Preparation in Python

A comprehensive Python library for preparing knowledge bases for Retrieval-Augmented Generation (RAG) systems. This library now focuses on Gemini OCR-based PDF-to-Markdown conversion alongside intelligent codebase analysis.

Features

Document Processing Features

  • Gemini OCR: Convert PDFs to Markdown with the Gemini 2.5 Pro multimodal model
  • Strict Markdown Output: Page-by-page extraction with a table-aware prompt
  • Async & Parallel: Concurrency controls for multi-page PDFs
  • Batch Processing: Process multiple PDFs or entire folders efficiently
  • Configurable Quality: Presets for fast, table-focused, or high-quality OCR

Codebase Analysis Features

  • Comprehensive Analysis: Extract structure, dependencies, and metadata from codebases
  • Multi-language Support: Python, JavaScript, TypeScript and more
  • AI-Powered Summaries: Generate intelligent code summaries using Google Gemini
  • Project-aware Metadata: Capture project names, aliases, and file aliases for precise RAG context
  • Dependency Analysis: Identify and categorize internal, external, and standard library dependencies
  • Structure Extraction: Parse classes, functions, imports, and code organization
  • Token Estimation: Accurate token counting for RAG optimization

Configuration & Customization

  • Flexible Configuration: Extensive configuration options for both document and codebase processing
  • Preset Configurations: Pre-built configurations for common use cases
  • Custom Metadata: Configurable metadata fields for different analysis needs
  • Performance Optimization: Built-in performance modes for large-scale processing

Installation

Prerequisites

  • Poppler (for pdf2image): brew install poppler (macOS) or sudo apt-get install -y poppler-utils (Linux)
  • Gemini API key: set the GOOGLE_API_KEY environment variable

pip install rag-knowledge-preparation-python
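
Before the first conversion it can help to confirm that both prerequisites are in place. The sketch below is not part of this library; missing_prerequisites is a hypothetical helper that uses only the standard library. It assumes pdf2image needs Poppler's pdftoppm and pdfinfo binaries on PATH, and that the API key is read from GOOGLE_API_KEY as described above.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # pdf2image shells out to Poppler's command-line tools.
    for tool in ("pdftoppm", "pdfinfo"):
        if which(tool) is None:
            problems.append(f"Poppler tool not found on PATH: {tool}")
    # The key must be present and non-empty.
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

The env and which parameters exist only so the check can be exercised without touching the real environment.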

Development Installation

git clone 
cd rag-knowledge-preparation-python
pip install -e ".[dev]"

Quick Start

Document Processing

from rag_knowledge_preparation import (
    convert_document_to_markdown,
    convert_scanned_document_to_markdown,
    convert_documents_batch
)

# Convert a single document (GOOGLE_API_KEY env var must be set)
markdown_content = convert_document_to_markdown("document.pdf")

# Convert a scanned document with OCR
scanned_content = convert_scanned_document_to_markdown("scanned_document.pdf")

# Process multiple documents
results = convert_documents_batch(["doc1.pdf", "doc2.pdf"])

Codebase Analysis

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview
)

# Export entire codebase to Markdown
output_file = export_codebase_to_markdown("./my_project", "codebase_export.md")

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

# Get high-level overview
overview = get_codebase_overview("./my_project")

Document Processing Details

Supported Formats

  • PDF: Gemini OCR currently supports PDF files (rasterized to images under the hood)

Processing Presets

Basic Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Basic, lightweight OCR (lower DPI + fewer tokens)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="basic"
)

Standard Document Processing

# Balanced Gemini OCR (default prompt/DPI)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="standard"
)

OCR-Heavy Processing

# Higher DPI and retries for tough scans
content = convert_document_to_markdown(
    "scanned_document.pdf", 
    processing_preset="ocr_heavy"
)

Table-Focused Processing

# Table-aware prompt for documents with dense tabular content
content = convert_document_to_markdown(
    "data_heavy_document.pdf", 
    processing_preset="table_focused"
)

High-Quality Processing

# Maximum quality with highest DPI and token limits
content = convert_document_to_markdown(
    "important_document.pdf", 
    processing_preset="high_quality"
)

Custom Configuration

from rag_knowledge_preparation import convert_document_to_markdown

# Custom configuration
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="standard",
    dpi=350,
    page_selection="1-5,8",
    temperature=0.15,
    max_output_tokens=5000
)
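
To make the page_selection syntax concrete, here is a minimal sketch of how a string such as "1-5,8" could expand into page numbers. The function parse_page_selection is hypothetical and is not the library's own parser; it assumes pages are 1-based and ranges are inclusive.

```python
def parse_page_selection(selection):
    """Expand a selection such as "1-5,8" into sorted 1-based page numbers."""
    pages = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # inclusive range
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_selection("1-5,8"))  # [1, 2, 3, 4, 5, 8]
```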

Batch Processing

from rag_knowledge_preparation import convert_documents_batch, convert_folder_to_markdown

# Process multiple files
results = convert_documents_batch([
    "document1.pdf",
    "document2.pdf"
])

# Process entire folder
folder_results = convert_folder_to_markdown("./documents/")

Codebase Analysis Usage

Basic Analysis

from rag_knowledge_preparation import analyze_codebase_structure

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

print(f"Total files: {structure['total_files']}")
print(f"Total lines: {structure['total_lines']}")
print(f"Languages: {structure['languages']}")

Export to Markdown

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with default settings
output_file = export_codebase_to_markdown("./my_project")

# Export with custom output file
output_file = export_codebase_to_markdown(
    "./my_project", 
    output_file="my_codebase.md"
)

AI-Powered Analysis

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with AI summaries (requires Gemini API key)
output_file = export_codebase_to_markdown(
    "./my_project",
    gemini_api_key="your-google-api-key",
    gemini_model="gemini-2.5-flash"
)

Codebase Processing Presets

Minimal Processing

from rag_knowledge_preparation import export_codebase_to_markdown

# Minimal processing - basic analysis only
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="minimal"
)

Standard Processing

# Standard processing with full analysis
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="standard"
)

Comprehensive Processing

# Comprehensive processing with all features
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="comprehensive"
)

Configuration Options

from rag_knowledge_preparation import (
    CodebaseProcessingConfig,
    MetadataConfig,
    export_codebase_to_markdown
)

# Custom configuration
config = CodebaseProcessingConfig(
    max_file_size_mb=2.0,
    include_test_files=False,
    include_documentation=True,
    enable_ai_summary=True,
    gemini_api_key="your-api-key",
    custom_ignore_patterns=["*.log", "temp/*"]
)

# Custom metadata configuration
metadata_config = MetadataConfig(
    include_file_path=True,
    include_language=True,
    include_purpose=True,
    include_dependencies=True,
    include_structure=True,
    include_summary=True
)

config.metadata_config = metadata_config

# Use custom configuration
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="standard",  # apply overrides on top of the standard preset
    **config.model_dump()
)

Project-aware metadata & aliases

You can enrich every exported file with project context so downstream RAG systems can ground answers:

from rag_knowledge_preparation import CodebaseProcessingConfig, MetadataConfig

config = CodebaseProcessingConfig(
    project_name="EIC AI Knowledge Utils",
    project_aliases=["EIC-AI", "Knowledge Utils"],
    project_description="Utilities that prep internal knowledge for RAG pipelines.",
    metadata_config=MetadataConfig(
        include_project_description=True,
        include_project_aliases=True,
        include_file_aliases=True
    )
)

The exporter now injects the project name, aliases, optional description, and a set of handy file aliases (for example, Project::path/to/file.py). The Gemini prompt receives this context, yet the summaries stay concise because the Metadata block already lists project and path information.

MetadataConfig adds four toggles: include_project_name, include_project_aliases, and include_file_aliases default to True, while include_project_description defaults to False. Disable any of them if you prefer leaner metadata blocks.
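
As an illustration of the Project::path/to/file.py alias shape, the sketch below builds one alias per project name or alias. This is an assumption about how aliases could be derived, not the exporter's actual logic; build_file_aliases is a hypothetical helper.

```python
from pathlib import PurePosixPath


def build_file_aliases(project_name, project_aliases, relative_path):
    """Return alias strings of the form "<project>::<path>" for one file."""
    path = PurePosixPath(relative_path).as_posix()
    seen = set()
    aliases = []
    for name in [project_name, *project_aliases]:
        alias = f"{name}::{path}"
        if alias not in seen:  # keep order, drop repeats
            seen.add(alias)
            aliases.append(alias)
    return aliases


for alias in build_file_aliases(
    "EIC AI Knowledge Utils", ["EIC-AI", "Knowledge Utils"], "src/utils/io.py"
):
    print(alias)
```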

Advanced Features

Language Detection and Classification

The library automatically detects programming languages and classifies files by purpose:

from rag_knowledge_preparation.codebase_processing.analysis import (
    get_language_from_extension,
    classify_file_by_purpose
)

# Detect language from file extension
language = get_language_from_extension("script.py")  # Returns "python"

# Classify file by purpose
purpose = classify_file_by_purpose("test_utils.py")  # Returns "Tests"

Dependency Analysis

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import analyze_file_dependencies

# Analyze dependencies in a Python file
with open("main.py", "r") as f:
    content = f.read()
dependencies = analyze_file_dependencies(content, Path("main.py"), "python")

print("External packages:", dependencies["external_packages"])
print("Standard library:", dependencies["standard_library"])
print("Internal modules:", dependencies["internal_modules"])

Code Structure Extraction

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import extract_code_structure

# Extract structure from code file
code_content = """
class MyClass:
    def __init__(self):
        pass
    
    def method(self):
        pass
"""
structure = extract_code_structure(Path("example.py"), "python", code_content)

print("Classes:", structure["classes"])
print("Functions:", structure["functions"])
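
The idea behind structure extraction for Python files can be sketched with the standard ast module. This is an illustration only: outline is a hypothetical function, and the shape of its return value is an assumption rather than the format produced by extract_code_structure.

```python
import ast


def outline(source):
    """List top-level classes (with their methods) and top-level functions."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    classes, functions = [], []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, function_types)]
            classes.append({"name": node.name, "methods": methods})
        elif isinstance(node, function_types):
            functions.append(node.name)
    return {"classes": classes, "functions": functions}


sample = """
class MyClass:
    def __init__(self):
        pass

    def method(self):
        pass


def helper():
    pass
"""
print(outline(sample))
```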

Token Estimation

from rag_knowledge_preparation.codebase_processing.analysis import estimate_token_count

# Estimate tokens in text
token_count = estimate_token_count("Hello, world!")
print(f"Estimated tokens: {token_count}")

# Estimate tokens in code
code_tokens = estimate_token_count("""
def hello():
    print("Hello, world!")
""")

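When the library is not at hand, a common rule of thumb is roughly four characters per token for English text. The sketch below applies that heuristic; rough_token_estimate is a hypothetical helper, and estimate_token_count may use a different method and return different numbers.

```python
import math


def rough_token_estimate(text, chars_per_token=4.0):
    """Approximate a token count from character length (heuristic only)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(rough_token_estimate("Hello, world!"))  # 13 characters -> 4
```
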
Configuration Reference

Document Processing Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "gemini-2.5-pro" | Gemini multimodal model for OCR |
| prompt | str | Markdown prompt | Per-page OCR extraction prompt |
| temperature | float | 0.2 | Model temperature |
| max_output_tokens | int | 4096 | Max tokens per page generation |
| dpi | int | 300 | DPI used when rasterizing PDFs |
| page_selection | Optional[str] | None | Page ranges, e.g. "1-3,5" |
| parallel_concurrency | int | 5 | Pages processed concurrently |
| max_retries | int | 4 | Retry attempts for transient errors |
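
The interaction between parallel_concurrency and max_retries can be sketched with asyncio. This is a generic pattern under stated assumptions, not the library's internals: ocr_pages, call_with_retry, base_delay, and fake_ocr_page are hypothetical names, max_retries is taken to mean retries after the first attempt, and every exception is treated as transient for brevity.

```python
import asyncio
import random


async def call_with_retry(page_number, ocr_page, max_retries, base_delay):
    """Await ocr_page(page_number), retrying with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await ocr_page(page_number)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)
            # Small jitter so retries from many pages do not line up.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def ocr_pages(page_numbers, ocr_page, parallel_concurrency=5,
                    max_retries=4, base_delay=0.5):
    """Process pages with at most parallel_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(parallel_concurrency)

    async def bounded(page_number):
        async with semaphore:
            return await call_with_retry(page_number, ocr_page, max_retries, base_delay)

    # gather preserves input order, so results line up with page_numbers.
    return await asyncio.gather(*(bounded(n) for n in page_numbers))


async def fake_ocr_page(page_number):
    await asyncio.sleep(0)  # stand-in for a network call
    return f"# Page {page_number}"


print(asyncio.run(ocr_pages([1, 2, 3], fake_ocr_page)))
# ['# Page 1', '# Page 2', '# Page 3']
```

A production version would retry only on errors known to be transient, such as rate limits or timeouts.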

Codebase Processing Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_file_size_mb | float | 1.0 | Maximum file size to process, in MB |
| include_hidden_files | bool | False | Include hidden files |
| include_test_files | bool | True | Include test files |
| include_documentation | bool | True | Include documentation files |
| include_config_files | bool | True | Include configuration files |
| include_static_assets | bool | False | Include binary/static assets (images, fonts, etc.) |
| enable_structure_analysis | bool | True | Enable code structure analysis |
| enable_ai_summary | bool | True | Enable AI-powered summaries |
| gemini_api_key | str | None | Google Gemini API key |
| gemini_model | str | "gemini-2.5-flash" | Gemini model to use |
| custom_ignore_patterns | List[str] | None | Custom ignore patterns |
| project_name | Optional[str] | None | Override for the primary project name used in summaries and metadata |
| project_aliases | List[str] | [] | Additional aliases that will also be emitted in metadata |
| project_description | Optional[str] | None | Short description included in metadata when enabled |
| duplicate_tracker | Optional[DuplicateTracker] | None | Track duplicate files across multiple exports |
| duplicate_content_strategy | "full" / "link" | "full" | How repeated files are emitted: "full" keeps their content, "link" replaces it with a link to the shared report |
| exclude_directories | List[str] | None | Directory names to skip entirely (case-insensitive) |
| exclude_file_extensions | List[str] | None | Extensions (e.g. .log) to skip |
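
To show how custom_ignore_patterns, exclude_directories, and exclude_file_extensions might combine, here is a standard-library sketch. The helper should_skip is hypothetical and the matching rules are assumptions (glob patterns checked against both the relative path and the file name, extensions compared case-insensitively); only the case-insensitive directory match comes from the table above.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def should_skip(relative_path, ignore_patterns=(), exclude_directories=(),
                exclude_file_extensions=()):
    """Return True if a file should be left out of the export."""
    path = PurePosixPath(relative_path)
    # Directory names are compared case-insensitively.
    excluded_dirs = {name.lower() for name in exclude_directories}
    if any(part.lower() in excluded_dirs for part in path.parts[:-1]):
        return True
    # Assumption: extensions are also compared case-insensitively.
    if path.suffix.lower() in {ext.lower() for ext in exclude_file_extensions}:
        return True
    # Glob patterns may target the whole relative path or just the file name.
    return any(
        fnmatch(path.as_posix(), pattern) or fnmatch(path.name, pattern)
        for pattern in ignore_patterns
    )


patterns = ["*.log", "temp/*"]
print(should_skip("temp/cache.bin", ignore_patterns=patterns))  # True
print(should_skip("src/main.py", ignore_patterns=patterns))  # False
```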

Codebase Presets

| Preset | Description | Key Differences |
|---|---|---|
| minimal | Focus on essential source files only | Skips tests/docs/configs, disables AI summaries, emits only file path + language metadata |
| standard | Balanced default for most repos | Includes tests/docs/configs, AI summaries enabled, metadata covers project/file aliases and structure |
| comprehensive | Deep dive for large audits | Higher size limit (5 MB), hidden files allowed, emits every metadata field (dates, encoding, git info, etc.) |

Use processing_preset="<name>" when calling export_codebase_to_markdown. You can still override any field via **CodebaseProcessingConfig(...).model_dump() if a preset needs tweaks.

Tracking shared files

You can collect repeated scripts/configs that appear across many projects using DuplicateTracker and optionally replace duplicated content with shared references:

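from pathlib import Path

from rag_knowledge_preparation import export_codebase_to_markdown, CodebaseProcessingConfig
from rag_knowledge_preparation.codebase_processing.utils import DuplicateTracker

# Placeholder project directories; replace with your own paths
projects = [Path("./project_a"), Path("./project_b")]

tracker = DuplicateTracker(min_occurrences=2)

for project in projects: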
    export_codebase_to_markdown(
        project,
        output_file=f"{project.name}.md",
        processing_preset="standard",
        duplicate_tracker=tracker,
        duplicate_content_strategy="link",  # swap file bodies with references to shared digest
    )

if tracker.has_duplicates():
    tracker.write_markdown_report("shared_files.md")

The generated shared_files.md lists every repeated file, its digest, language, and all project locations, so you can link to a single canonical snippet instead of duplicating boilerplate. Duplicate detection uses a raw hash of the file contents (byte-for-byte match). Set duplicate_content_strategy="link" in CodebaseProcessingConfig to replace duplicate files in the per-project exports with a short sentence that points to the shared digest instead of embedding their full content.
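
Byte-for-byte duplicate detection can be illustrated with a content hash. The sketch below is not the DuplicateTracker implementation; find_duplicates is a hypothetical function, and SHA-256 is an assumption, since the hash algorithm is not stated here.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files, min_occurrences=2):
    """Map content digest -> locations, for content seen often enough.

    `files` is an iterable of (location, content_bytes) pairs.
    """
    by_digest = defaultdict(list)
    for location, content in files:
        by_digest[hashlib.sha256(content).hexdigest()].append(location)
    return {
        digest: locations
        for digest, locations in by_digest.items()
        if len(locations) >= min_occurrences
    }


files = [
    ("project_a/setup.cfg", b"[flake8]\nmax-line-length = 100\n"),
    ("project_b/setup.cfg", b"[flake8]\nmax-line-length = 100\n"),
    ("project_c/setup.cfg", b"[flake8]\r\nmax-line-length = 100\r\n"),
]
for digest, locations in find_duplicates(files).items():
    print(digest[:12], locations)
```

Because the match is on raw bytes, the third file is not reported as a duplicate: it differs only in line endings, but that is enough to change the hash.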

Error Handling

The library provides comprehensive error handling with custom exceptions:

from rag_knowledge_preparation import (
    convert_document_to_markdown,
    RAGKnowledgePreparationError,
    DocumentNotFoundError,
    ConfigurationError,
    ConversionError,
    UnsupportedFormatError
)

try:
    content = convert_document_to_markdown("nonexistent.pdf")
except DocumentNotFoundError as e:
    print(f"Document not found: {e}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except ConfigurationError as e:
    print(f"Configuration error: {e}")

Performance Considerations

Large File Processing

The library includes built-in optimizations for large files:

  • File Size Limits: Configurable maximum file size limits
  • Memory Efficiency: Streaming processing for large documents
  • Batch Processing: Efficient processing of multiple files
  • Parallel Processing: Concurrent processing where possible

Performance Modes

from rag_knowledge_preparation import CodebaseProcessingConfig

# Use performance-optimized settings
config = CodebaseProcessingConfig(
    max_file_size_mb=0.5,  # Smaller file limit
    enable_ai_summary=False,  # Disable AI for speed
    enable_structure_analysis=False  # Disable structure analysis
)

Examples

Complete Document Processing Pipeline

from pathlib import Path

from rag_knowledge_preparation import (
    convert_folder_to_markdown,
    list_document_configs
)

# List available configurations
configs = list_document_configs()
print("Available configurations:", list(configs.keys()))

# Process entire document folder
results = convert_folder_to_markdown(
    "./documents/",
    processing_preset="high_quality"
)

# Save results (report.pdf becomes processed_report.md)
for file_path, content in results.items():
    output_path = f"processed_{Path(file_path).stem}.md"
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)

Complete Codebase Analysis Pipeline

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview,
    list_available_codebase_configs
)

# List available configurations
configs = list_available_codebase_configs()
print("Available configurations:", list(configs.keys()))

# Get overview
overview = get_codebase_overview("./my_project")
print(f"Project: {overview['name']}")
print(f"Files: {overview['total_files']}")
print(f"Languages: {overview['languages']}")

# Analyze structure
structure = analyze_codebase_structure("./my_project")
print(f"Structure analysis complete: {structure['total_files']} files processed")

# Export to Markdown
output_file = export_codebase_to_markdown(
    "./my_project",
    output_file="project_analysis.md",
    gemini_api_key="your-api-key"
)
print(f"Exported to: {output_file}")

Changelog

Version 1.0.0

  • Initial release
  • Document processing with OCR support
  • Codebase analysis and export
  • AI-powered summarization
  • Comprehensive configuration options
  • Multi-language support
