RAG Knowledge Preparation in Python

A comprehensive Python library for preparing knowledge bases for Retrieval-Augmented Generation (RAG) systems. This library now focuses on Gemini OCR-based PDF-to-Markdown conversion alongside intelligent codebase analysis.

Features

Document Processing Features

  • Multi-format Intake: PDF, images (PNG/JPG/TIFF/BMP/GIF/WebP), DOCX, and text/Markdown/CSV
  • Gemini OCR: Convert PDFs/images to Markdown via Gemini 2.5 Pro multimodal
  • Strict Markdown Output: Page-by-page extraction with a table-aware prompt
  • Async & Parallel: Concurrency controls for multi-page PDFs
  • Batch Processing: Process multiple PDFs or entire folders efficiently
  • Configurable Quality: Presets for fast, table-focused, or high-quality OCR

Codebase Analysis Features

  • Comprehensive Analysis: Extract structure, dependencies, and metadata from codebases
  • Multi-language Support: Python, JavaScript, TypeScript and more
  • AI-Powered Summaries: Generate intelligent code summaries using Google Gemini
  • Project-aware Metadata: Capture project names, aliases, and file aliases for precise RAG context
  • Dependency Analysis: Identify and categorize internal, external, and standard library dependencies
  • Structure Extraction: Parse classes, functions, imports, and code organization
  • Token Estimation: Accurate token counting for RAG optimization

Configuration & Customization

  • Flexible Configuration: Extensive configuration options for both document and codebase processing
  • Preset Configurations: Pre-built configurations for common use cases
  • Custom Metadata: Configurable metadata fields for different analysis needs
  • Performance Optimization: Built-in performance modes for large-scale processing

Installation

Prerequisites

  • Poppler (for pdf2image): brew install poppler (macOS) or sudo apt-get install -y poppler-utils (Linux)
  • Gemini API key: set the GOOGLE_API_KEY environment variable (needed only for OCR on PDF/images)

pip install rag-knowledge-preparation-python
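
Before running OCR, it can help to confirm both prerequisites are in place. The following is a minimal preflight sketch, not part of the library: check_prerequisites is a hypothetical helper, and it assumes Poppler is detectable through its pdftoppm binary, which pdf2image invokes.

```python
import os
import shutil


def check_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means all clear."""
    env = os.environ if env is None else env
    problems = []
    # pdf2image shells out to Poppler's pdftoppm, so it must be on PATH.
    if shutil.which("pdftoppm") is None:
        problems.append("Poppler not found: install poppler (macOS) or poppler-utils (Linux)")
    # The key is only needed for OCR on PDFs and images.
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set (required for PDF/image OCR)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

DOCX and text inputs do not need either prerequisite, so a warning here is only a blocker for PDF and image conversion.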

Development Installation

git clone 
cd rag-knowledge-preparation-python
pip install -e ".[dev]"

Quick Start

Document Processing

from rag_knowledge_preparation import (
    convert_document_to_markdown,
    convert_scanned_document_to_markdown,
    convert_documents_batch
)

# Convert a single document (GOOGLE_API_KEY env var must be set)
markdown_content = convert_document_to_markdown("document.pdf")

# Convert a scanned document with OCR
scanned_content = convert_scanned_document_to_markdown("scanned_document.pdf")

# Process multiple documents
results = convert_documents_batch(["doc1.pdf", "doc2.pdf"])

# DOCX/text/CSV/Markdown are handled locally (no API key needed)
docx_md = convert_document_to_markdown("report.docx")
notes_md = convert_document_to_markdown("notes.md")

# Images go through Gemini OCR (needs GOOGLE_API_KEY)
image_md = convert_document_to_markdown("whiteboard.png")

Codebase Analysis

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview
)

# Export entire codebase to Markdown
output_file = export_codebase_to_markdown("./my_project", "codebase_export.md")

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

# Get high-level overview
overview = get_codebase_overview("./my_project")

Document Processing Details

Supported Formats

  • PDF: Gemini OCR (rasterized to images under the hood)
  • Images: PNG, JPG/JPEG, TIFF, BMP, GIF, WebP (Gemini OCR)
  • DOCX: Parsed to Markdown via python-docx (no OCR required)
  • Text/Markdown/CSV: Read directly with encoding auto-detection (no OCR required)
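
The split above between OCR-backed and locally parsed formats can be sketched as a simple extension lookup. This is an illustration only: needs_api_key and the two extension sets are hypothetical names, and the exact extensions the library accepts (for example .tif or .markdown) are an assumption.

```python
from pathlib import Path

# Extension groups derived from the list above; the exact sets are an assumption.
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif", ".webp"}
LOCAL_EXTENSIONS = {".docx", ".txt", ".md", ".csv"}


def needs_api_key(path):
    """Return True if the file would go through Gemini OCR, False if parsed locally."""
    suffix = Path(path).suffix.lower()
    if suffix in OCR_EXTENSIONS:
        return True
    if suffix in LOCAL_EXTENSIONS:
        return False
    raise ValueError(f"Unsupported format: {suffix or path}")


for name in ["scan.PDF", "report.docx", "whiteboard.png"]:
    print(name, "->", "Gemini OCR" if needs_api_key(name) else "local parsing")
```

A check like this makes it possible to fail fast on a missing GOOGLE_API_KEY only when a batch actually contains PDFs or images.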

Processing Presets

Basic Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Basic, lightweight OCR (lower DPI + fewer tokens)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="basic"
)

Standard Document Processing

# Balanced Gemini OCR (default prompt/DPI)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="standard"
)

OCR-Heavy Processing

# Higher DPI and retries for tough scans
content = convert_document_to_markdown(
    "scanned_document.pdf", 
    processing_preset="ocr_heavy"
)

Table-Focused Processing

# Table-aware prompt for documents with dense tabular content
content = convert_document_to_markdown(
    "data_heavy_document.pdf", 
    processing_preset="table_focused"
)

High-Quality Processing

# Maximum quality with highest DPI and token limits
content = convert_document_to_markdown(
    "important_document.pdf", 
    processing_preset="high_quality"
)

Custom Configuration

from rag_knowledge_preparation import convert_document_to_markdown

# Custom configuration
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="standard",
    dpi=350,
    page_selection="1-5,8",
    temperature=0.15,
    max_output_tokens=5000
)
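
To make the page_selection syntax concrete, here is a sketch of how a string such as "1-5,8" could be expanded into page numbers. parse_page_selection is a hypothetical helper written for this example; the library's own parsing and validation rules may differ.

```python
def parse_page_selection(selection):
    """Expand a string such as "1-5,8" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "1-5" covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"Invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"Invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_selection("1-5,8"))  # [1, 2, 3, 4, 5, 8]
```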

Batch Processing

from rag_knowledge_preparation import convert_documents_batch, convert_folder_to_markdown

# Process multiple files
results = convert_documents_batch([
    "document1.pdf",
    "document2.pdf"
])

# Process entire folder
folder_results = convert_folder_to_markdown("./documents/")

Working with non-PDF inputs

# Images -> Gemini OCR (needs GOOGLE_API_KEY)
image_markdown = convert_document_to_markdown("whiteboard.png")

# DOCX -> parsed locally, no OCR/API key required
docx_markdown = convert_document_to_markdown("report.docx")

# Text/Markdown/CSV -> pass-through
notes_markdown = convert_document_to_markdown("notes.txt")

Folder/batch helpers (convert_documents_batch, convert_folder_to_markdown) automatically pick up all supported extensions.

Codebase Analysis Usage

Basic Analysis

from rag_knowledge_preparation import analyze_codebase_structure

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

print(f"Total files: {structure['total_files']}")
print(f"Total lines: {structure['total_lines']}")
print(f"Languages: {structure['languages']}")

Export to Markdown

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with default settings
output_file = export_codebase_to_markdown("./my_project")

# Export with custom output file
output_file = export_codebase_to_markdown(
    "./my_project", 
    output_file="my_codebase.md"
)

AI-Powered Analysis

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with AI summaries (requires Gemini API key)
output_file = export_codebase_to_markdown(
    "./my_project",
    gemini_api_key="your-google-api-key",
    gemini_model="gemini-2.5-flash"
)

Codebase Processing Presets

Minimal Processing

from rag_knowledge_preparation import export_codebase_to_markdown

# Minimal processing - basic analysis only
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="minimal"
)

Standard Processing

# Standard processing with full analysis
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="standard"
)

Comprehensive Processing

# Comprehensive processing with all features
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="comprehensive"
)

Configuration Options

from rag_knowledge_preparation import (
    CodebaseProcessingConfig,
    MetadataConfig,
    export_codebase_to_markdown
)

# Custom configuration
config = CodebaseProcessingConfig(
    max_file_size_mb=2.0,
    include_test_files=False,
    include_documentation=True,
    enable_ai_summary=True,
    gemini_api_key="your-api-key",
    custom_ignore_patterns=["*.log", "temp/*"]
)

# Custom metadata configuration
metadata_config = MetadataConfig(
    include_file_path=True,
    include_language=True,
    include_purpose=True,
    include_dependencies=True,
    include_structure=True,
    include_summary=True
)

config.metadata_config = metadata_config

# Use custom configuration
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="standard",  # apply overrides on top of the standard preset
    **config.model_dump()
)

Project-aware metadata & aliases

You can enrich every exported file with project context so downstream RAG systems can ground answers:

from rag_knowledge_preparation import CodebaseProcessingConfig, MetadataConfig

config = CodebaseProcessingConfig(
    project_name="EIC AI Knowledge Utils",
    project_aliases=["EIC-AI", "Knowledge Utils"],
    project_description="Utilities that prep internal knowledge for RAG pipelines.",
    metadata_config=MetadataConfig(
        include_project_description=True,
        include_project_aliases=True,
        include_file_aliases=True
    )
)

The exporter now injects the project name, aliases, optional description, and a set of handy file aliases (for example, Project::path/to/file.py). The Gemini prompt receives this context, yet the summaries stay concise because the Metadata block already lists project and path information.

MetadataConfig ships with four new toggles: include_project_name, include_project_aliases, include_project_description, and include_file_aliases. All default to True except include_project_description, which defaults to False. Disable them if you prefer leaner metadata blocks.
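
As an illustration of the Project::path/to/file.py alias format mentioned above, the sketch below builds one alias per project name. build_file_aliases is a hypothetical helper; how the exporter actually derives its aliases is not documented here, so treat this as an assumption.

```python
from pathlib import PurePosixPath


def build_file_aliases(project_name, project_aliases, relative_path):
    """Return aliases such as "Project::path/to/file.py" for each project name."""
    path = PurePosixPath(relative_path).as_posix()
    names = [project_name, *project_aliases]
    # dict.fromkeys drops repeated names while preserving their order.
    return [f"{name}::{path}" for name in dict.fromkeys(names)]


for alias in build_file_aliases("EIC AI Knowledge Utils", ["EIC-AI"], "src/utils.py"):
    print(alias)
```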

Advanced Features

Language Detection and Classification

The library automatically detects programming languages and classifies files by purpose:

from rag_knowledge_preparation.codebase_processing.analysis import (
    get_language_from_extension,
    classify_file_by_purpose
)

# Detect language from file extension
language = get_language_from_extension("script.py")  # Returns "python"

# Classify file by purpose
purpose = classify_file_by_purpose("test_utils.py")  # Returns "Tests"
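
For intuition, extension-based detection and purpose classification can be approximated with a lookup table and a few filename rules. This is a simplified sketch, not the library's implementation: the function names, the mapping, and every category label other than "Tests" are assumptions.

```python
from pathlib import Path

# A deliberately small mapping; the library likely covers more languages.
EXTENSION_TO_LANGUAGE = {".py": "python", ".js": "javascript", ".ts": "typescript"}


def language_from_extension(filename):
    """Look up the language by file suffix, falling back to "unknown"."""
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower(), "unknown")


def classify_purpose(filename):
    """Classify a file by naming convention (labels other than "Tests" are hypothetical)."""
    name = Path(filename).name.lower()
    stem = Path(name).stem
    if stem.startswith("test_") or stem.endswith("_test"):
        return "Tests"
    if name.endswith((".md", ".rst")):
        return "Documentation"
    if name.endswith((".json", ".yaml", ".yml", ".toml", ".ini")):
        return "Configuration"
    return "Source"


print(language_from_extension("script.py"), classify_purpose("test_utils.py"))
```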

Dependency Analysis

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import analyze_file_dependencies

# Analyze dependencies in a Python file
with open("main.py", "r", encoding="utf-8") as f:
    content = f.read()
dependencies = analyze_file_dependencies(content, Path("main.py"), "python")

print("External packages:", dependencies["external_packages"])
print("Standard library:", dependencies["standard_library"])
print("Internal modules:", dependencies["internal_modules"])

Code Structure Extraction

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import extract_code_structure

# Extract structure from code file
code_content = """
class MyClass:
    def __init__(self):
        pass
    
    def method(self):
        pass
"""
structure = extract_code_structure(Path("example.py"), "python", code_content)

print("Classes:", structure["classes"])
print("Functions:", structure["functions"])
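
For Python files, a comparable outline can be obtained with the standard ast module. This is a minimal sketch under the assumption that only top-level classes, their methods, and top-level functions matter; outline is a hypothetical helper and the shape of its result is not the library's.

```python
import ast


def outline(source):
    """Return class names (with their methods) and top-level function names."""
    tree = ast.parse(source)
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    classes = {}
    functions = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            # Collect only the methods defined directly on the class.
            classes[node.name] = [item.name for item in node.body if isinstance(item, function_types)]
        elif isinstance(node, function_types):
            functions.append(node.name)
    return {"classes": classes, "functions": functions}


print(outline("class MyClass:\n    def method(self):\n        pass\n\ndef helper():\n    pass\n"))
# {'classes': {'MyClass': ['method']}, 'functions': ['helper']}
```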

Token Estimation

from rag_knowledge_preparation.codebase_processing.analysis import estimate_token_count

# Estimate tokens in text
token_count = estimate_token_count("Hello, world!")
print(f"Estimated tokens: {token_count}")

# Estimate tokens in code
code_tokens = estimate_token_count("""
def hello():
    print("Hello, world!")
""")
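
When the library is not available, a rough estimate can be made with the common rule of thumb of about four characters per token. This heuristic is an assumption for illustration and is not how estimate_token_count is documented to work; rough_token_count is a hypothetical name.

```python
import math


def rough_token_count(text, chars_per_token=4.0):
    """Estimate tokens as ceil(len(text) / chars_per_token); 0 for empty text."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(rough_token_count("Hello, world!"))  # 13 characters / 4 -> 4
```

Actual counts depend on the model's tokenizer, so use a heuristic like this only for coarse budgeting.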

Configuration Reference

Document Processing Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "gemini-2.5-pro" | Gemini multimodal model for OCR |
| prompt | str | Markdown prompt | Per-page OCR extraction prompt |
| temperature | float | 0.2 | Model temperature |
| max_output_tokens | int | 4096 | Max tokens per page generation |
| dpi | int | 300 | DPI used when rasterizing PDFs |
| page_selection | Optional[str] | None | Page ranges, e.g. "1-3,5" |
| parallel_concurrency | int | 5 | Pages processed concurrently |
| max_retries | int | 4 | Retry attempts for transient errors |
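
To show how parallel_concurrency and max_retries might interact, here is a sketch using asyncio: a semaphore caps the number of pages in flight, and each page is retried with exponential backoff. process_pages, with_retries, and fake_ocr are hypothetical names; the library's actual scheduling and its choice of which errors count as transient are not described here.

```python
import asyncio
import random


async def with_retries(operation, max_retries=4, base_delay=0.01):
    """Run `operation`, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except Exception:  # a real client would retry only transient errors
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


async def process_pages(pages, handle_page, parallel_concurrency=5, max_retries=4):
    """Process pages concurrently, never running more than `parallel_concurrency` at once."""
    semaphore = asyncio.Semaphore(parallel_concurrency)

    async def worker(page):
        async with semaphore:
            return await with_retries(lambda: handle_page(page), max_retries)

    # gather preserves input order, so results line up with `pages`.
    return await asyncio.gather(*(worker(page) for page in pages))


async def fake_ocr(page):
    """Stand-in for a per-page OCR request."""
    await asyncio.sleep(0)
    return f"page {page}"


print(asyncio.run(process_pages([1, 2, 3], fake_ocr)))  # ['page 1', 'page 2', 'page 3']
```

Raising the concurrency shortens wall-clock time for long PDFs but increases the chance of hitting API rate limits, which is where the retry budget comes in.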

Codebase Processing Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_file_size_mb | float | 1.0 | Maximum file size to process |
| include_hidden_files | bool | False | Include hidden files |
| include_test_files | bool | True | Include test files |
| include_documentation | bool | True | Include documentation files |
| include_config_files | bool | True | Include configuration files |
| include_static_assets | bool | False | Include binary/static assets (images, fonts, etc.) |
| enable_structure_analysis | bool | True | Enable code structure analysis |
| enable_ai_summary | bool | True | Enable AI-powered summaries |
| gemini_api_key | str | None | Google Gemini API key |
| gemini_model | str | "gemini-2.5-flash" | Gemini model to use |
| custom_ignore_patterns | List[str] | None | Custom ignore patterns |
| project_name | Optional[str] | None | Override for the primary project name used in summaries and metadata |
| project_aliases | List[str] | [] | Additional aliases that will also be emitted in metadata |
| project_description | Optional[str] | None | Short description included in metadata when enabled |
| duplicate_tracker | Optional[DuplicateTracker] | None | Track duplicate files across multiple exports |
| duplicate_content_strategy | "full"/"link" | "full" | "full" embeds repeated file content; "link" replaces it with a link to the shared report |
| exclude_directories | List[str] | None | Directory names to skip entirely (case-insensitive) |
| exclude_file_extensions | List[str] | None | Extensions (e.g. .log) to skip |

Codebase Presets

| Preset | Description | Key Differences |
|---|---|---|
| minimal | Focus on essential source files only | Skips tests/docs/configs, disables AI summaries, emits only file path + language metadata |
| standard | Balanced default for most repos | Includes tests/docs/configs, AI summaries enabled, metadata covers project/file aliases and structure |
| comprehensive | Deep dive for large audits | Higher size limit (5 MB), hidden files allowed, emits every metadata field (dates, encoding, git info, etc.) |

Use processing_preset="<name>" when calling export_codebase_to_markdown. You can still override any field via **CodebaseProcessingConfig(...).model_dump() if a preset needs tweaks.
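
The preset-plus-overrides behavior can be pictured as a dictionary merge in which explicit overrides win. The sketch below is an illustration only: PRESETS lists just the differences named in the table above, and resolve_settings is a hypothetical helper, not the library's merge logic.

```python
# Only the differences listed in the presets table; other fields are omitted.
PRESETS = {
    "minimal": {
        "include_test_files": False,
        "include_documentation": False,
        "include_config_files": False,
        "enable_ai_summary": False,
    },
    "standard": {
        "include_test_files": True,
        "include_documentation": True,
        "include_config_files": True,
        "enable_ai_summary": True,
    },
    "comprehensive": {"max_file_size_mb": 5.0, "include_hidden_files": True},
}


def resolve_settings(preset, **overrides):
    """Start from the preset's values, then let explicit overrides take precedence."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    return {**PRESETS[preset], **overrides}


print(resolve_settings("minimal", enable_ai_summary=True))
```

If the library merges settings this way, note that model_dump() emits every field, including defaults, so each of them would override the preset. Pydantic's model_dump(exclude_unset=True) limits the output to fields that were set explicitly, which may be closer to the intent.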

Tracking shared files

You can collect repeated scripts/configs that appear across many projects using DuplicateTracker and optionally replace duplicated content with shared references:

from pathlib import Path

from rag_knowledge_preparation import export_codebase_to_markdown
from rag_knowledge_preparation.codebase_processing.utils import DuplicateTracker

# Placeholder project roots; replace with your own paths
projects = [Path("./project_a"), Path("./project_b")]

tracker = DuplicateTracker(min_occurrences=2)

for project in projects:
    export_codebase_to_markdown(
        project,
        output_file=f"{project.name}.md",
        processing_preset="standard",
        duplicate_tracker=tracker,
        duplicate_content_strategy="link",  # swap file bodies with references to shared digest
    )

# Write the shared report only if at least one file repeats across projects
if tracker.has_duplicates():
    tracker.write_markdown_report("shared_files.md")

The generated shared_files.md lists every repeated file, its digest, language, and all project locations, so you can link to a single canonical snippet instead of duplicating boilerplate. Duplicate detection uses a raw hash of the file contents (byte-for-byte match). Set duplicate_content_strategy="link" in CodebaseProcessingConfig to replace duplicate files in the per-project exports with a short sentence that points to the shared digest instead of embedding their full content.
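
Byte-for-byte duplicate detection of the kind described above can be sketched with hashlib. find_duplicates is a hypothetical helper, and SHA-256 is an assumption, since the text only says a raw hash of the file contents is used.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(project_roots, min_occurrences=2):
    """Map each digest to its file paths, keeping digests seen at least `min_occurrences` times."""
    by_digest = defaultdict(list)
    for root in project_roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                # Hash the raw bytes so only exact copies share a digest.
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                by_digest[digest].append(path)
    return {digest: paths for digest, paths in by_digest.items() if len(paths) >= min_occurrences}


if __name__ == "__main__":
    for digest, paths in find_duplicates(["."]).items():
        print(digest[:12], [str(p) for p in paths])
```

Because the comparison is on raw bytes, files that differ only in line endings or trailing whitespace are treated as distinct.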

Error Handling

The library provides comprehensive error handling with custom exceptions:

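from rag_knowledge_preparation import (
    convert_document_to_markdown,
    RAGKnowledgePreparationError,
    DocumentNotFoundError,
    ConfigurationError,
    ConversionError,
    UnsupportedFormatError
)

try:
    content = convert_document_to_markdown("nonexistent.pdf")
except DocumentNotFoundError as e:
    print(f"Document not found: {e}")
except UnsupportedFormatError as e:
    print(f"Unsupported format: {e}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except ConfigurationError as e:
    print(f"Configuration error: {e}")
except RAGKnowledgePreparationError as e:
    # Fallback for any other library error (assumes this is the shared base class)
    print(f"Processing failed: {e}")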

Performance Considerations

Large File Processing

The library includes built-in optimizations for large files:

  • File Size Limits: Configurable maximum file size limits
  • Memory Efficiency: Streaming processing for large documents
  • Batch Processing: Efficient processing of multiple files
  • Parallel Processing: Concurrent processing where possible

Performance Modes

from rag_knowledge_preparation import CodebaseProcessingConfig

# Use performance-optimized settings
config = CodebaseProcessingConfig(
    max_file_size_mb=0.5,  # Smaller file limit
    enable_ai_summary=False,  # Disable AI for speed
    enable_structure_analysis=False  # Disable structure analysis
)

Examples

Complete Document Processing Pipeline

from pathlib import Path

from rag_knowledge_preparation import (
    convert_folder_to_markdown,
    list_document_configs
)

# List available configurations
configs = list_document_configs()
print("Available configurations:", list(configs.keys()))

# Process entire document folder
results = convert_folder_to_markdown(
    "./documents/",
    processing_preset="high_quality"
)

# Save results (Path(...).name works with both "/" and "\\" separators)
for file_path, content in results.items():
    output_path = f"processed_{Path(file_path).name}.md"
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)

Complete Codebase Analysis Pipeline

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview,
    list_available_codebase_configs
)

# List available configurations
configs = list_available_codebase_configs()
print("Available configurations:", list(configs.keys()))

# Get overview
overview = get_codebase_overview("./my_project")
print(f"Project: {overview['name']}")
print(f"Files: {overview['total_files']}")
print(f"Languages: {overview['languages']}")

# Analyze structure
structure = analyze_codebase_structure("./my_project")
print(f"Structure analysis complete: {structure['total_files']} files processed")

# Export to Markdown
output_file = export_codebase_to_markdown(
    "./my_project",
    output_file="project_analysis.md",
    gemini_api_key="your-api-key"
)
print(f"Exported to: {output_file}")

Changelog

Version 1.0.0

  • Initial release
  • Document processing with OCR support
  • Codebase analysis and export
  • AI-powered summarization
  • Comprehensive configuration options
  • Multi-language support
