RAG Knowledge Preparation in Python

RAG Knowledge Preparation Python

A comprehensive Python library for preparing knowledge bases for Retrieval-Augmented Generation (RAG) systems. This library now focuses on Gemini OCR-based PDF-to-Markdown conversion alongside intelligent codebase analysis.

Features

Document Processing Features

Multi-format Intake: PDF, images (PNG/JPG/TIFF/BMP/GIF/WebP), DOCX, and text/Markdown/CSV
Gemini OCR: Convert PDFs/images to Markdown via Gemini 2.5 Pro multimodal
Strict Markdown Output: Page-by-page extraction with a table-aware prompt
Async & Parallel: Concurrency controls for multi-page PDFs
Batch Processing: Process multiple PDFs or entire folders efficiently
Configurable Quality: Presets for fast, table-focused, or high-quality OCR

Codebase Analysis Features

Comprehensive Analysis: Extract structure, dependencies, and metadata from codebases
Multi-language Support: Python, JavaScript, TypeScript and more
AI-Powered Summaries: Generate intelligent code summaries using Google Gemini
Project-aware Metadata: Capture project names, aliases, and file aliases for precise RAG context
Dependency Analysis: Identify and categorize internal, external, and standard library dependencies
Structure Extraction: Parse classes, functions, imports, and code organization
Token Estimation: Accurate token counting for RAG optimization

Configuration & Customization

Flexible Configuration: Extensive configuration options for both document and codebase processing
Preset Configurations: Pre-built configurations for common use cases
Custom Metadata: Configurable metadata fields for different analysis needs
Performance Optimization: Built-in performance modes for large-scale processing

Installation

Prerequisites

Poppler (for pdf2image): brew install poppler (macOS) or sudo apt-get install -y poppler-utils (Linux)
Gemini API key: set the GOOGLE_API_KEY environment variable (needed only for OCR on PDF/images)

pip install rag-knowledge-preparation

Development Installation

git clone 
cd rag-knowledge-preparation-python
pip install -e ".[dev]"

Quick Start

Document Processing

from rag_knowledge_preparation import (
    convert_document_to_markdown,
    convert_scanned_document_to_markdown,
    convert_documents_batch
)

# Convert a single document (GOOGLE_API_KEY env var must be set)
markdown_content = convert_document_to_markdown("document.pdf")

# Convert a scanned document with OCR
scanned_content = convert_scanned_document_to_markdown("scanned_document.pdf")

# Process multiple documents
results = convert_documents_batch(["doc1.pdf", "doc2.pdf"])

# DOCX/text/CSV/Markdown are handled locally (no API key needed)
docx_md = convert_document_to_markdown("report.docx")
notes_md = convert_document_to_markdown("notes.md")

# Images go through Gemini OCR (needs GOOGLE_API_KEY)
image_md = convert_document_to_markdown("whiteboard.png")

Codebase Analysis

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview
)

# Export entire codebase to Markdown
output_file = export_codebase_to_markdown("./my_project", "codebase_export.md")

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

# Get high-level overview
overview = get_codebase_overview("./my_project")

Document Processing Details

Supported Formats

PDF: Gemini OCR (rasterized to images under the hood)
Images: PNG, JPG/JPEG, TIFF, BMP, GIF, WebP (Gemini OCR)
DOCX: Parsed to Markdown via python-docx (no OCR required)
Text/Markdown/CSV: Read directly with encoding auto-detection (no OCR required)
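
The README does not say how encoding auto-detection works for text inputs. A minimal sketch of one common approach, assuming a fixed list of candidate encodings tried in order (the helper name decode_with_fallback and the candidate list are illustrative, not part of the library):

```python
def decode_with_fallback(data: bytes, encodings=("utf-8-sig", "cp1252")) -> str:
    """Decode bytes by trying candidate encodings in order (illustrative helper)."""
    for encoding in encodings:
        try:
            # "utf-8-sig" accepts UTF-8 with or without a byte-order mark
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail
    return data.decode("latin-1")
```

A dedicated detection library may give better results on ambiguous inputs; the fallback chain above only shows the general idea.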

Processing Presets

Basic Processing

from rag_knowledge_preparation import convert_document_to_markdown

# Basic, lightweight OCR (lower DPI + fewer tokens)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="basic"
)

Standard Document Processing

# Balanced Gemini OCR (default prompt/DPI)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="standard"
)

OCR-Heavy Processing

# Higher DPI and retries for tough scans
content = convert_document_to_markdown(
    "scanned_document.pdf", 
    processing_preset="ocr_heavy"
)

Table-Focused Processing

# Table-aware prompt for documents with dense tabular content
content = convert_document_to_markdown(
    "data_heavy_document.pdf", 
    processing_preset="table_focused"
)

High-Quality Processing

# Maximum quality with highest DPI and token limits
content = convert_document_to_markdown(
    "important_document.pdf", 
    processing_preset="high_quality"
)

Custom Configuration

from rag_knowledge_preparation import convert_document_to_markdown

# Custom configuration
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="standard",
    dpi=350,
    page_selection="1-5,8",
    temperature=0.15,
    max_output_tokens=5000
)
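
The page_selection string above ("1-5,8") suggests comma-separated pages and ranges. The sketch below shows how such a string could be expanded, assuming 1-based page numbers and inclusive ranges; parse_page_selection is a hypothetical helper, and the library's own parsing may differ:

```python
def parse_page_selection(selection: str) -> list[int]:
    """Expand a string such as "1-5,8" into sorted, de-duplicated page numbers.

    Assumes 1-based pages and inclusive ranges (not confirmed by the library).
    """
    pages: set[int] = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers are 1-based")
    return sorted(pages)


print(parse_page_selection("1-5,8"))  # [1, 2, 3, 4, 5, 8]
```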

Batch Processing

from rag_knowledge_preparation import convert_documents_batch, convert_folder_to_markdown

# Process multiple files
results = convert_documents_batch([
    "document1.pdf",
    "document2.pdf"
])

# Process entire folder
folder_results = convert_folder_to_markdown("./documents/")

Working with non-PDF inputs

# Images -> Gemini OCR (needs GOOGLE_API_KEY)
image_markdown = convert_document_to_markdown("whiteboard.png")

# DOCX -> parsed locally, no OCR/API key required
docx_markdown = convert_document_to_markdown("report.docx")

# Text/Markdown/CSV -> pass-through
notes_markdown = convert_document_to_markdown("notes.txt")

Folder/batch helpers (convert_documents_batch, convert_folder_to_markdown) automatically pick up all supported extensions.

Codebase Analysis Usage

Basic Analysis

from rag_knowledge_preparation import analyze_codebase_structure

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

print(f"Total files: {structure['total_files']}")
print(f"Total lines: {structure['total_lines']}")
print(f"Languages: {structure['languages']}")

Export to Markdown

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with default settings
output_file = export_codebase_to_markdown("./my_project")

# Export with custom output file
output_file = export_codebase_to_markdown(
    "./my_project", 
    output_file="my_codebase.md"
)

AI-Powered Analysis

from rag_knowledge_preparation import export_codebase_to_markdown

# Export with AI summaries (requires Gemini API key)
output_file = export_codebase_to_markdown(
    "./my_project",
    gemini_api_key="your-google-api-key",
    gemini_model="gemini-2.5-flash"
)

Codebase Processing Presets

Minimal Processing

from rag_knowledge_preparation import export_codebase_to_markdown

# Minimal processing - basic analysis only
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="minimal"
)

Standard Processing

# Standard processing with full analysis
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="standard"
)

Comprehensive Processing

# Comprehensive processing with all features
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="comprehensive"
)

Configuration Options

from rag_knowledge_preparation import (
    CodebaseProcessingConfig,
    MetadataConfig,
    export_codebase_to_markdown
)

# Custom configuration
config = CodebaseProcessingConfig(
    max_file_size_mb=2.0,
    include_test_files=False,
    include_documentation=True,
    enable_ai_summary=True,
    gemini_api_key="your-api-key",
    custom_ignore_patterns=["*.log", "temp/*"]
)

# Custom metadata configuration
metadata_config = MetadataConfig(
    include_file_path=True,
    include_language=True,
    include_purpose=True,
    include_dependencies=True,
    include_structure=True,
    include_summary=True
)

config.metadata_config = metadata_config

# Use custom configuration
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="standard",  # apply overrides on top of the standard preset
    **config.model_dump()
)

Project-aware metadata & aliases

You can enrich every exported file with project context so downstream RAG systems can ground answers:

from rag_knowledge_preparation import CodebaseProcessingConfig, MetadataConfig

config = CodebaseProcessingConfig(
    project_name="EIC AI Knowledge Utils",
    project_aliases=["EIC-AI", "Knowledge Utils"],
    project_description="Utilities that prep internal knowledge for RAG pipelines.",
    metadata_config=MetadataConfig(
        include_project_description=True,
        include_project_aliases=True,
        include_file_aliases=True
    )
)

The exporter now injects the project name, aliases, optional description, and a set of handy file aliases (for example, Project::path/to/file.py). The Gemini prompt receives this context, yet the summaries stay concise because the Metadata block already lists project and path information.

MetadataConfig ships with four new toggles: include_project_name, include_project_aliases, include_project_description, and include_file_aliases. All default to True except include_project_description, which defaults to False. Disable them if you prefer leaner metadata blocks.
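
To make the Project::path/to/file.py alias format concrete, here is a small sketch of how such aliases could be derived. It assumes one alias per project name or project alias; build_file_aliases is a hypothetical helper, and the exporter may generate a different set:

```python
def build_file_aliases(relative_path: str, project_name: str, project_aliases=()) -> list[str]:
    """Combine each project name/alias with a file path (illustrative only)."""
    names = [project_name, *project_aliases]
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(f"{name}::{relative_path}" for name in names))


print(build_file_aliases("src/main.py", "EIC AI Knowledge Utils", ["EIC-AI"]))
# ['EIC AI Knowledge Utils::src/main.py', 'EIC-AI::src/main.py']
```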

Advanced Features

Language Detection and Classification

The library automatically detects programming languages and classifies files by purpose:

from rag_knowledge_preparation.codebase_processing.analysis import (
    get_language_from_extension,
    classify_file_by_purpose
)

# Detect language from file extension
language = get_language_from_extension("script.py")  # Returns "python"

# Classify file by purpose
purpose = classify_file_by_purpose("test_utils.py")  # Returns "Tests"

Dependency Analysis

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import analyze_file_dependencies

# Analyze dependencies in a Python file
with open("main.py", "r") as f:
    content = f.read()
dependencies = analyze_file_dependencies(content, Path("main.py"), "python")

print("External packages:", dependencies["external_packages"])
print("Standard library:", dependencies["standard_library"])
print("Internal modules:", dependencies["internal_modules"])

Code Structure Extraction

from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import extract_code_structure

# Extract structure from code file
code_content = """
class MyClass:
    def __init__(self):
        pass
    
    def method(self):
        pass
"""
structure = extract_code_structure(Path("example.py"), "python", code_content)

print("Classes:", structure["classes"])
print("Functions:", structure["functions"])
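
The Acknowledgments credit Tree-sitter for code parsing. As a rough stand-in for Python files only, the same kind of outline can be sketched with the built-in ast module; outline_python_source is a hypothetical helper, and its return shape is an assumption rather than the library's actual schema:

```python
import ast


def outline_python_source(source: str) -> dict[str, list]:
    """List top-level classes (with method names) and top-level functions."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    classes, functions = [], []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            methods = [item.name for item in node.body if isinstance(item, function_types)]
            classes.append({"name": node.name, "methods": methods})
        elif isinstance(node, function_types):
            functions.append(node.name)
    return {"classes": classes, "functions": functions}
```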

Token Estimation

from rag_knowledge_preparation.codebase_processing.analysis import estimate_token_count

# Estimate tokens in text
token_count = estimate_token_count("Hello, world!")
print(f"Estimated tokens: {token_count}")

# Estimate tokens in code
code_tokens = estimate_token_count("""
def hello():
    print("Hello, world!")
""")

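The README does not describe how estimate_token_count arrives at its number. A common rough heuristic, shown here purely as an assumption, is about four characters per token for English text; rough_token_estimate is a hypothetical helper and not the library's algorithm:

```python
import math


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate a token count from character length (heuristic only)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(rough_token_estimate("Hello, world!"))  # 13 characters -> 4
```

Counts from a real tokenizer will differ, especially for code and non-English text, so treat the result as a budget guide rather than an exact figure.
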
Configuration Reference

Document Processing Configuration

Parameter	Type	Default	Description
`model_name`	str	"gemini-2.5-pro"	Gemini multimodal model for OCR
`prompt`	str	Markdown prompt	Per-page OCR extraction prompt
`temperature`	float	0.2	Model temperature
`max_output_tokens`	int	4096	Max tokens per page generation
`dpi`	int	300	DPI used when rasterizing PDFs
`page_selection`	Optional[str]	None	Page ranges, e.g. `"1-3,5"`
`parallel_concurrency`	int	5	Pages processed concurrently
`max_retries`	int	4	Retry attempts for transient errors
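
The parallel_concurrency and max_retries settings describe a familiar pattern: a bounded number of pages in flight, each retried on transient failure. A minimal asyncio sketch of that pattern follows. TransientError, process_pages, and the exponential backoff are assumptions for illustration, and whether max_retries counts retries or total attempts in the library is not stated:

```python
import asyncio


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (rate limits, timeouts)."""


async def process_pages(pages, worker, concurrency=5, max_retries=4, base_delay=0.5):
    """Run worker(page) for every page with bounded concurrency and retries."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(page):
        for attempt in range(max_retries + 1):
            try:
                async with semaphore:  # at most `concurrency` workers run at once
                    return await worker(page)
            except TransientError:
                if attempt == max_retries:
                    raise  # retries exhausted, surface the error
                # Exponential backoff, sleeping outside the semaphore
                await asyncio.sleep(base_delay * 2 ** attempt)

    # gather() returns results in the same order as `pages`
    return await asyncio.gather(*(run_one(page) for page in pages))
```

Call it with asyncio.run(process_pages(page_numbers, ocr_page)), where ocr_page is your own coroutine that performs the per-page request.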

Codebase Processing Configuration

Parameter	Type	Default	Description
`max_file_size_mb`	float	1.0	Maximum file size to process
`include_hidden_files`	bool	False	Include hidden files
`include_test_files`	bool	True	Include test files
`include_documentation`	bool	True	Include documentation files
`include_config_files`	bool	True	Include configuration files
`include_static_assets`	bool	False	Include binary/static assets (images, fonts, etc.)
`enable_structure_analysis`	bool	True	Enable code structure analysis
`enable_ai_summary`	bool	True	Enable AI-powered summaries
`gemini_api_key`	str	None	Google Gemini API key
`gemini_model`	str	"gemini-2.5-flash"	Gemini model to use
`custom_ignore_patterns`	List[str]	None	Custom ignore patterns
`project_name`	Optional[str]	None	Override for the primary project name used in summaries and metadata
`project_aliases`	List[str]	[]	Additional aliases that will also be emitted in metadata
`project_description`	Optional[str]	None	Short description included in metadata when enabled
`duplicate_tracker`	Optional[DuplicateTracker]	None	Track duplicate files across multiple exports
`duplicate_content_strategy`	`"full"`/`"link"`	"full"	`"full"` embeds every file's content; `"link"` replaces repeated file content with a link to the shared report
`exclude_directories`	List[str]	None	Directory names to skip entirely (case-insensitive)
`exclude_file_extensions`	List[str]	None	Extensions (e.g. `.log`) to skip
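
To illustrate how the exclusion options in this table might combine, the sketch below applies them to a single relative path. should_skip is a hypothetical helper. The case-insensitive directory match follows the table, while the glob semantics for custom_ignore_patterns and the definition of "hidden" are assumptions:

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def should_skip(
    relative_path: str,
    *,
    exclude_directories=(),
    exclude_file_extensions=(),
    custom_ignore_patterns=(),
    include_hidden_files=False,
) -> bool:
    """Return True when a file should be left out of the export (illustrative)."""
    path = PurePosixPath(relative_path)

    # Directory names are compared case-insensitively, as the table states
    excluded_dirs = {name.lower() for name in exclude_directories}
    if any(part.lower() in excluded_dirs for part in path.parts[:-1]):
        return True

    if path.suffix.lower() in {ext.lower() for ext in exclude_file_extensions}:
        return True

    # Assumption: "hidden" means any path component starting with a dot
    if not include_hidden_files and any(part.startswith(".") for part in path.parts):
        return True

    # Assumption: patterns are shell-style globs tested against path and file name
    return any(
        fnmatchcase(path.as_posix(), pattern) or fnmatchcase(path.name, pattern)
        for pattern in custom_ignore_patterns
    )
```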

Codebase Presets

Preset	Description	Key Differences
`minimal`	Focus on essential source files only	Skips tests/docs/configs, disables AI summaries, emits only file path + language metadata
`standard`	Balanced default for most repos	Includes tests/docs/configs, AI summaries enabled, metadata covers project/file aliases and structure
`comprehensive`	Deep dive for large audits	Higher size limit (5 MB), hidden files allowed, emits every metadata field (dates, encoding, git info, etc.)

Use processing_preset="<name>" when calling export_codebase_to_markdown. You can still override any field via **CodebaseProcessingConfig(...).model_dump() if a preset needs tweaks.

Tracking shared files

You can collect repeated scripts/configs that appear across many projects using DuplicateTracker and optionally replace duplicated content with shared references:

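from pathlib import Path

from rag_knowledge_preparation import export_codebase_to_markdown, CodebaseProcessingConfig
from rag_knowledge_preparation.codebase_processing.utils import DuplicateTracker

tracker = DuplicateTracker(min_occurrences=2)

# Placeholder project roots; replace with your own directories
projects = [Path("./project_a"), Path("./project_b")]

for project in projects: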
    export_codebase_to_markdown(
        project,
        output_file=f"{project.name}.md",
        processing_preset="standard",
        duplicate_tracker=tracker,
        duplicate_content_strategy="link",  # swap file bodies with references to shared digest
    )

if tracker.has_duplicates():
    tracker.write_markdown_report("shared_files.md")

The generated shared_files.md lists every repeated file, its digest, language, and all project locations, so you can link to a single canonical snippet instead of duplicating boilerplate. Duplicate detection uses a raw hash of the file contents (byte-for-byte match). Set duplicate_content_strategy="link" in CodebaseProcessingConfig to replace duplicate files in the per-project exports with a short sentence that points to the shared digest instead of embedding their full content.
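
The byte-for-byte matching described above can be pictured as grouping files by a content digest. The sketch below assumes SHA-256 (the README only says "raw hash") and works on an in-memory mapping; find_shared_files is a hypothetical helper, not the DuplicateTracker API:

```python
import hashlib
from collections import defaultdict


def find_shared_files(files: dict[str, bytes], min_occurrences: int = 2) -> dict[str, list[str]]:
    """Group paths by the digest of their raw bytes and keep the repeated ones."""
    locations: dict[str, list[str]] = defaultdict(list)
    for path, content in files.items():
        # Identical bytes give identical digests; any difference changes the digest
        locations[hashlib.sha256(content).hexdigest()].append(path)
    return {
        digest: sorted(paths)
        for digest, paths in locations.items()
        if len(paths) >= min_occurrences
    }
```

Because the comparison is on raw bytes, two files that differ only in line endings or trailing whitespace count as distinct.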

Error Handling

The library provides comprehensive error handling with custom exceptions:

from rag_knowledge_preparation import (
    RAGKnowledgePreparationError,
    DocumentNotFoundError,
    ConfigurationError,
    ConversionError,
    UnsupportedFormatError,
    convert_document_to_markdown  # needed by the call below
)

try:
    content = convert_document_to_markdown("nonexistent.pdf")
except DocumentNotFoundError as e:
    print(f"Document not found: {e}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except ConfigurationError as e:
    print(f"Configuration error: {e}")

Performance Considerations

Large File Processing

The library includes built-in optimizations for large files:

File Size Limits: Configurable maximum file size limits
Memory Efficiency: Streaming processing for large documents
Batch Processing: Efficient processing of multiple files
Parallel Processing: Concurrent processing where possible

Performance Modes

from rag_knowledge_preparation import CodebaseProcessingConfig

# Use performance-optimized settings
config = CodebaseProcessingConfig(
    max_file_size_mb=0.5,  # Smaller file limit
    enable_ai_summary=False,  # Disable AI for speed
    enable_structure_analysis=False  # Disable structure analysis
)

Examples

Complete Document Processing Pipeline

from rag_knowledge_preparation import (
    convert_folder_to_markdown,
    list_document_configs
)

# List available configurations
configs = list_document_configs()
print("Available configurations:", list(configs.keys()))

# Process entire document folder
results = convert_folder_to_markdown(
    "./documents/",
    processing_preset="high_quality"
)

# Save results
for file_path, content in results.items():
    output_path = f"processed_{file_path.split('/')[-1]}.md"
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)

Complete Codebase Analysis Pipeline

from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview,
    list_available_codebase_configs
)

# List available configurations
configs = list_available_codebase_configs()
print("Available configurations:", list(configs.keys()))

# Get overview
overview = get_codebase_overview("./my_project")
print(f"Project: {overview['name']}")
print(f"Files: {overview['total_files']}")
print(f"Languages: {overview['languages']}")

# Analyze structure
structure = analyze_codebase_structure("./my_project")
print(f"Structure analysis complete: {structure['total_files']} files processed")

# Export to Markdown
output_file = export_codebase_to_markdown(
    "./my_project",
    output_file="project_analysis.md",
    gemini_api_key="your-api-key"
)
print(f"Exported to: {output_file}")

Acknowledgments

Gemini OCR stack powered by LangChain Google Gemini
Tree-sitter for code parsing
Google Gemini for AI-powered summarization
Pygments for syntax highlighting and language detection

Changelog

Version 1.0.0

Initial release
Document processing with OCR support
Codebase analysis and export
AI-powered summarization
Comprehensive configuration options
Multi-language support

Project details

Release history

1.0.3 (this version): Feb 23, 2026
1.0.2: Dec 1, 2025
1.0.1: Nov 18, 2025
1.0.0: Oct 21, 2025

Download files

Source Distribution

rag_knowledge_preparation-1.0.3.tar.gz (48.8 kB)

Uploaded Feb 23, 2026 Source

Built Distribution

rag_knowledge_preparation-1.0.3-py3-none-any.whl (56.1 kB)

Uploaded Feb 23, 2026 Python 3

File details

Details for the file rag_knowledge_preparation-1.0.3.tar.gz.

File metadata

Download URL: rag_knowledge_preparation-1.0.3.tar.gz
Upload date: Feb 23, 2026
Size: 48.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for rag_knowledge_preparation-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`3efd2f926872051e8e8e0bb3ce2038b4f0471b2ab23a1a5c2b3f5c5e28137d52`
MD5	`5eadf8561a5e2e7f03ebbf585f57d866`
BLAKE2b-256	`3aeddb16b942358a33898feba4bb24a08dc09dc2a1de9207b2b89a7ec00adc64`

File details

Details for the file rag_knowledge_preparation-1.0.3-py3-none-any.whl.

File metadata

Download URL: rag_knowledge_preparation-1.0.3-py3-none-any.whl
Upload date: Feb 23, 2026
Size: 56.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for rag_knowledge_preparation-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5167bd20913709fde3c7561b5a27c85552ea7a479ba4c056910b25fc08143f0a`
MD5	`ee5bb716b0d5f08703f9e1bd053a0019`
BLAKE2b-256	`156d2fd6887ae44b56e5dab2111a68a1c47d815787a5d83055ee0a772d39b414`
