RAG Knowledge Preparation in Python
A comprehensive Python library for preparing knowledge bases for Retrieval-Augmented Generation (RAG) systems. This library now focuses on Gemini OCR-based PDF-to-Markdown conversion alongside intelligent codebase analysis.
Features
Document Processing Features
- Multi-format Intake: PDF, images (PNG/JPG/TIFF/BMP/WebP), DOCX, and text/Markdown/CSV
- Gemini OCR: Convert PDFs/images to Markdown via Gemini 2.5 Pro multimodal
- Strict Markdown Output: Page-by-page extraction with a table-aware prompt
- Async & Parallel: Concurrency controls for multi-page PDFs
- Batch Processing: Process multiple PDFs or entire folders efficiently
- Configurable Quality: Presets for fast, table-focused, or high-quality OCR
Codebase Analysis Features
- Comprehensive Analysis: Extract structure, dependencies, and metadata from codebases
- Multi-language Support: Python, JavaScript, TypeScript and more
- AI-Powered Summaries: Generate intelligent code summaries using Google Gemini
- Project-aware Metadata: Capture project names, aliases, and file aliases for precise RAG context
- Dependency Analysis: Identify and categorize internal, external, and standard library dependencies
- Structure Extraction: Parse classes, functions, imports, and code organization
- Token Estimation: Approximate token counts for RAG optimization
Configuration & Customization
- Flexible Configuration: Extensive configuration options for both document and codebase processing
- Preset Configurations: Pre-built configurations for common use cases
- Custom Metadata: Configurable metadata fields for different analysis needs
- Performance Optimization: Built-in performance modes for large-scale processing
Installation
Prerequisites
- Poppler (for pdf2image): brew install poppler (macOS) or sudo apt-get install -y poppler-utils (Linux)
- Gemini API key: set the GOOGLE_API_KEY environment variable (needed only for OCR on PDF/images)
pip install rag-knowledge-preparation-python
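Before running OCR it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the library: `missing_prerequisites` is a hypothetical helper name, and it assumes Poppler is detectable through its `pdftoppm` binary on PATH (one of the binaries pdf2image calls).

```python
import os
import shutil


def missing_prerequisites(environ=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready for OCR."""
    environ = os.environ if environ is None else environ
    problems = []
    # Assumption: Poppler is installed if its pdftoppm binary is on PATH.
    if which("pdftoppm") is None:
        problems.append("Poppler not found on PATH (pdftoppm missing)")
    # The key is only needed for PDF/image OCR, not for DOCX or text inputs.
    if not environ.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing:", problem)
```

The `environ` and `which` parameters exist only so the check can be exercised without touching the real environment.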
Development Installation
git clone
cd rag-knowledge-preparation-python
pip install -e ".[dev]"
Quick Start
Document Processing
from rag_knowledge_preparation import (
    convert_document_to_markdown,
    convert_scanned_document_to_markdown,
    convert_documents_batch
)
# Convert a single document (GOOGLE_API_KEY env var must be set)
markdown_content = convert_document_to_markdown("document.pdf")
# Convert a scanned document with OCR
scanned_content = convert_scanned_document_to_markdown("scanned_document.pdf")
# Process multiple documents
results = convert_documents_batch(["doc1.pdf", "doc2.pdf"])
# DOCX/text/CSV/Markdown are handled locally (no API key needed)
docx_md = convert_document_to_markdown("report.docx")
notes_md = convert_document_to_markdown("notes.md")
# Images go through Gemini OCR (needs GOOGLE_API_KEY)
image_md = convert_document_to_markdown("whiteboard.png")
Codebase Analysis
from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview
)
# Export entire codebase to Markdown
output_file = export_codebase_to_markdown("./my_project", "codebase_export.md")
# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")
# Get high-level overview
overview = get_codebase_overview("./my_project")
Document Processing Details
Supported Formats
- PDF: Gemini OCR (rasterized to images under the hood)
- Images: PNG, JPG/JPEG, TIFF, BMP, GIF, WebP (Gemini OCR)
- DOCX: Parsed to Markdown via python-docx (no OCR required)
- Text/Markdown/CSV: Read directly with encoding auto-detection (no OCR required)
Processing Presets
Basic Processing
from rag_knowledge_preparation import convert_document_to_markdown
# Basic, lightweight OCR (lower DPI + fewer tokens)
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="basic"
)
Standard Document Processing
# Balanced Gemini OCR (default prompt/DPI)
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="standard"
)
OCR-Heavy Processing
# Higher DPI and retries for tough scans
content = convert_document_to_markdown(
    "scanned_document.pdf",
    processing_preset="ocr_heavy"
)
Table-Focused Processing
# Table-aware prompt for documents with dense tabular content
content = convert_document_to_markdown(
    "data_heavy_document.pdf",
    processing_preset="table_focused"
)
High-Quality Processing
# Maximum quality with highest DPI and token limits
content = convert_document_to_markdown(
    "important_document.pdf",
    processing_preset="high_quality"
)
Custom Configuration
from rag_knowledge_preparation import convert_document_to_markdown
# Start from the standard preset and override individual settings
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="standard",
    dpi=350,
    page_selection="1-5,8",
    temperature=0.15,
    max_output_tokens=5000
)
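The page_selection string above combines ranges and single pages. As an illustration of how such a string can be expanded, here is a small sketch; `parse_page_selection` is a hypothetical helper, and it assumes 1-based page numbers with inclusive ranges. The library's own parsing may differ.

```python
def parse_page_selection(selection):
    """Expand a string such as "1-5,8" into a sorted list of page numbers.

    Assumes 1-based pages and inclusive ranges (an assumption, not
    confirmed library behaviour).
    """
    pages = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1-3,,5"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)
```

Under these assumptions, "1-5,8" selects six pages: 1 through 5, plus 8.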
Batch Processing
from rag_knowledge_preparation import convert_documents_batch, convert_folder_to_markdown
# Process multiple files
results = convert_documents_batch([
    "document1.pdf",
    "document2.pdf"
])
# Process entire folder
folder_results = convert_folder_to_markdown("./documents/")
Working with non-PDF inputs
# Images -> Gemini OCR (needs GOOGLE_API_KEY)
image_markdown = convert_document_to_markdown("whiteboard.png")
# DOCX -> parsed locally, no OCR/API key required
docx_markdown = convert_document_to_markdown("report.docx")
# Text/Markdown/CSV -> pass-through
notes_markdown = convert_document_to_markdown("notes.txt")
Folder/batch helpers (convert_documents_batch, convert_folder_to_markdown) automatically pick up all supported extensions.
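To make the routing concrete, the sketch below sorts files into an OCR path and a local path by extension. It is illustrative only: the extension sets are an assumption based on the Supported Formats list, `route_for` and `collect_supported` are hypothetical names, and whether the library's folder helper recurses into subfolders is not stated in this README.

```python
from pathlib import Path

# Assumed from the Supported Formats list; the library's real sets may differ.
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif", ".webp"}
LOCAL_EXTENSIONS = {".docx", ".txt", ".md", ".csv"}


def route_for(path):
    """Return "ocr" (needs GOOGLE_API_KEY), "local", or "unsupported"."""
    suffix = Path(path).suffix.lower()
    if suffix in OCR_EXTENSIONS:
        return "ocr"
    if suffix in LOCAL_EXTENSIONS:
        return "local"
    return "unsupported"


def collect_supported(folder):
    """List supported files under a folder (recursion is an assumption)."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and route_for(p) != "unsupported"
    )
```

A check like `route_for(path) == "ocr"` is a cheap way to decide up front whether a batch needs an API key at all.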
Codebase Analysis Usage
Basic Analysis
from rag_knowledge_preparation import analyze_codebase_structure
# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")
print(f"Total files: {structure['total_files']}")
print(f"Total lines: {structure['total_lines']}")
print(f"Languages: {structure['languages']}")
Export to Markdown
from rag_knowledge_preparation import export_codebase_to_markdown
# Export with default settings
output_file = export_codebase_to_markdown("./my_project")
# Export with custom output file
output_file = export_codebase_to_markdown(
    "./my_project",
    output_file="my_codebase.md"
)
AI-Powered Analysis
from rag_knowledge_preparation import export_codebase_to_markdown
# Export with AI summaries (requires Gemini API key)
output_file = export_codebase_to_markdown(
    "./my_project",
    gemini_api_key="your-google-api-key",
    gemini_model="gemini-2.5-flash"
)
Codebase Processing Presets
Minimal Processing
from rag_knowledge_preparation import export_codebase_to_markdown
# Minimal processing - basic analysis only
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="minimal"
)
Standard Processing
# Standard processing with full analysis
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="standard"
)
Comprehensive Processing
# Comprehensive processing with all features
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="comprehensive"
)
Configuration Options
from rag_knowledge_preparation import (
    CodebaseProcessingConfig,
    MetadataConfig,
    export_codebase_to_markdown
)
# Custom configuration
config = CodebaseProcessingConfig(
    max_file_size_mb=2.0,
    include_test_files=False,
    include_documentation=True,
    enable_ai_summary=True,
    gemini_api_key="your-api-key",
    custom_ignore_patterns=["*.log", "temp/*"]
)
# Custom metadata configuration
metadata_config = MetadataConfig(
    include_file_path=True,
    include_language=True,
    include_purpose=True,
    include_dependencies=True,
    include_structure=True,
    include_summary=True
)
config.metadata_config = metadata_config
# Use custom configuration
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="standard",  # apply overrides on top of the standard preset
    **config.model_dump()
)
Project-aware metadata & aliases
You can enrich every exported file with project context so downstream RAG systems can ground answers:
from rag_knowledge_preparation import CodebaseProcessingConfig, MetadataConfig
config = CodebaseProcessingConfig(
    project_name="EIC AI Knowledge Utils",
    project_aliases=["EIC-AI", "Knowledge Utils"],
    project_description="Utilities that prep internal knowledge for RAG pipelines.",
    metadata_config=MetadataConfig(
        include_project_description=True,
        include_project_aliases=True,
        include_file_aliases=True
    )
)
The exporter now injects the project name, aliases, optional description, and a set of handy file aliases (for example, Project::path/to/file.py). The Gemini prompt receives this context, yet the summaries stay concise because the Metadata block already lists project and path information.
MetadataConfig ships with four new toggles: include_project_name, include_project_aliases, include_project_description, and include_file_aliases. All default to True except include_project_description, which defaults to False. Disable them if you prefer leaner metadata blocks.
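To show what the Project::path/to/file.py alias form looks like in practice, here is a minimal sketch. `build_file_aliases` is a hypothetical helper, and generating one alias per project alias is an assumption; the exporter's actual alias set may be different.

```python
def build_file_aliases(project_name, project_aliases, relative_path):
    """Build "<name>::<path>" aliases for a file, de-duplicated in order.

    Assumption: one alias per project name/alias; the exporter may emit
    a different set.
    """
    aliases = []
    seen = set()
    for name in [project_name, *project_aliases]:
        alias = f"{name}::{relative_path}"
        if alias not in seen:
            seen.add(alias)
            aliases.append(alias)
    return aliases


if __name__ == "__main__":
    for alias in build_file_aliases(
        "EIC AI Knowledge Utils", ["EIC-AI", "Knowledge Utils"], "src/app.py"
    ):
        print(alias)
```

Aliases of this kind give a retriever several exact strings to match when a question names the project informally.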
Advanced Features
Language Detection and Classification
The library automatically detects programming languages and classifies files by purpose:
from rag_knowledge_preparation.codebase_processing.analysis import (
    get_language_from_extension,
    classify_file_by_purpose
)
# Detect language from file extension
language = get_language_from_extension("script.py") # Returns "python"
# Classify file by purpose
purpose = classify_file_by_purpose("test_utils.py") # Returns "Tests"
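For intuition, extension-based detection and purpose classification can be approximated with simple lookups. The sketch below is not the library's implementation: the lookup table, the function names, and the "Documentation" and "Source" labels are hypothetical; only "python" and "Tests" come from the example above.

```python
from pathlib import PurePosixPath

# Hypothetical lookup table; a real detector would cover many more languages.
LANGUAGE_BY_EXTENSION = {".py": "python", ".js": "javascript", ".ts": "typescript"}


def guess_language(filename):
    """Map a file extension to a language name, or "unknown"."""
    suffix = PurePosixPath(filename).suffix.lower()
    return LANGUAGE_BY_EXTENSION.get(suffix, "unknown")


def guess_purpose(filename):
    """Classify a file by naming convention (labels other than "Tests" are assumed)."""
    path = PurePosixPath(filename)
    name = path.name.lower()
    if name.startswith("test_") or name.endswith("_test.py") or "tests" in path.parts:
        return "Tests"
    if path.suffix.lower() in {".md", ".rst"}:
        return "Documentation"
    return "Source"
```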
Dependency Analysis
from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import analyze_file_dependencies
# Analyze dependencies in a Python file
with open("main.py", "r", encoding="utf-8") as f:
    content = f.read()
dependencies = analyze_file_dependencies(content, Path("main.py"), "python")
print("External packages:", dependencies["external_packages"])
print("Standard library:", dependencies["standard_library"])
print("Internal modules:", dependencies["internal_modules"])
Code Structure Extraction
from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import extract_code_structure
# Extract structure from code file
code_content = """
class MyClass:
    def __init__(self):
        pass
    def method(self):
        pass
"""
structure = extract_code_structure(Path("example.py"), "python", code_content)
print("Classes:", structure["classes"])
print("Functions:", structure["functions"])
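For Python sources, a comparable outline can be produced with the built-in `ast` module. This is a stand-in sketch: the Acknowledgments credit Tree-sitter for parsing, whereas this example swaps in `ast`, and the returned shape is an assumption rather than the format extract_code_structure produces.

```python
import ast


def outline(source):
    """Return top-level classes (with method names) and functions."""
    tree = ast.parse(source)
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    classes, functions = [], []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            methods = [item.name for item in node.body if isinstance(item, function_types)]
            classes.append({"name": node.name, "methods": methods})
        elif isinstance(node, function_types):
            functions.append(node.name)
    return {"classes": classes, "functions": functions}


if __name__ == "__main__":
    sample = (
        "class MyClass:\n"
        "    def __init__(self):\n"
        "        pass\n"
        "    def method(self):\n"
        "        pass\n"
    )
    print(outline(sample))
```

Only top-level definitions are listed; nested functions and nested classes are ignored in this sketch.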
Token Estimation
from rag_knowledge_preparation.codebase_processing.analysis import estimate_token_count
# Estimate tokens in text
token_count = estimate_token_count("Hello, world!")
print(f"Estimated tokens: {token_count}")
# Estimate tokens in code
code_tokens = estimate_token_count("""
def hello():
    print("Hello, world!")
""")
Configuration Reference
Document Processing Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "gemini-2.5-pro" | Gemini multimodal model for OCR |
| prompt | str | Markdown prompt | Per-page OCR extraction prompt |
| temperature | float | 0.2 | Model temperature |
| max_output_tokens | int | 4096 | Max tokens per page generation |
| dpi | int | 300 | DPI used when rasterizing PDFs |
| page_selection | Optional[str] | None | Page ranges, e.g. "1-3,5" |
| parallel_concurrency | int | 5 | Pages processed concurrently |
| max_retries | int | 4 | Retry attempts for transient errors |
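The parallel_concurrency and max_retries settings describe a common pattern: bound the number of in-flight page requests and retry transient failures with backoff. The sketch below shows that pattern with `asyncio`; it is not the library's code. `process_pages`, `TransientError`, and `base_delay` are hypothetical, and it assumes max_retries counts retries after the first attempt.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


async def process_pages(pages, ocr_page, concurrency=5, max_retries=4, base_delay=0.5):
    """Run ocr_page(page) for every page, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(page):
        for attempt in range(max_retries + 1):
            try:
                # Only the call itself holds a slot; backoff sleeps do not.
                async with semaphore:
                    return await ocr_page(page)
            except TransientError:
                if attempt == max_retries:
                    raise
                # Exponential backoff with a little jitter before the next try.
                delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
                await asyncio.sleep(delay)

    # gather() returns results in input order, so pages stay in sequence.
    return await asyncio.gather(*(run_one(page) for page in pages))
```

Call it with `asyncio.run(process_pages(pages, ocr_page))`, where `ocr_page` is any coroutine function that takes a page and returns its Markdown. Raising concurrency speeds up long PDFs but makes rate-limit errors more likely, which is where the retry budget matters.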
Codebase Processing Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_file_size_mb | float | 1.0 | Maximum file size to process, in MB |
| include_hidden_files | bool | False | Include hidden files |
| include_test_files | bool | True | Include test files |
| include_documentation | bool | True | Include documentation files |
| include_config_files | bool | True | Include configuration files |
| include_static_assets | bool | False | Include binary/static assets (images, fonts, etc.) |
| enable_structure_analysis | bool | True | Enable code structure analysis |
| enable_ai_summary | bool | True | Enable AI-powered summaries |
| gemini_api_key | str | None | Google Gemini API key |
| gemini_model | str | "gemini-2.5-flash" | Gemini model to use |
| custom_ignore_patterns | List[str] | None | Custom ignore patterns |
| project_name | Optional[str] | None | Override for the primary project name used in summaries and metadata |
| project_aliases | List[str] | [] | Additional aliases that will also be emitted in metadata |
| project_description | Optional[str] | None | Short description included in metadata when enabled |
| duplicate_tracker | Optional[DuplicateTracker] | None | Track duplicate files across multiple exports |
| duplicate_content_strategy | "full"/"link" | "full" | Set to "link" to replace repeated file content with a link to the shared report |
| exclude_directories | List[str] | None | Directory names to skip entirely (case-insensitive) |
| exclude_file_extensions | List[str] | None | Extensions (e.g. .log) to skip |
Codebase Presets
| Preset | Description | Key Differences |
|---|---|---|
| minimal | Focus on essential source files only | Skips tests/docs/configs, disables AI summaries, emits only file path + language metadata |
| standard | Balanced default for most repos | Includes tests/docs/configs, AI summaries enabled, metadata covers project/file aliases and structure |
| comprehensive | Deep dive for large audits | Higher size limit (5 MB), hidden files allowed, emits every metadata field (dates, encoding, git info, etc.) |
Use processing_preset="<name>" when calling export_codebase_to_markdown. You can still override any field via **CodebaseProcessingConfig(...).model_dump() if a preset needs tweaks.
Tracking shared files
You can collect repeated scripts/configs that appear across many projects using DuplicateTracker and optionally replace duplicated content with shared references:
from pathlib import Path
from rag_knowledge_preparation import export_codebase_to_markdown, CodebaseProcessingConfig
from rag_knowledge_preparation.codebase_processing.utils import DuplicateTracker
# Example project directories; replace with your own paths
projects = [Path("./project_a"), Path("./project_b")]
tracker = DuplicateTracker(min_occurrences=2)
for project in projects:
    export_codebase_to_markdown(
        project,
        output_file=f"{project.name}.md",
        processing_preset="standard",
        duplicate_tracker=tracker,
        duplicate_content_strategy="link",  # swap file bodies with references to shared digest
    )
if tracker.has_duplicates():
    tracker.write_markdown_report("shared_files.md")
The generated shared_files.md lists every repeated file, its digest, language, and all project locations, so you can link to a single canonical snippet instead of duplicating boilerplate. Duplicate detection uses a raw hash of the file contents (byte-for-byte match). Set duplicate_content_strategy="link" in CodebaseProcessingConfig to replace duplicate files in the per-project exports with a short sentence that points to the shared digest instead of embedding their full content.
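Byte-for-byte duplicate detection can be sketched with `hashlib`. The README does not name the hash algorithm, so SHA-256 here is an assumption, and `group_duplicates` is a hypothetical helper rather than part of DuplicateTracker.

```python
import hashlib
from collections import defaultdict


def group_duplicates(contents, min_occurrences=2):
    """Group paths whose raw bytes are identical.

    `contents` maps a path label to the file's bytes. Returns a dict of
    hex digest -> sorted paths, keeping only digests seen at least
    `min_occurrences` times. SHA-256 is an assumed choice of hash.
    """
    by_digest = defaultdict(list)
    for path, data in contents.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(path)
    return {
        digest: sorted(paths)
        for digest, paths in by_digest.items()
        if len(paths) >= min_occurrences
    }
```

Because the hash covers raw bytes, files that differ only in whitespace or line endings count as different files. To feed it real files, build the mapping with `Path.read_bytes()` for each path.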
Error Handling
The library provides comprehensive error handling with custom exceptions:
from rag_knowledge_preparation import (
    convert_document_to_markdown,
    RAGKnowledgePreparationError,
    DocumentNotFoundError,
    ConfigurationError,
    ConversionError,
    UnsupportedFormatError
)
try:
    content = convert_document_to_markdown("nonexistent.pdf")
except DocumentNotFoundError as e:
    print(f"Document not found: {e}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except ConfigurationError as e:
    print(f"Configuration error: {e}")
Performance Considerations
Large File Processing
The library includes built-in optimizations for large files:
- File Size Limits: Configurable maximum file size limits
- Memory Efficiency: Streaming processing for large documents
- Batch Processing: Efficient processing of multiple files
- Parallel Processing: Concurrent processing where possible
Performance Modes
from rag_knowledge_preparation import CodebaseProcessingConfig
# Use performance-optimized settings
config = CodebaseProcessingConfig(
    max_file_size_mb=0.5,  # Smaller file limit
    enable_ai_summary=False,  # Disable AI for speed
    enable_structure_analysis=False  # Disable structure analysis
)
Examples
Complete Document Processing Pipeline
from rag_knowledge_preparation import (
    convert_folder_to_markdown,
    list_document_configs
)
# List available configurations
configs = list_document_configs()
print("Available configurations:", list(configs.keys()))
# Process entire document folder
results = convert_folder_to_markdown(
    "./documents/",
    processing_preset="high_quality"
)
# Save results
for file_path, content in results.items():
    output_path = f"processed_{file_path.split('/')[-1]}.md"
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)
Complete Codebase Analysis Pipeline
from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview,
    list_available_codebase_configs
)
# List available configurations
configs = list_available_codebase_configs()
print("Available configurations:", list(configs.keys()))
# Get overview
overview = get_codebase_overview("./my_project")
print(f"Project: {overview['name']}")
print(f"Files: {overview['total_files']}")
print(f"Languages: {overview['languages']}")
# Analyze structure
structure = analyze_codebase_structure("./my_project")
print(f"Structure analysis complete: {structure['total_files']} files processed")
# Export to Markdown
output_file = export_codebase_to_markdown(
    "./my_project",
    output_file="project_analysis.md",
    gemini_api_key="your-api-key"
)
print(f"Exported to: {output_file}")
Acknowledgments
- Gemini OCR stack powered by LangChain Google Gemini
- Tree-sitter for code parsing
- Google Gemini for AI-powered summarization
- Pygments for syntax highlighting and language detection
Changelog
Version 1.0.0
- Initial release
- Document processing with OCR support
- Codebase analysis and export
- AI-powered summarization
- Comprehensive configuration options
- Multi-language support