Convert source code to structured, context-optimized markdown for LLMs with intelligent summarization
src2md
src2md converts source code repositories into structured representations sized to fit the context windows of Large Language Models (LLMs). It addresses the problem of fitting a meaningful codebase into a limited context window while preserving the most important information, using intelligent summarization, AST-based analysis, and optional LLM-powered compression.
🚀 Features
New in v2.0
- 🎯 Context Window Optimization: Intelligently fit codebases into LLM context windows with smart truncation
- 📝 Intelligent Summarization: AST-based code analysis with multiple compression levels
- 🤖 LLM-Powered Compression: Optional OpenAI/Anthropic integration for semantic summarization
- ⚡ Fluent API: Elegant method chaining with new summarization methods
- 📊 File Importance Scoring: Multi-factor analysis to prioritize critical files
- 🪟 Predefined LLM Windows: Built-in support for GPT-4, Claude, and more
- 🔄 Progressive Summarization: Multi-tier compression strategies for different file types
Core Features
- Multiple Output Formats: JSON, JSONL, Markdown, HTML, and plain text
- Smart Token Management: Accurate token counting with tiktoken and structure-aware truncation
- Multi-Language Support: Specialized summarizers for Python, JavaScript, TypeScript, JSON, YAML
- Code Statistics: Automatic generation of project metrics and complexity analysis
- Flexible Filtering: Customizable include/exclude patterns
- Rich CLI Interface: Beautiful progress indicators and colored output
📦 Installation
Install via PyPI using pip:
pip install src2md
🛠️ Usage
Quick Start - Fluent API
from src2md import Repository, ContextWindow
# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()
# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown())
# Full fluent API with all features
result = (Repository("/path/to/project")
    .name("MyProject")
    .branch("main")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .with_summarization(
        compression_ratio=0.3,    # Target 30% of original size
        preserve_important=True,  # Keep critical files intact
        use_llm=True              # Use LLM if available
    )
    .prioritize(["main.py", "core/"])
    .optimize_for_tokens(100_000)  # 100K token limit
    .analyze()
    .to_json(pretty=True))
Command Line Interface
# Basic markdown generation
src2md /path/to/project -o documentation.md
# With context optimization
src2md /path/to/project --gpt4 -o optimized.md
src2md /path/to/project --claude3 --importance
# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3
src2md /path/to/project --summarize-tests --summarize-docs
# With LLM-powered summarization (requires API key)
src2md /path/to/project --use-llm --llm-model gpt-3.5-turbo
# Multiple output formats
src2md /path/to/project --format json --pretty
src2md /path/to/project --format html -o docs.html
Python API Examples
Basic Context Optimization
from src2md import Repository, ContextWindow
# Optimize for different LLM context windows
repo = Repository("./my-project")
output = repo.optimize_for(ContextWindow.CLAUDE_3).analyze().to_markdown()
# Custom token limit with importance scoring
repo = (Repository("./my-project")
    .with_importance_scoring()
    .optimize_for_tokens(50_000)
    .analyze())
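To illustrate how importance scoring and a token limit can interact, here is a minimal sketch of greedy budget packing. It is not src2md's actual algorithm: `FileEntry`, `pack_into_budget`, and the sample scores and token counts are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical record: a file, its token count, and an importance score."""
    path: str
    tokens: int
    importance: float  # higher means more important


def pack_into_budget(files, budget):
    """Greedily keep the most important files that still fit within `budget` tokens."""
    kept, used = [], 0
    for entry in sorted(files, key=lambda f: f.importance, reverse=True):
        if used + entry.tokens <= budget:
            kept.append(entry)
            used += entry.tokens
    return kept, used


# Illustrative data only; real scores and counts would come from the analysis step.
files = [
    FileEntry("src/core/engine.py", 30_000, 0.9),
    FileEntry("tests/test_engine.py", 25_000, 0.3),
    FileEntry("README.md", 5_000, 0.6),
]
kept, used = pack_into_budget(files, 50_000)
print([f.path for f in kept], used)
```

A greedy pass like this is simple and predictable, but it drops files that do not fit rather than summarizing them, which is the gap the summarization options below are meant to fill.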
Intelligent Summarization
# Enable smart summarization with compression
repo = (Repository("./my-project")
    .with_summarization(
        compression_ratio=0.3,    # Compress to 30% of original
        preserve_important=True,  # Keep critical files intact
        use_llm=False             # Use AST-based summarization
    )
    .optimize_for(ContextWindow.GPT_4)
    .analyze())
# Use LLM-powered summarization (requires API key)
import os
os.environ['OPENAI_API_KEY'] = 'your-key-here'
repo = (Repository("./my-project")
    .with_summarization(
        compression_ratio=0.2,  # More aggressive compression
        use_llm=True,
        llm_model="gpt-3.5-turbo"
    )
    .analyze())
Multi-Tier Compression Strategy
# Configure different summarization levels for different file types
repo = (Repository("./my-project")
    .with_importance_scoring()
    .prioritize(["src/core/", "api/"])  # Critical paths
    .summarize_tests()                  # Compress test files
    .summarize_docs()                   # Compress documentation
    .with_summarization(
        compression_ratio=0.25,
        preserve_important=True
    )
    .optimize_for_tokens(100_000)
    .analyze())
# Access summarization metadata
data = repo.to_dict()
for file in data['source_files']:
    if file.get('was_summarized'):
        print(f"Summarized {file['path']}: {file['original_size']} -> {file['size']} bytes")
Generate Multiple Formats
repo = Repository("./my-project").analyze()
markdown = repo.to_markdown()
json_data = repo.to_json()
html_doc = repo.to_html()
# Access raw data
data = repo.to_dict()
print(f"Files: {data['metadata']['file_count']}")
print(f"Token usage: {data['metadata'].get('total_tokens', 0)}")
print(f"Compression achieved: {data['metadata'].get('compression_ratio', 1.0):.1%}")
🎯 Summarization Features
AST-Based Python Summarization
src2md uses Abstract Syntax Tree (AST) analysis to intelligently summarize Python code while preserving structure:
- MINIMAL: Only class/function signatures
- OUTLINE: Signatures with structural hierarchy
- DOCSTRINGS: Signatures plus documentation
- SIGNATURES: Full signatures with type hints
- FULL: No summarization
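The levels above can be approximated with Python's standard `ast` module. The sketch below is an assumption about the general technique, not src2md's implementation: the `summarize` function and its `keep_docstrings` flag are hypothetical, and it only covers a signatures-only view and a signatures-plus-docstrings view.

```python
import ast


def summarize(source: str, keep_docstrings: bool = False) -> str:
    """Return class/function signatures, optionally with the first docstring line."""
    out = []

    def visit(node, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                header = f"{keyword} {child.name}({ast.unparse(child.args)})"
                if child.returns is not None:
                    header += f" -> {ast.unparse(child.returns)}"
            elif isinstance(child, ast.ClassDef):
                header = f"class {child.name}"
            else:
                continue  # skip everything that is not a definition
            pad = "    " * depth
            out.append(f"{pad}{header}:")
            doc = ast.get_docstring(child)
            if keep_docstrings and doc:
                out.append(f'{pad}    """{doc.splitlines()[0]}"""')
            visit(child, depth + 1)  # descend into methods and nested definitions

    visit(ast.parse(source), 0)
    return "\n".join(out)


sample = '''
class Greeter:
    """Says hello."""
    def greet(self, name: str) -> str:
        """Return a greeting."""
        return "Hello, " + name
'''
print(summarize(sample, keep_docstrings=True))
```

Because the summary is rebuilt from the syntax tree rather than cut from the text, function bodies disappear while names, parameters, and type hints survive. `ast.unparse` requires Python 3.9 or later.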
Multi-Language Support
Specialized summarizers for different file types:
- Python: AST-based analysis with import/export preservation
- JavaScript/TypeScript: Function and class extraction
- JSON/YAML: Schema extraction with sample data
- Test Files: Test name and assertion extraction
- Documentation: Heading and key point extraction
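As an illustration of what schema extraction for JSON might look like, the following sketch replaces leaf values with their type names and keeps only the first element of each list as a sample of its shape. `sketch_schema` and `max_items` are hypothetical names; how src2md actually represents schemas is not specified here.

```python
import json


def sketch_schema(value, max_items: int = 1):
    """Replace leaf values with their type names and trim lists to `max_items` entries."""
    if isinstance(value, dict):
        return {key: sketch_schema(item, max_items) for key, item in value.items()}
    if isinstance(value, list):
        return [sketch_schema(item, max_items) for item in value[:max_items]]
    return type(value).__name__


# Illustrative input; any parsed JSON or YAML document has the same shape.
config = json.loads('{"name": "demo", "port": 8080, "tags": ["a", "b", "c"], "debug": false}')
schema = sketch_schema(config)
print(json.dumps(schema))
```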
Smart Truncation
When files must be truncated to fit token limits:
- Preserves code structure (complete functions/classes)
- Maintains syntax validity
- Prioritizes public APIs over private methods
- Keeps imports and exports intact
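One way to meet these goals is to truncate at the level of whole top-level statements instead of characters. The sketch below assumes a character budget for simplicity (a token budget would work the same way) and is not src2md's implementation; `truncate_whole_definitions` is a hypothetical name.

```python
import ast


def truncate_whole_definitions(source: str, max_chars: int) -> str:
    """Keep imports, then whole definitions (public first) while they fit the budget."""
    tree = ast.parse(source)

    def rank(node):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return 0  # imports are always kept
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            return 2 if node.name.startswith("_") else 1  # public before private
        return 3  # other statements come last

    # Note: get_source_segment starts at the `def`/`class` line, so decorators are dropped.
    candidates = [
        (rank(node), index, ast.get_source_segment(source, node))
        for index, node in enumerate(tree.body)
    ]
    kept, used = [], 0
    for priority, index, segment in sorted(candidates):
        cost = len(segment) + 1  # +1 for the joining newline
        if priority == 0 or used + cost <= max_chars:
            kept.append((index, segment))
            used += cost
    # Emit the survivors in their original order so the file still reads top to bottom.
    return "\n".join(segment for _, segment in sorted(kept))


sample = (
    "import os\n\n"
    "def _helper():\n    return os.getcwd()\n\n"
    "def run():\n    return _helper()\n"
)
result = truncate_whole_definitions(sample, max_chars=45)
print(result)
```

Because only complete statements are kept, the output still parses, although it may reference names that were dropped (here `run` still calls `_helper`).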
LLM-Powered Summarization
Optional integration with OpenAI and Anthropic for semantic compression:
# Set API keys
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
# Use LLM summarization
src2md /path/to/project --use-llm --llm-model gpt-3.5-turbo
src2md /path/to/project --use-llm --llm-model claude-3-haiku-20240307
📊 Output Formats
JSON
Structured data perfect for programmatic processing:
{
  "metadata": {
    "project_name": "my-project",
    "generated_at": "2025-01-01T12:00:00",
    "patterns": {...}
  },
  "statistics": {
    "total_files": 42,
    "languages": {"python": {"count": 15, "total_size": 50000}},
    "project_complexity": 3.2
  },
  "documentation": [...],
  "source_files": [...]
}
JSONL
One JSON object per line - perfect for streaming and big data tools:
{"type": "metadata", "data": {...}}
{"type": "statistics", "data": {...}}
{"type": "source_file", "data": {...}}
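Since each line is an independent JSON object, the output can be consumed one record at a time without loading the whole file. The sketch below reads a stream and tallies records by their `type` field; `count_record_types` is a hypothetical helper, and the contents of `data` in the sample are placeholders because the real fields are elided above.

```python
import io
import json
from collections import Counter


def count_record_types(stream) -> Counter:
    """Tally JSONL records by their "type" field, one line at a time."""
    counts = Counter()
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record["type"]] += 1
    return counts


# Placeholder records; a real run would pass an open file instead of StringIO.
sample = io.StringIO(
    '{"type": "metadata", "data": {}}\n'
    '{"type": "source_file", "data": {"path": "a.py"}}\n'
    '{"type": "source_file", "data": {"path": "b.py"}}\n'
)
counts = count_record_types(sample)
print(dict(counts))
```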
HTML
Beautiful, styled documentation ready for the web with syntax highlighting and responsive design.
Markdown
Clean, readable documentation compatible with GitHub, GitLab, and other platforms.
🔧 Advanced Options
File Patterns
# Custom documentation patterns
src2md project --doc-pat '*.md' '*.rst' '*.txt'
# Specific source file types
src2md project --src-pat '*.py' '*.js' '*.ts'
# Ignore patterns
src2md project --ignore-pat '*.pyc' 'node_modules/' '.git/'
Ignore Files
Create a .src2mdignore file in your project root:
# Dependencies
node_modules/
__pycache__/
*.pyc
# Build outputs
dist/
build/
*.egg-info/
# IDE files
.vscode/
.idea/
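For intuition about how such a file could be applied, here is a simplified matcher built on the standard `fnmatch` module. The exact matching rules src2md uses are not documented in this section, so treat the semantics below (trailing `/` matches a directory name anywhere in the path, anything else matches the file name) as an assumption; `load_patterns` and `is_ignored` are hypothetical helpers.

```python
import fnmatch
from pathlib import PurePosixPath


def load_patterns(text: str) -> list[str]:
    """Parse ignore-file text, skipping blank lines and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Return True if `path` matches any pattern under the simplified rules above."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: compare against every directory component.
            name = pattern.rstrip("/")
            if any(fnmatch.fnmatch(part, name) for part in parts[:-1]):
                return True
        elif fnmatch.fnmatch(parts[-1], pattern):
            # File pattern: compare against the file name only.
            return True
    return False


patterns = load_patterns("# Dependencies\nnode_modules/\n*.pyc\n\n# Build outputs\n*.egg-info/\n")
print(is_ignored("node_modules/lib/index.js", patterns))
print(is_ignored("src/app.py", patterns))
```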
Configuration
# Use custom ignore file
src2md project --ignore-file .gitignore
# Disable statistics
src2md project --no-stats
# Metadata only (no file contents)
src2md project --no-content
🎯 Use Cases
- LLM Context: Generate structured context for AI/ML models
- Documentation: Create beautiful project documentation
- Code Analysis: Extract metrics and statistics from codebases
- Data Export: Convert code to structured formats for analysis
- Archive: Create comprehensive snapshots of projects
- CI/CD: Generate documentation automatically in build pipelines
📈 Statistics & Metrics
src2md automatically generates:
- File Metrics: Counts by type and language
- Code Complexity: Cyclomatic complexity scores
- Token Usage: Actual token counts for LLM context
- Compression Stats: Before/after summarization metrics
- Importance Scores: File prioritization rankings
- Language Breakdown: Distribution of code by language
- Structure Analysis: Dependency and module relationships
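As a rough illustration of the complexity metric, cyclomatic complexity for Python can be approximated by counting decision points in the syntax tree. This is a common approximation and an assumption about the approach, not necessarily how src2md computes `project_complexity`; `cyclomatic_complexity` is a hypothetical function.

```python
import ast

# Node types that each add one decision point in this approximation.
_BRANCH_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a or b or c` adds two extra paths: one per additional operand.
            complexity += len(node.values) - 1
    return complexity


sample = '''
def classify(n):
    if n < 0 or n > 100:
        return "out of range"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "prime"
'''
score = cyclomatic_complexity(sample)
print(score)
```

For the sample, the two `if` statements, the `for` loop, and the `or` each add one to the base value of 1.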
🤝 Migration from v0.x
The new version is backward compatible. Existing commands work unchanged:
# This still works exactly as before
src2md project -o docs.md --doc-pat '*.md' --src-pat '*.py'
New features are opt-in through additional flags and the Python API.
📄 License
MIT License - see LICENSE file for details.
File details
Details for the file src2md-2.1.0.tar.gz.
File metadata
- Download URL: src2md-2.1.0.tar.gz
- Upload date:
- Size: 65.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2205b4aa826ae5cf9a93e4e5a96b486aeec0c7fa21fdd56382895b36259565ca |
| MD5 | 3fdb8e25d899e4252b24f1001400fc64 |
| BLAKE2b-256 | ad2cd9427f53c52ed0048671c0d0c6f7fafd4e287ec7f79feaf3dea62e2b9efe |
File details
Details for the file src2md-2.1.0-py3-none-any.whl.
File metadata
- Download URL: src2md-2.1.0-py3-none-any.whl
- Upload date:
- Size: 42.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56c3cd2c6cbb0f3d1f26287319ec89a794be945ae224c7d6ab611093d1538c3b |
| MD5 | 071ef7960b2bc476a25edc6a886a0b92 |
| BLAKE2b-256 | 2a942cc2a8820f70628da4a038292142f8c244c0f4af94094b65cf45c5baa88c |