
Convert source code to structured, context-optimized markdown for LLMs with intelligent summarization


src2md


src2md converts source code repositories into structured representations sized to fit the context windows of Large Language Models (LLMs). Most real codebases are larger than a model's context window, so src2md keeps the most important information and shrinks the rest through AST-based analysis, intelligent summarization, and optional LLM-powered compression.

🚀 Features

New in v2.0

  • 🎯 Context Window Optimization: Intelligently fit codebases into LLM context windows with smart truncation
  • 📝 Intelligent Summarization: AST-based code analysis with multiple compression levels
  • 🤖 LLM-Powered Compression: Optional OpenAI/Anthropic integration for semantic summarization
  • ⚡ Fluent API: Elegant method chaining with new summarization methods
  • 📊 File Importance Scoring: Multi-factor analysis to prioritize critical files
  • 🪟 Predefined LLM Windows: Built-in support for GPT-4, Claude, and more
  • 🔄 Progressive Summarization: Multi-tier compression strategies for different file types
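To make the interplay between importance scoring and a token budget concrete, here is a minimal sketch of greedy budget packing. It is not src2md's implementation: the `FileEntry` type, the `importance` field, and `pack_into_budget` are hypothetical names, and the scores and token counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    path: str
    tokens: int        # token count for the file (however it was measured)
    importance: float  # hypothetical score in [0, 1]; higher means more critical


def pack_into_budget(files: list[FileEntry], budget: int) -> tuple[list[FileEntry], list[FileEntry]]:
    """Greedily keep the most important files that still fit in the budget.

    Returns (kept, overflow). Overflow files are candidates for
    summarization or truncation rather than outright removal.
    """
    kept, overflow = [], []
    remaining = budget
    # Highest importance first; ties are broken by the smaller token cost.
    for entry in sorted(files, key=lambda f: (-f.importance, f.tokens)):
        if entry.tokens <= remaining:
            kept.append(entry)
            remaining -= entry.tokens
        else:
            overflow.append(entry)
    return kept, overflow


files = [
    FileEntry("main.py", tokens=400, importance=0.9),
    FileEntry("core/engine.py", tokens=700, importance=0.8),
    FileEntry("tests/test_engine.py", tokens=600, importance=0.3),
    FileEntry("README.md", tokens=200, importance=0.5),
]
kept, overflow = pack_into_budget(files, budget=1_300)
print([f.path for f in kept])      # files included verbatim
print([f.path for f in overflow])  # files that would need compression
```

A greedy pass like this is simple and predictable; a real tool would likely summarize the overflow files instead of dropping them.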

Core Features

  • Multiple Output Formats: JSON, JSONL, Markdown, HTML, and plain text
  • Smart Token Management: Accurate token counting with tiktoken and structure-aware truncation
  • Multi-Language Support: Specialized summarizers for Python, JavaScript, TypeScript, JSON, YAML
  • Code Statistics: Automatic generation of project metrics and complexity analysis
  • Flexible Filtering: Customizable include/exclude patterns
  • Rich CLI Interface: Beautiful progress indicators and colored output

📦 Installation

Install via PyPI using pip:

pip install src2md

🛠️ Usage

Quick Start - Fluent API

from src2md import Repository, ContextWindow

# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()

# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown())

# Full fluent API with all features
result = (Repository("/path/to/project")
    .name("MyProject")
    .branch("main")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .with_summarization(
        compression_ratio=0.3,  # Target 30% of original size
        preserve_important=True,  # Keep critical files intact
        use_llm=True  # Use LLM if available
    )
    .prioritize(["main.py", "core/"])
    .optimize_for_tokens(100_000)  # 100K token limit
    .analyze()
    .to_json(pretty=True))

Command Line Interface

# Basic markdown generation
src2md /path/to/project -o documentation.md

# With context optimization
src2md /path/to/project --gpt4 -o optimized.md
src2md /path/to/project --claude3 --importance

# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3
src2md /path/to/project --summarize-tests --summarize-docs

# With LLM-powered summarization (requires API key)
src2md /path/to/project --use-llm --llm-model gpt-3.5-turbo

# Multiple output formats
src2md /path/to/project --format json --pretty
src2md /path/to/project --format html -o docs.html

Python API Examples

Basic Context Optimization

from src2md import Repository, ContextWindow

# Optimize for different LLM context windows
repo = Repository("./my-project")
output = repo.optimize_for(ContextWindow.CLAUDE_3).analyze().to_markdown()

# Custom token limit with importance scoring
repo = (Repository("./my-project")
    .with_importance_scoring()
    .optimize_for_tokens(50_000)
    .analyze())

Intelligent Summarization

# Enable smart summarization with compression
repo = (Repository("./my-project")
    .with_summarization(
        compression_ratio=0.3,  # Compress to 30% of original
        preserve_important=True,  # Keep critical files intact
        use_llm=False  # Use AST-based summarization
    )
    .optimize_for(ContextWindow.GPT_4)
    .analyze())

# Use LLM-powered summarization (requires API key)
import os
# Placeholder only: prefer exporting OPENAI_API_KEY in your shell over hardcoding a real key
os.environ['OPENAI_API_KEY'] = 'your-key-here'

repo = (Repository("./my-project")
    .with_summarization(
        compression_ratio=0.2,  # More aggressive compression
        use_llm=True,
        llm_model="gpt-3.5-turbo"
    )
    .analyze())

Multi-Tier Compression Strategy

# Configure different summarization levels for different file types
repo = (Repository("./my-project")
    .with_importance_scoring()
    .prioritize(["src/core/", "api/"])  # Critical paths
    .summarize_tests()  # Compress test files
    .summarize_docs()   # Compress documentation
    .with_summarization(
        compression_ratio=0.25,
        preserve_important=True
    )
    .optimize_for_tokens(100_000)
    .analyze())

# Access summarization metadata
data = repo.to_dict()
for file in data['source_files']:
    if file.get('was_summarized'):
        print(f"Summarized {file['path']}: {file['original_size']} -> {file['size']} bytes")

Generate Multiple Formats

repo = Repository("./my-project").analyze()
markdown = repo.to_markdown()
json_data = repo.to_json()
html_doc = repo.to_html()

# Access raw data
data = repo.to_dict()
print(f"Files: {data['metadata']['file_count']}")
print(f"Token usage: {data['metadata'].get('total_tokens', 0)}")
print(f"Compression achieved: {data['metadata'].get('compression_ratio', 1.0):.1%}")

🎯 Summarization Features

AST-Based Python Summarization

src2md uses Abstract Syntax Tree (AST) analysis to summarize Python code while preserving its structure. The following summarization levels are available:

  • MINIMAL: Only class/function signatures
  • OUTLINE: Signatures with structural hierarchy
  • DOCSTRINGS: Signatures plus documentation
  • SIGNATURES: Full signatures with type hints
  • FULL: No summarization
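As a rough illustration of what signature-level and docstring-level summaries involve, the sketch below uses Python's standard `ast` module to reduce a module to its class and function signatures, optionally with the first line of each docstring. The `summarize` function and its `keep_docstrings` flag are hypothetical and are not part of src2md's API; the tool's own output may differ.

```python
import ast

SOURCE = '''
import os

class Greeter:
    """Builds greetings."""

    def greet(self, name: str) -> str:
        """Return a greeting for name."""
        return f"Hello, {name}!"

def main() -> None:
    print(Greeter().greet(os.getlogin()))
'''


def summarize(source: str, keep_docstrings: bool = True) -> str:
    """Reduce Python source to signatures, optionally with first docstring lines."""
    lines: list[str] = []

    def visit(node: ast.AST, depth: int) -> None:
        for child in ast.iter_child_nodes(node):
            indent = "    " * depth
            if isinstance(child, ast.ClassDef):
                lines.append(f"{indent}class {child.name}:")
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                prefix = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                returns = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                lines.append(f"{indent}{prefix} {child.name}({ast.unparse(child.args)}){returns}:")
            else:
                continue  # imports, assignments, and expressions are skipped
            doc = ast.get_docstring(child)
            if keep_docstrings and doc:
                lines.append(f'{indent}    """{doc.splitlines()[0]}"""')
            if isinstance(child, ast.ClassDef):
                visit(child, depth + 1)  # descend into methods only

    visit(ast.parse(source), 0)
    return "\n".join(lines)


print(summarize(SOURCE, keep_docstrings=False))  # signatures only
print(summarize(SOURCE))                         # signatures plus docstrings
```

Function bodies are never emitted, so the size of the summary depends on the number of definitions rather than on the amount of code inside them.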

Multi-Language Support

Specialized summarizers for different file types:

  • Python: AST-based analysis with import/export preservation
  • JavaScript/TypeScript: Function and class extraction
  • JSON/YAML: Schema extraction with sample data
  • Test Files: Test name and assertion extraction
  • Documentation: Heading and key point extraction
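One plausible reading of "schema extraction" for JSON files is to keep the nesting and replace concrete values with their type names, retaining only a sample of each list. The sketch below shows that idea with the standard `json` module; `infer_schema` and `max_items` are hypothetical names, and src2md's actual summarizer may work differently.

```python
import json


def infer_schema(value, max_items: int = 1):
    """Replace concrete values with type names, keeping the nesting intact."""
    if isinstance(value, dict):
        return {key: infer_schema(item, max_items) for key, item in value.items()}
    if isinstance(value, list):
        # Keep only the first few elements as representative samples.
        sample = [infer_schema(item, max_items) for item in value[:max_items]]
        if len(value) > max_items:
            sample.append(f"... {len(value) - max_items} more")
        return sample
    return type(value).__name__


raw = '{"name": "demo", "version": 2, "tags": ["a", "b", "c"], "private": false}'
schema = infer_schema(json.loads(raw))
print(json.dumps(schema, indent=2))
```

For large data files this keeps the shape an LLM needs to reason about the file while discarding most of the payload.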

Smart Truncation

When files must be truncated to fit token limits:

  1. Preserves code structure (complete functions/classes)
  2. Maintains syntax validity
  3. Prioritizes public APIs over private methods
  4. Keeps imports and exports intact
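The four rules above can be sketched with the standard `ast` module. This is an illustrative approximation, not src2md's algorithm: `truncate_by_definition` is a hypothetical helper, it budgets by characters instead of tokens, it treats a leading underscore as "private", and it drops top-level statements that are not imports or definitions.

```python
import ast


def truncate_by_definition(source: str, max_chars: int) -> str:
    """Keep imports, then whole top-level definitions, public names first."""
    tree = ast.parse(source)
    imports, public, private = [], [], []
    for node in tree.body:
        segment = ast.get_source_segment(source, node)
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.append((node.lineno, segment))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bucket = private if node.name.startswith("_") else public
            bucket.append((node.lineno, segment))

    kept = list(imports)                      # imports are always kept
    used = sum(len(text) for _, text in kept)
    for lineno, text in public + private:     # public API gets the budget first
        if used + len(text) <= max_chars:     # only whole definitions are added
            kept.append((lineno, text))
            used += len(text)

    # Restore source order so the result still reads top to bottom.
    return "\n\n".join(text for _, text in sorted(kept))


SOURCE = '''import os

def _helper():
    return os.getcwd()

def public_api():
    return _helper()
'''
result = truncate_by_definition(SOURCE, max_chars=60)
print(result)
```

Because definitions are kept or dropped as whole units, the truncated text still parses, although it may reference names that were removed.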

LLM-Powered Summarization

Optional integration with OpenAI and Anthropic for semantic compression:

# Set API keys
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"

# Use LLM summarization
src2md /path/to/project --use-llm --llm-model gpt-3.5-turbo
src2md /path/to/project --use-llm --llm-model claude-3-haiku-20240307

📊 Output Formats

JSON

Structured data perfect for programmatic processing:

{
  "metadata": {
    "project_name": "my-project",
    "generated_at": "2025-01-01T12:00:00",
    "patterns": {...}
  },
  "statistics": {
    "total_files": 42,
    "languages": {"python": {"count": 15, "total_size": 50000}},
    "project_complexity": 3.2
  },
  "documentation": [...],
  "source_files": [...]
}

JSONL

One JSON object per line - perfect for streaming and big data tools:

{"type": "metadata", "data": {...}}
{"type": "statistics", "data": {...}}
{"type": "source_file", "data": {...}}
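Because each line is a self-contained record with a `type` and a `data` field, the output can be consumed one line at a time. The sketch below groups records by type using only the standard library; the sample payloads (including the `path` key) are invented for illustration, and `group_records` is a hypothetical helper.

```python
import io
import json
from collections import defaultdict

# Stand-in for an exported file; the "data" payloads here are invented.
sample = io.StringIO(
    '{"type": "metadata", "data": {"project_name": "my-project"}}\n'
    '{"type": "statistics", "data": {"total_files": 2}}\n'
    '{"type": "source_file", "data": {"path": "a.py"}}\n'
    '{"type": "source_file", "data": {"path": "b.py"}}\n'
)


def group_records(lines):
    """Parse one JSON object per line and group the payloads by record type."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():                 # tolerate blank lines
            record = json.loads(line)
            groups[record["type"]].append(record["data"])
    return groups


groups = group_records(sample)
print(len(groups["source_file"]))
```

With a real export, replace `sample` with an open file handle; iterating over the handle avoids loading the whole file into memory.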

HTML

Beautiful, styled documentation ready for the web with syntax highlighting and responsive design.

Markdown

Clean, readable documentation compatible with GitHub, GitLab, and other platforms.

🔧 Advanced Options

File Patterns

# Custom documentation patterns
src2md project --doc-pat '*.md' '*.rst' '*.txt'

# Specific source file types
src2md project --src-pat '*.py' '*.js' '*.ts'

# Ignore patterns
src2md project --ignore-pat '*.pyc' 'node_modules/' '.git/'

Ignore Files

Create a .src2mdignore file in your project root:

# Dependencies
node_modules/
__pycache__/
*.pyc

# Build outputs
dist/
build/
*.egg-info/

# IDE files
.vscode/
.idea/

Configuration

# Use custom ignore file
src2md project --ignore-file .gitignore

# Disable statistics
src2md project --no-stats

# Metadata only (no file contents)
src2md project --no-content

🎯 Use Cases

  • LLM Context: Generate structured context for AI/ML models
  • Documentation: Create beautiful project documentation
  • Code Analysis: Extract metrics and statistics from codebases
  • Data Export: Convert code to structured formats for analysis
  • Archive: Create comprehensive snapshots of projects
  • CI/CD: Generate documentation automatically in build pipelines

📈 Statistics & Metrics

src2md automatically generates:

  • File Metrics: Counts by type and language
  • Code Complexity: Cyclomatic complexity scores
  • Token Usage: Actual token counts for LLM context
  • Compression Stats: Before/after summarization metrics
  • Importance Scores: File prioritization rankings
  • Language Breakdown: Distribution of code by language
  • Structure Analysis: Dependency and module relationships
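For readers unfamiliar with cyclomatic complexity, the sketch below approximates it as one plus the number of decision points found in the AST. It is a simplified version of the standard McCabe metric computed over a whole module, not necessarily the formula src2md uses; `cyclomatic_complexity` and `BRANCH_NODES` are hypothetical names.

```python
import ast

# Node types that each add one decision point.
BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


SOURCE = '''
def classify(n):
    if n < 0 and n % 2:
        return "negative odd"
    for i in range(n):
        if i == 3:
            return "has three"
    return "other"
'''
print(cyclomatic_complexity(SOURCE))  # → 5
```

The example scores 5: a base of 1, two `if` statements, one `for` loop, and one `and` operator.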

🤝 Migration from v0.x

The new version is backward compatible with v0.x. Existing commands work unchanged:

# This still works exactly as before
src2md project -o docs.md --doc-pat '*.md' --src-pat '*.py'

New features are opt-in through additional flags and the Python API.

📄 License

MIT License - see LICENSE file for details.
