Convert source code to structured formats with intelligent LLM context optimization

src2md

src2md converts source code repositories into structured formats optimized for Large Language Models (LLMs). It combines context window management, file importance scoring, and a fluent API to help you fit code context into an AI model's prompt.

🚀 Features

New in v2.0

  • 🎯 Context Window Optimization: Intelligently fit codebases into LLM context windows
  • ⚡ Fluent API: Elegant method chaining for intuitive usage
  • 📊 File Importance Scoring: Multi-factor analysis to prioritize critical files
  • 🪟 Predefined LLM Windows: Built-in support for GPT-4, Claude, and more
  • 🔄 Progressive Summarization: Compress less important files to fit token limits
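
To make the idea of fitting a codebase into a token budget concrete, here is a minimal sketch of one possible strategy: sort files by an importance score and greedily keep those that still fit. The `FileEntry` class, the `fit_to_budget` function, and the scores and token counts are all hypothetical and chosen for illustration; this is not src2md's actual algorithm, which may also summarize files rather than skip them.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    path: str
    tokens: int        # estimated token count of the full file (made-up values below)
    importance: float  # higher means more important (hypothetical score)


def fit_to_budget(files, budget):
    """Greedily keep the most important files that fit within `budget` tokens."""
    kept, skipped = [], []
    remaining = budget
    for entry in sorted(files, key=lambda f: f.importance, reverse=True):
        if entry.tokens <= remaining:
            kept.append(entry.path)
            remaining -= entry.tokens
        else:
            # A real tool might summarize this file instead of dropping it.
            skipped.append(entry.path)
    return kept, skipped


files = [
    FileEntry("main.py", 400, 0.9),
    FileEntry("core/engine.py", 700, 0.8),
    FileEntry("tests/test_engine.py", 500, 0.2),
    FileEntry("README.md", 100, 0.5),
]
kept, skipped = fit_to_budget(files, budget=1200)
print("kept:", kept)
print("skipped:", skipped)
```

A greedy pass like this is simple and predictable, but it does not guarantee the best possible use of the budget.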

Core Features

  • Multiple Output Formats: JSON, JSONL, Markdown, HTML, and plain text
  • Smart Token Management: Accurate token counting with tiktoken
  • Code Statistics: Automatic generation of project metrics and complexity analysis
  • Flexible Filtering: Customizable include/exclude patterns
  • Rich CLI Interface: Beautiful progress indicators and colored output

📦 Installation

Install via PyPI using pip:

pip install src2md

🛠️ Usage

Quick Start - Fluent API (New in v2.0!)

from src2md import Repository, ContextWindow

# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()

# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown())

# Full fluent API with all features
result = (Repository("/path/to/project")
    .name("MyProject")
    .branch("main")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .prioritize(["main.py", "core/"])
    .optimize_for_tokens(100_000)  # 100K token limit
    .analyze()
    .to_json(pretty=True))
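
For readers new to fluent interfaces, the chaining above works because each configuration method returns the object it was called on. The sketch below shows the pattern with a hypothetical `RepoQuery` class; it is not src2md's `Repository` implementation, and its method names are borrowed from the examples above purely for familiarity.

```python
class RepoQuery:
    """Hypothetical builder that illustrates method chaining only."""

    def __init__(self, path):
        self.path = path
        self.includes = []
        self.excludes = []
        self.token_limit = None

    def include(self, *patterns):
        self.includes.extend(patterns)
        return self  # returning self is what allows the next call to chain

    def exclude(self, *patterns):
        self.excludes.extend(patterns)
        return self

    def optimize_for_tokens(self, limit):
        self.token_limit = limit
        return self

    def describe(self):
        """Return the accumulated configuration as a plain dict."""
        return {
            "path": self.path,
            "include": list(self.includes),
            "exclude": list(self.excludes),
            "token_limit": self.token_limit,
        }


config = (RepoQuery("/path/to/project")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .optimize_for_tokens(100_000)
    .describe())
print(config)
```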

Command Line Interface

# Basic markdown generation
src2md /path/to/project -o documentation.md

# With context optimization (planned; not yet available in the CLI)
src2md /path/to/project --optimize-for gpt-4 -o optimized.md

# Multiple output formats
src2md /path/to/project --format json --pretty
src2md /path/to/project --format html -o docs.html

Python API Examples

from src2md import Repository, ContextWindow

# Example 1: Optimize for Claude 3
repo = Repository("./my-project")
output = repo.optimize_for(ContextWindow.CLAUDE_3).analyze().to_markdown()

# Example 2: Custom token limit with importance scoring
repo = (Repository("./my-project")
    .with_importance_scoring()
    .optimize_for_tokens(50_000)
    .analyze())

# Example 3: Generate multiple formats
repo = Repository("./my-project").analyze()
markdown = repo.to_markdown()
json_data = repo.to_json()
html_doc = repo.to_html()

# Example 4: Access raw data
data = repo.to_dict()
print(f"Files: {data['metadata']['file_count']}")
print(f"Languages: {list(data['statistics']['languages'].keys())}")

📊 Output Formats

JSON

Structured data suited to programmatic processing:

{
  "metadata": {
    "project_name": "my-project",
    "generated_at": "2025-01-01T12:00:00",
    "patterns": {...}
  },
  "statistics": {
    "total_files": 42,
    "languages": {"python": {"count": 15, "total_size": 50000}},
    "project_complexity": 3.2
  },
  "documentation": [...],
  "source_files": [...]
}
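
As an illustration of consuming this output, the sketch below loads a document with the shape shown above and derives an average file size per language. The inline `sample` string stands in for a real src2md export, with the elided fields (`...`) replaced by empty placeholders; only the keys visible in the example above are assumed to exist.

```python
import json

# Stand-in for a real export; elided fields are replaced by empty placeholders.
sample = """
{
  "metadata": {"project_name": "my-project", "generated_at": "2025-01-01T12:00:00"},
  "statistics": {
    "total_files": 42,
    "languages": {"python": {"count": 15, "total_size": 50000}},
    "project_complexity": 3.2
  },
  "documentation": [],
  "source_files": []
}
"""

doc = json.loads(sample)
stats = doc["statistics"]

# Average size per file for each language, using integer division.
avg_size = {
    lang: info["total_size"] // info["count"]
    for lang, info in stats["languages"].items()
    if info["count"] > 0
}

print(doc["metadata"]["project_name"], stats["total_files"], avg_size)
```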

JSONL

One JSON object per line, suited to streaming and line-oriented data tools:

{"type": "metadata", "data": {...}}
{"type": "statistics", "data": {...}}
{"type": "source_file", "data": {...}}
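
Because each line is an independent JSON object, the output can be processed one record at a time without loading the whole file. The sketch below groups records by their `type` field. The `group_records` helper is hypothetical, and the `path` and `total_files` keys inside `data` are assumptions made for the sample, since the example above elides the payload.

```python
import io
import json
from collections import defaultdict


def group_records(stream):
    """Read JSONL from a text stream and group each record's data by type."""
    groups = defaultdict(list)
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        groups[record["type"]].append(record["data"])
    return dict(groups)


# In practice this would be an open file; the payload keys are assumed.
sample = io.StringIO(
    '{"type": "metadata", "data": {"project_name": "my-project"}}\n'
    '{"type": "statistics", "data": {"total_files": 2}}\n'
    '{"type": "source_file", "data": {"path": "main.py"}}\n'
    '{"type": "source_file", "data": {"path": "core/engine.py"}}\n'
)

groups = group_records(sample)
print({kind: len(items) for kind, items in groups.items()})
```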

HTML

Styled documentation for the web, with syntax highlighting and responsive design.

Markdown

Clean, readable documentation compatible with GitHub, GitLab, and other platforms.

🔧 Advanced Options

File Patterns

# Custom documentation patterns
src2md project --doc-pat '*.md' '*.rst' '*.txt'

# Specific source file types
src2md project --src-pat '*.py' '*.js' '*.ts'

# Ignore patterns
src2md project --ignore-pat '*.pyc' 'node_modules/' '.git/'
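
To show how glob patterns like these can sort files into categories, here is a minimal sketch built on the standard library's `fnmatch` module. The `classify` function and its precedence (ignore first, then documentation, then source) are assumptions for illustration; src2md's real matching rules may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def classify(path, doc_patterns, src_patterns, ignore_patterns):
    """Return 'ignored', 'doc', 'source', or 'other' for a relative POSIX path."""
    p = PurePosixPath(path)
    for pattern in ignore_patterns:
        if pattern.endswith("/"):
            # Directory pattern: match any parent directory by name.
            if pattern.rstrip("/") in p.parts[:-1]:
                return "ignored"
        elif fnmatchcase(p.name, pattern):
            return "ignored"
    if any(fnmatchcase(p.name, pattern) for pattern in doc_patterns):
        return "doc"
    if any(fnmatchcase(p.name, pattern) for pattern in src_patterns):
        return "source"
    return "other"


doc_pat = ["*.md", "*.rst", "*.txt"]
src_pat = ["*.py", "*.js", "*.ts"]
ignore_pat = ["*.pyc", "node_modules/", ".git/"]

for path in ["README.md", "src/app.py", "src/app.pyc",
             "node_modules/lib/index.js", "Makefile"]:
    print(path, "->", classify(path, doc_pat, src_pat, ignore_pat))
```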

Ignore Files

Create a .src2mdignore file in your project root:

# Dependencies
node_modules/
__pycache__/
*.pyc

# Build outputs
dist/
build/
*.egg-info/

# IDE files
.vscode/
.idea/
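
A file in this style can be reduced to a list of patterns by dropping blank lines and `#` comments. The sketch below does that for the example above; `parse_ignore_file` is a hypothetical helper, and it assumes only the syntax visible in the example, not the full set of rules src2md may support.

```python
def parse_ignore_file(text):
    """Return the non-empty, non-comment lines of an ignore file as patterns."""
    patterns = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        patterns.append(line)
    return patterns


ignore_text = """\
# Dependencies
node_modules/
__pycache__/
*.pyc

# Build outputs
dist/
build/
*.egg-info/

# IDE files
.vscode/
.idea/
"""

patterns = parse_ignore_file(ignore_text)
directories = [p for p in patterns if p.endswith("/")]
print(patterns)
print(directories)
```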

Configuration

# Use custom ignore file
src2md project --ignore-file .gitignore

# Disable statistics
src2md project --no-stats

# Metadata only (no file contents)
src2md project --no-content

🎯 Use Cases

  • LLM Context: Generate structured context for AI/ML models
  • Documentation: Create beautiful project documentation
  • Code Analysis: Extract metrics and statistics from codebases
  • Data Export: Convert code to structured formats for analysis
  • Archive: Create comprehensive snapshots of projects
  • CI/CD: Generate documentation automatically in build pipelines

📈 Statistics & Metrics

src2md automatically generates:

  • File counts by type and language
  • Code complexity scores
  • Size metrics and distributions
  • Language breakdown
  • Project structure analysis
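
As a rough illustration of the file-count and size metrics listed above, the sketch below walks a directory and aggregates counts and byte sizes per file extension. The `language_breakdown` function is hypothetical and groups by extension only; how src2md detects languages or computes complexity is not described here.

```python
import tempfile
from collections import Counter
from pathlib import Path


def language_breakdown(root):
    """Count files and total bytes per file extension under `root`."""
    counts, sizes = Counter(), Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {ext: {"count": counts[ext], "total_size": sizes[ext]} for ext in counts}


# Build a tiny throwaway project so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg" / "a.py").write_bytes(b"x = 1\n")
    (root / "pkg" / "b.py").write_bytes(b"y = 22\n")
    (root / "README.md").write_bytes(b"# demo\n")
    breakdown = language_breakdown(root)

print(breakdown)
```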

🤝 Migration from v0.x

Version 2.0 is backward compatible with v0.x. Existing commands work unchanged:

# This still works exactly as before
src2md project -o docs.md --doc-pat '*.md' --src-pat '*.py'

New features are opt-in through additional flags and the Python API.

📄 License

MIT License - see LICENSE file for details.
