repocontext

Extract repository contents into formatted text for LLM context.

A Python library that fetches GitHub repositories, builds hierarchical file trees, and generates formatted text output containing the directory structure, the file contents, and token counts, so the result can be checked against LLM context limits.

Features

  • GitHub Support: Fetch public and private repositories
  • Async Operations: Efficient file fetching with concurrency control
  • Token Counting: Accurate GPT token counting via tiktoken
  • Extensible: Easy to add new providers (GitLab, Azure DevOps, etc.)
  • Structured Output: Directory trees with file contents in markdown

Installation

pip install repocontext

Quick Start

from repocontext import fetch

# Simple usage - synchronous
result = fetch("https://github.com/owner/repo")
print(result.markdown)
print(f"Tokens: {result.token_count}")

# With file contents, authenticating with a GitHub token
result = fetch("https://github.com/owner/repo", token="ghp_xxx", fetch_content=True)
print(result.markdown)

Usage Examples

Using the Provider Directly

import asyncio
from repocontext import GitHubProvider

async def main():
    provider = GitHubProvider()
    
    # Fetch repository
    result = await provider.get_repository(
        "https://github.com/owner/repo",
        token=None,
        fetch_content=True
    )
    
    print(result.directory_tree)
    print(f"Found {result.file_count} files")
    print(f"Total tokens: {result.token_count}")

asyncio.run(main())

Filtering Files

import asyncio
from repocontext import GitHubProvider

async def main():
    provider = GitHubProvider()
    nodes = await provider.fetch_tree("https://github.com/owner/repo")
    
    # Get only Python files
    py_files = [n for n in nodes if n.is_file() and n.get_extension() == ".py"]
    
    # Get files from specific directory
    src_files = [n for n in nodes if n.path.startswith("src/")]
    
    # Get files larger than 1KB
    large_files = [n for n in nodes if n.is_file() and n.size and n.size > 1024]

asyncio.run(main())
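
The same list-comprehension style extends to glob patterns. The sketch below is self-contained, so it uses a small stand-in `Node` dataclass (hypothetical, not the library's `FileNode`) and a helper named `match_paths`, which is also an assumption made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class Node:
    """Stand-in for a tree entry; hypothetical, not repocontext's FileNode."""
    path: str
    type: str  # "blob" for files, "tree" for directories


def match_paths(nodes, include, exclude=()):
    """Return file paths matching any include pattern and no exclude pattern."""
    matched = []
    for node in nodes:
        if node.type != "blob":
            continue  # skip directories
        if not any(fnmatch(node.path, pat) for pat in include):
            continue
        if any(fnmatch(node.path, pat) for pat in exclude):
            continue
        matched.append(node.path)
    return matched


nodes = [
    Node("src", "tree"),
    Node("src/main.py", "blob"),
    Node("src/utils/helpers.py", "blob"),
    Node("tests/test_main.py", "blob"),
    Node("README.md", "blob"),
]

selected = match_paths(nodes, include=["*.py"], exclude=["tests/*"])
print(selected)
```

Note that `fnmatch` lets `*` match across `/`, so `*.py` selects Python files at any depth.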

Building Trees and Formatting

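from repocontext import build_tree, Formatter

# `nodes` is the flat list returned by provider.fetch_tree(...),
# `selected_files` is the subset of those nodes to mark as selected, and
# `contents` is assumed to be the FileContent objects fetched for them.

# Build tree from flat nodes
tree = build_tree(
    nodes,
    selected_paths={n.path for n in selected_files},
    excluded_paths=set(),
    expanded_paths={n.path for n in nodes if n.is_directory()},
)

# Format as markdown
markdown = Formatter.format_markdown(tree, contents)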
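
To make the flat-to-hierarchical step concrete, here is a minimal sketch of how a list of file paths can be folded into a nested structure and rendered as a text tree. It is not the library's `build_tree` or `Formatter` implementation; `build_nested` and `render` are hypothetical names, and the input is assumed to contain file paths only.

```python
def build_nested(paths):
    """Turn flat 'a/b/c.py' file paths into nested dicts; files map to None."""
    root = {}
    for path in sorted(paths):
        parts = path.split("/")
        cursor = root
        for part in parts[:-1]:
            cursor = cursor.setdefault(part, {})  # create directories on demand
        cursor[parts[-1]] = None
    return root


def render(tree, prefix=""):
    """Render the nested dict as a list of tree-drawing lines."""
    lines = []
    entries = list(tree.items())
    for index, (name, child) in enumerate(entries):
        last = index == len(entries) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        if child is not None:  # a directory: recurse with a deeper prefix
            lines.extend(render(child, prefix + ("    " if last else "│   ")))
    return lines


paths = ["README.md", "src/main.py", "src/utils/helpers.py"]
tree = build_nested(paths)
print("\n".join(render(tree)))
```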

API Reference

Main Function

from repocontext import fetch

result = fetch(
    url="https://github.com/owner/repo",  # Required
    token=None,        # Optional GitHub token for private repos
    fetch_content=False # Set True to include file contents
)

Returns RepositoryResult with:

  • url - The repository URL
  • branch - The resolved branch name
  • files - List of FileNode objects
  • directories - List of directory paths
  • contents - List of FileContent objects (when fetch_content=True)
  • markdown - Full markdown output
  • directory_tree - ASCII tree representation
  • token_count - Total token count
  • line_count - Total line count
  • file_count - Number of files
  • stats - Statistics dictionary
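
The per-file token counts make it possible to trim a repository to a context budget. The sketch below shows one simple policy, smallest files first, using a stand-in `Item` dataclass that mirrors only the `path` and `token_count` fields; `Item` and `select_within_budget` are hypothetical and not part of the package.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Stand-in for a fetched file; hypothetical, mirrors path and token_count."""
    path: str
    token_count: int


def select_within_budget(items, budget):
    """Greedily keep the smallest files first until the token budget is spent."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i.token_count):
        if used + item.token_count > budget:
            break  # every remaining file is at least this large
        chosen.append(item.path)
        used += item.token_count
    return chosen, used


items = [
    Item("src/main.py", 1200),
    Item("README.md", 300),
    Item("data/fixtures.json", 9000),
    Item("src/utils.py", 450),
]

chosen, used = select_within_budget(items, budget=2000)
print(chosen, used)
```

Smallest-first maximizes the number of files included; a real selection might instead prioritize by directory or relevance.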

Providers

GitHubProvider

import asyncio
from repocontext import GitHubProvider

async def main():
    provider = GitHubProvider()
    url = "https://github.com/owner/repo"

    # Set credentials (optional for public repos)
    provider.set_credentials("ghp_your_token_here")

    # Get full repository result
    result = await provider.get_repository(
        url=url,
        token=None,
        fetch_content=False,
        branch=None  # Optional branch override
    )

    # Fetch tree only
    nodes = await provider.fetch_tree(url, branch="main", path="src")

    # Fetch multiple files with concurrency
    file_nodes = [n for n in nodes if n.is_file()]
    async for content in provider.fetch_multiple(file_nodes):
        print(content.path, len(content.text))

asyncio.run(main())
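
For readers curious how bounded-concurrency fetching generally works, the following sketch caps the number of in-flight requests with `asyncio.Semaphore`. It is an illustration of the technique, not the package's internals: `fetch_one` and `fetch_all` are hypothetical names, and a short sleep stands in for the network call.

```python
import asyncio


async def fetch_one(path, semaphore, active):
    """Simulate one download while tracking how many run at once."""
    async with semaphore:
        active["now"] += 1
        active["peak"] = max(active["peak"], active["now"])
        await asyncio.sleep(0.01)  # stands in for the network request
        active["now"] -= 1
        return path, f"contents of {path}"


async def fetch_all(paths, limit=3):
    """Fetch every path with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)
    active = {"now": 0, "peak": 0}
    tasks = [asyncio.create_task(fetch_one(p, semaphore, active)) for p in paths]
    results = {}
    for finished in asyncio.as_completed(tasks):
        path, text = await finished  # results arrive in completion order
        results[path] = text
    return results, active["peak"]


paths = [f"src/file_{i}.py" for i in range(10)]
results, peak = asyncio.run(fetch_all(paths, limit=3))
print(len(results), peak)
```

The recorded peak never exceeds `limit`, which is the property that keeps a large repository from triggering a burst of simultaneous requests.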

Types

FileNode

from repocontext import FileNode

node = FileNode(path="src/main.py", type="blob", size=1024, sha="abc123")

node.is_file()        # True
node.is_directory()   # False
node.get_name()       # "main.py"
node.get_extension()  # ".py"

TreeNode

from repocontext import TreeNode

node = TreeNode(
    name="src",
    path="src",
    type="directory",
    children=[...],
    selected=True
)

node.is_file()        # False
node.is_directory()   # True

FileContent

from repocontext import FileContent

content = FileContent(
    path="src/main.py",
    text="print('hello')",
    url="https://...",
    line_count=1,
    token_count=3
)

Formatter

from repocontext import Formatter

# Count tokens
tokens = Formatter.count_tokens("hello world")

# Format project tree
tree_str = Formatter.format_project_tree(tree_nodes)

# Format as markdown
markdown = Formatter.format_markdown(tree, contents)

Tree Building

from repocontext import build_tree, extract_directories

# Build hierarchical tree from flat nodes
tree = build_tree(
    nodes,
    selected_paths=set_of_selected_paths,
    excluded_paths=set_of_excluded_paths,
    expanded_paths=set_of_expanded_directories
)

# Extract all directory paths
dirs = extract_directories(nodes)

Exception Handling

from repocontext import (
    fetch,
    InvalidURLError,
    AuthenticationError,
    NotFoundError,
    RateLimitError,
    NetworkError,
)

try:
    result = fetch("https://github.com/owner/repo")
except InvalidURLError as e:
    print(f"Invalid URL: {e.user_message}")
except AuthenticationError as e:
    print(f"Auth failed: {e.user_message}")
except NotFoundError as e:
    print(f"Not found: {e.user_message}")
except RateLimitError as e:
    print(f"Rate limited: {e.user_message}")
except NetworkError as e:
    print(f"Network error: {e.user_message}")
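
Rate-limit failures are often transient, so callers may want to retry with a growing delay. The sketch below shows one way to do that under stated assumptions: it defines its own stand-in `RateLimitError` so it runs without the package, `with_retries` is a hypothetical helper, and the delays are tiny for demonstration (real backoff would use seconds).

```python
import time


class RateLimitError(Exception):
    """Stand-in for the library's exception so the sketch runs on its own."""


def with_retries(operation, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except RateLimitError:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller handle it
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_fetch():
    """Fail twice with a rate limit, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("rate limited")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
outcome = with_retries(flaky_fetch, sleep=delays.append)
print(outcome, calls["count"], delays)
```

In real use, `operation` could be something like `lambda: fetch(url)` with the package's own `RateLimitError` in the `except` clause.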

Extending the Package

Adding a New Provider

Create a new provider by extending BaseProvider:

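from repocontext import FileContent, FileNode
from repocontext.providers import BaseProvider, register_provider
# ParsedRepoInfo must also be imported; its import path is not shown here.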

@register_provider("gitlab")
class GitLabProvider(BaseProvider):
    API_BASE = "https://gitlab.com/api/v4"

    @property
    def get_type(self) -> str:
        return "gitlab"

    @property
    def get_name(self) -> str:
        return "GitLab"

    def requires_auth(self) -> bool:
        return True

    def validate_url(self, url: str) -> bool:
        return url.startswith("https://gitlab.com/")

    def parse_url(self, url: str) -> ParsedRepoInfo:
        # Parse the URL and return ParsedRepoInfo
        ...

    async def _fetch_tree(self, url: str, **options) -> list[FileNode]:
        # Fetch the repository tree
        ...

    async def _fetch_file_content(self, node: FileNode) -> FileContent:
        # Fetch a single file's content
        ...

License

MIT

