repocontext

Extract repository contents into formatted text for LLM context.

A Python library that fetches GitHub repositories, builds hierarchical file trees, and generates formatted text output containing the directory structure and file contents, along with token counts to help stay within LLM context limits.

Features

  • GitHub Support: Fetch public and private repositories
  • Async Operations: Efficient file fetching with concurrency control
  • Token Counting: Accurate GPT token counting via tiktoken
  • Extensible: Easy to add new providers (GitLab, Azure DevOps, etc.)
  • Structured Output: Directory trees with file contents in markdown

Installation

pip install repocontext

Quick Start

from repocontext import fetch

# Simple usage - synchronous
result = fetch("https://github.com/owner/repo")
print(result.markdown)
print(f"Tokens: {result.token_count}")

# With file contents (the token is only needed for private repos)
result = fetch("https://github.com/owner/repo", token="ghp_xxx", fetch_content=True)
print(result.markdown)

Usage Examples

Using the Provider Directly

import asyncio
from repocontext import GitHubProvider

async def main():
    provider = GitHubProvider()
    
    # Fetch repository
    result = await provider.get_repository(
        "https://github.com/owner/repo",
        token=None,
        fetch_content=True
    )
    
    print(result.directory_tree)
    print(f"Found {result.file_count} files")
    print(f"Total tokens: {result.token_count}")

asyncio.run(main())

Filtering Files

import asyncio
from repocontext import GitHubProvider

async def main():
    provider = GitHubProvider()
    nodes = await provider.fetch_tree("https://github.com/owner/repo")
    
    # Get only Python files
    py_files = [n for n in nodes if n.is_file() and n.get_extension() == ".py"]
    
    # Get files from specific directory
    src_files = [n for n in nodes if n.path.startswith("src/")]
    
    # Get files larger than 1KB
    large_files = [n for n in nodes if n.is_file() and n.size and n.size > 1024]

asyncio.run(main())
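
The three filters above can be combined into one reusable helper. The following is a minimal sketch that runs without the library: `Node` is a hypothetical stand-in for FileNode carrying only the `path`, `type`, and `size` fields, and `select` is an illustrative name, not part of repocontext.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for FileNode; only the fields used below."""
    path: str
    type: str  # "blob" marks a file; anything else is treated as a directory
    size: Optional[int] = None


def select(nodes, extension=None, prefix=None, min_size=None):
    """Return the file nodes that pass every filter that is not None."""
    selected = []
    for node in nodes:
        if node.type != "blob":
            continue  # skip directories
        if extension is not None and PurePosixPath(node.path).suffix != extension:
            continue
        if prefix is not None and not node.path.startswith(prefix):
            continue
        if min_size is not None and (node.size is None or node.size <= min_size):
            continue  # keep only files strictly larger than min_size
        selected.append(node)
    return selected


nodes = [
    Node("src", "tree"),
    Node("src/main.py", "blob", 2048),
    Node("src/util.py", "blob", 512),
    Node("README.md", "blob", 4096),
]
large_src_py = select(nodes, extension=".py", prefix="src/", min_size=1024)
print([n.path for n in large_src_py])
```

With real data, the same conditions would be applied to the nodes returned by `provider.fetch_tree(...)`, using the `is_file()` and `get_extension()` methods shown above.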

Building Trees and Formatting

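from repocontext import build_tree, Formatter

# Assumes `nodes` is the flat list returned by provider.fetch_tree(),
# `selected_files` is the subset of nodes to include, and `contents`
# holds the fetched file contents for that subset.

# Build tree from flat nodes
tree = build_tree(
    nodes,
    selected_paths={n.path for n in selected_files},
    excluded_paths=set(),
    expanded_paths={n.path for n in nodes if n.is_directory()},
)

# Format as markdown
markdown = Formatter.format_markdown(tree, contents)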
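
To illustrate what "building a tree from flat nodes" means, here is a minimal sketch that nests flat file paths into dictionaries. It does not reproduce `build_tree` (it ignores selection, exclusion, and expansion state), and `nest_paths` is a hypothetical name used only for this example.

```python
def nest_paths(file_paths):
    """Turn flat 'a/b/c.py' file paths into a nested dict.

    Directories become dicts and files map to None. The input is assumed
    to contain file paths only, not directory entries.
    """
    root = {}
    for path in sorted(file_paths):
        parts = path.split("/")
        current = root
        for directory in parts[:-1]:
            # Descend into the directory, creating it on first sight
            current = current.setdefault(directory, {})
        current[parts[-1]] = None
    return root


tree = nest_paths(["src/main.py", "src/pkg/util.py", "README.md"])
print(tree)
```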

API Reference

Main Function

from repocontext import fetch

result = fetch(
    url="https://github.com/owner/repo",  # Required
    token=None,        # Optional GitHub token for private repos
    fetch_content=False,  # Set True to include file contents
)

Returns RepositoryResult with:

  • url - The repository URL
  • branch - The resolved branch name
  • files - List of FileNode objects
  • directories - List of directory paths
  • contents - List of FileContent objects (when fetch_content=True)
  • markdown - Full markdown output
  • directory_tree - ASCII tree representation
  • token_count - Total token count
  • line_count - Total line count
  • file_count - Number of files
  • stats - Statistics dictionary
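
Because token counts are reported, a caller can trim a selection to fit a model's context window. The sketch below assumes (path, token_count) pairs, such as those that could be read from the FileContent objects in `result.contents`. `fit_to_budget` and the sample numbers are hypothetical and not part of repocontext.

```python
def fit_to_budget(files, budget):
    """Pick files in order, skipping any that would exceed the token budget."""
    chosen, used = [], 0
    for path, tokens in files:
        if used + tokens <= budget:
            chosen.append(path)
            used += tokens
    return chosen, used


files = [
    ("README.md", 300),
    ("src/main.py", 1200),
    ("src/big.py", 9000),
    ("src/util.py", 400),
]
chosen, used = fit_to_budget(files, budget=2000)
print(chosen, used)
```

This greedy pass keeps the input order, so listing the most important files first decides which ones survive when the budget is tight.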

Providers

GitHubProvider

from repocontext import GitHubProvider

provider = GitHubProvider()

# Set credentials (optional for public repos)
provider.set_credentials("ghp_your_token_here")

# Get full repository result (await calls must run inside an async function)
result = await provider.get_repository(
    url="https://github.com/owner/repo",
    token=None,
    fetch_content=False,
    branch=None  # Optional branch override
)

# Fetch tree only
nodes = await provider.fetch_tree(url, branch="main", path="src")

# Fetch multiple files with concurrency
async for content in provider.fetch_multiple(file_nodes):
    print(content.path, len(content.text))
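
One common way to bound concurrency when fetching many files is an `asyncio.Semaphore`. The sketch below shows that general pattern; whether `fetch_multiple` is implemented this way is not stated here. `fetch_all` and `fake_fetch` are hypothetical names, and unlike `fetch_multiple` this version returns a list rather than yielding results as they arrive.

```python
import asyncio


async def fetch_all(paths, fetch_one, max_concurrency=5):
    """Run fetch_one(path) for every path, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path):
        async with semaphore:
            return await fetch_one(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(bounded(p) for p in paths))


stats = {"active": 0, "peak": 0}


async def fake_fetch(path):
    """Hypothetical stand-in for a network request; records peak concurrency."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return f"contents of {path}"


results = asyncio.run(fetch_all(["a.py", "b.py", "c.py"], fake_fetch, max_concurrency=2))
print(results, stats["peak"])
```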

Types

FileNode

from repocontext import FileNode

node = FileNode(path="src/main.py", type="blob", size=1024, sha="abc123")

node.is_file()        # True
node.is_directory()   # False
node.get_name()       # "main.py"
node.get_extension()  # ".py"

TreeNode

from repocontext import TreeNode

node = TreeNode(
    name="src",
    path="src",
    type="directory",
    children=[...],
    selected=True
)

node.is_file()        # False
node.is_directory()   # True

FileContent

from repocontext import FileContent

content = FileContent(
    path="src/main.py",
    text="print('hello')",
    url="https://...",
    line_count=1,
    token_count=3
)

Formatter

from repocontext import Formatter

# Count tokens
tokens = Formatter.count_tokens("hello world")

# Format project tree
tree_str = Formatter.format_project_tree(tree_nodes)

# Format as markdown
markdown = Formatter.format_markdown(tree, contents)
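
As an illustration of the kind of output a project tree formatter produces, here is a minimal sketch that renders a nested dict as an ASCII tree. The exact layout produced by `Formatter.format_project_tree` may differ; `render_tree` and its input shape (directories as dicts, files as None) are assumptions made for this example.

```python
def render_tree(tree, prefix=""):
    """Render a nested dict (directories are dicts, files are None) as lines."""
    lines = []
    # Directories first, then files, each group in alphabetical order
    names = sorted(tree, key=lambda name: (tree[name] is None, name))
    for index, name in enumerate(names):
        last = index == len(names) - 1
        lines.append(prefix + ("`-- " if last else "|-- ") + name)
        child = tree[name]
        if child is not None:
            # Keep the vertical rule only while siblings remain below
            lines.extend(render_tree(child, prefix + ("    " if last else "|   ")))
    return lines


example = {"src": {"main.py": None, "pkg": {"util.py": None}}, "README.md": None}
print("\n".join(render_tree(example)))
```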

Tree Building

from repocontext import build_tree, extract_directories

# Build hierarchical tree from flat nodes
tree = build_tree(
    nodes,
    selected_paths=set_of_selected_paths,
    excluded_paths=set_of_excluded_paths,
    expanded_paths=set_of_expanded_directories
)

# Extract all directory paths
dirs = extract_directories(nodes)
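
Conceptually, extracting directories means collecting every ancestor of every file path. A minimal sketch of that idea, working on plain path strings rather than FileNode objects; `directories_of` is a hypothetical name, and the return format of `extract_directories` itself is not specified here.

```python
from pathlib import PurePosixPath


def directories_of(file_paths):
    """Return the sorted set of all ancestor directories of the given paths."""
    directories = set()
    for path in file_paths:
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":  # "." is the repository root, not a directory entry
                directories.add(str(parent))
    return sorted(directories)


dirs = directories_of(["src/pkg/util.py", "src/main.py", "README.md"])
print(dirs)
```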

Exception Handling

from repocontext import (
    fetch,
    InvalidURLError,
    AuthenticationError,
    NotFoundError,
    RateLimitError,
    NetworkError,
)

try:
    result = fetch("https://github.com/owner/repo")
except InvalidURLError as e:
    print(f"Invalid URL: {e.user_message}")
except AuthenticationError as e:
    print(f"Auth failed: {e.user_message}")
except NotFoundError as e:
    # Assumes NotFoundError exposes user_message like the other errors
    print(f"Not found: {e.user_message}")
except RateLimitError as e:
    print(f"Rate limited: {e.user_message}")
except NetworkError as e:
    print(f"Network error: {e.user_message}")
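
Rate limit errors are often transient, so a caller may prefer to retry instead of failing immediately. The sketch below shows retry with exponential backoff. The local `RateLimitError` class is a stand-in so the example runs without the library, and `with_retries` is a hypothetical helper, not something repocontext provides.

```python
import time


class RateLimitError(Exception):
    """Stand-in for repocontext.RateLimitError so this sketch is self-contained."""


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call call(); on RateLimitError wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky():
    """Fails twice with a rate limit, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("slow down")
    return "ok"


delays = []
# Record the delays instead of sleeping so the demo finishes instantly
outcome = with_retries(flaky, sleep=delays.append)
print(outcome, delays)
```

In real use the wrapped call would be something like `lambda: fetch(url)`, with `RateLimitError` imported from repocontext.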

Extending the Package

Adding a New Provider

Create a new provider by extending BaseProvider:

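from repocontext import FileContent, FileNode
from repocontext.providers import BaseProvider, register_provider
# ParsedRepoInfo must also be imported; its module path is not shown here.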

@register_provider("gitlab")
class GitLabProvider(BaseProvider):
    API_BASE = "https://gitlab.com/api/v4"

    @property
    def get_type(self) -> str:
        return "gitlab"

    @property
    def get_name(self) -> str:
        return "GitLab"

    def requires_auth(self) -> bool:
        return True

    def validate_url(self, url: str) -> bool:
        return url.startswith("https://gitlab.com/")

    def parse_url(self, url: str) -> ParsedRepoInfo:
        # Parse the URL and return ParsedRepoInfo
        ...

    async def _fetch_tree(self, url: str, **options) -> list[FileNode]:
        # Fetch the repository tree
        ...

    async def _fetch_file_content(self, node: FileNode) -> FileContent:
        # Fetch a single file's content
        ...
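
The `parse_url` step is left open above. Here is a minimal sketch of how a gitlab.com URL could be split into a namespace and project name using only the standard library. `RepoInfo` is a hypothetical stand-in for ParsedRepoInfo (whose real fields are not shown here), and `parse_gitlab_url` is an illustrative name.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class RepoInfo:
    """Hypothetical stand-in for ParsedRepoInfo; the real fields may differ."""
    namespace: str
    project: str


def parse_gitlab_url(url: str) -> RepoInfo:
    """Split https://gitlab.com/<namespace>/<project> into its parts."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "gitlab.com":
        raise ValueError(f"not a gitlab.com URL: {url}")
    parts = [segment for segment in parsed.path.split("/") if segment]
    if "-" in parts:
        # GitLab separates the project path from views like /-/tree/main
        parts = parts[: parts.index("-")]
    if len(parts) < 2:
        raise ValueError(f"missing namespace or project: {url}")
    project = parts[-1]
    if project.endswith(".git"):
        project = project[: -len(".git")]
    # Namespaces can be nested groups, e.g. group/subgroup
    return RepoInfo(namespace="/".join(parts[:-1]), project=project)


print(parse_gitlab_url("https://gitlab.com/group/sub/project.git"))
```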

License

MIT
