llama-index readers github integration

LlamaIndex Readers Integration: Github

pip install llama-index-readers-github

The GitHub readers package provides three separate readers:

  1. Repository Reader
  2. Issues Reader
  3. Collaborators Reader

Authentication

The readers support two authentication methods:

1. Personal Access Token (PAT)

Generate a token under your account settings at https://github.com/settings/tokens

from llama_index.readers.github import GithubClient

# Direct token
client = GithubClient(github_token="ghp_your_token_here")

# Or via environment variable
import os

os.environ["GITHUB_TOKEN"] = "ghp_your_token_here"
client = GithubClient()  # Automatically uses GITHUB_TOKEN
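
The two options above can be combined into a single lookup that fails early with a clear message. The sketch below is plain Python; the helper name `resolve_github_token` is hypothetical and not part of the package, and it assumes an explicitly passed token should take precedence over the `GITHUB_TOKEN` environment variable.

```python
import os
from typing import Optional


def resolve_github_token(explicit_token: Optional[str] = None) -> str:
    """Return the explicit token if given, else fall back to GITHUB_TOKEN.

    Hypothetical helper for illustration; not part of llama-index-readers-github.
    """
    token = explicit_token or os.environ.get("GITHUB_TOKEN")
    if not token:
        raise ValueError(
            "No GitHub token found: pass one explicitly or set GITHUB_TOKEN"
        )
    return token
```

The returned value could then be passed as `github_token=` to `GithubClient`, which keeps the missing-token error close to where the configuration is read.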

2. GitHub App Authentication

For better security, rate limits, and organization-level access, use GitHub App authentication:

from llama_index.readers.github import GithubClient, GitHubAppAuth

# Load your GitHub App private key
with open("path/to/private-key.pem", "r") as f:
    private_key = f.read()

# Create GitHub App auth handler
app_auth = GitHubAppAuth(
    app_id="123456",  # Your GitHub App ID
    private_key=private_key,  # Private key content (PEM format)
    installation_id="789012",  # Installation ID for the target org/repo
)

# Use with any client
client = GithubClient(github_app_auth=app_auth)

GitHub App support requires an optional extra. Quote the argument so that shells such as zsh do not try to expand the brackets:

pip install "llama-index-readers-github[github-app]"

Benefits of GitHub App authentication:

  • Scalable rate limits: at least 5,000 requests/hour per installation, rising with the number of repositories and users in the installation (a PAT is capped at 5,000/hour)
  • Fine-grained permissions: Repository-specific access control
  • Better security: Tokens auto-expire after 1 hour
  • Organization-level: Can be installed across multiple repositories
  • Auditability: Actions attributed to the app, not individual users

Repository Reader

This reader walks a repository and loads its files as documents. You can filter by directory, file extension, or file path, and supply a callback for custom per-file processing logic.

Basic Usage

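import os

from llama_index.readers.github import GithubRepositoryReader, GithubClient

# Read the token from the environment instead of hard-coding it
github_token = os.environ["GITHUB_TOKEN"]
github_client = GithubClient(github_token=github_token, verbose=False)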

reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    use_parser=False,
    verbose=True,
    filter_directories=(
        ["docs"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
    filter_file_extensions=(
        [
            ".png",
            ".jpg",
            ".jpeg",
            ".gif",
            ".svg",
            ".ico",
            ".json",
            ".ipynb",
        ],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
)

documents = reader.load_data(branch="main")
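
Extension filters are normally written with a leading dot, and an entry without one is easy to overlook. The sketch below normalizes a list before it is handed to `filter_file_extensions`. The helper `normalize_extensions` is hypothetical, and it assumes the reader compares entries against extensions that include the dot.

```python
from typing import Iterable, List


def normalize_extensions(extensions: Iterable[str]) -> List[str]:
    """Ensure every extension starts with a dot, e.g. "json" -> ".json".

    Hypothetical helper for illustration; not part of the package.
    """
    normalized = []
    for ext in extensions:
        ext = ext.strip()
        if not ext:
            continue  # ignore empty entries
        normalized.append(ext if ext.startswith(".") else f".{ext}")
    return normalized


excluded = normalize_extensions(["json", ".png", " ipynb "])
```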

Advanced Filtering Options

Filter Specific File Paths

# Include only specific files
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_paths=(
        ["README.md", "src/main.py", "docs/guide.md"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
)

# Exclude specific files
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_paths=(
        ["tests/test_file.py", "temp/cache.txt"],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
)
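
To make the include/exclude semantics concrete, here is a minimal stand-alone sketch of how a `(paths, filter_type)` tuple could be evaluated. It assumes exact path matching, and `FilterType` here is a stand-in enum, not the real `GithubRepositoryReader.FilterType`; the package's actual matching rules may differ.

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class FilterType(Enum):
    """Stand-in for GithubRepositoryReader.FilterType (illustration only)."""

    INCLUDE = "include"
    EXCLUDE = "exclude"


def path_passes(
    file_path: str,
    filter_file_paths: Optional[Tuple[Sequence[str], FilterType]],
) -> bool:
    """Return True if file_path survives the filter (assumed exact match)."""
    if filter_file_paths is None:
        return True  # no filter configured: keep everything
    paths, filter_type = filter_file_paths
    listed = file_path in paths
    # INCLUDE keeps only listed paths; EXCLUDE keeps everything else.
    return listed if filter_type is FilterType.INCLUDE else not listed
```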

Custom File Processing Callback

def process_file_callback(file_path: str, file_size: int) -> tuple[bool, str]:
    """Custom logic to determine if a file should be processed.

    Args:
        file_path: The full path to the file
        file_size: The size of the file in bytes

    Returns:
        Tuple of (should_process: bool, reason: str)
    """
    # Skip large files
    if file_size > 1024 * 1024:  # 1MB
        return False, f"File too large: {file_size} bytes"

    # Skip test files
    if "test" in file_path.lower():
        return False, "Skipping test files"

    # Skip binary files by extension
    binary_extensions = [".exe", ".bin", ".so", ".dylib"]
    if any(file_path.endswith(ext) for ext in binary_extensions):
        return False, "Skipping binary files"

    return True, ""


reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    process_file_callback=process_file_callback,
    fail_on_error=False,  # Continue processing if callback fails
)
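
One caveat with a substring check such as `"test" in file_path.lower()` is that it also matches paths like `docs/latest/guide.md`. The variant below is a sketch that follows the same `(file_path, file_size) -> (bool, str)` contract but inspects path components instead. The function name `should_process` and the naming conventions it checks (`test`/`tests` directories, `test_*` and `*_test.py` files) are assumptions chosen for illustration.

```python
from pathlib import PurePosixPath
from typing import Tuple

MAX_FILE_SIZE = 1024 * 1024  # 1 MB


def should_process(file_path: str, file_size: int) -> Tuple[bool, str]:
    """Decide whether a file should be processed, with a reason if not."""
    if file_size > MAX_FILE_SIZE:
        return False, f"File too large: {file_size} bytes"

    path = PurePosixPath(file_path)
    # Match whole directory names, not substrings such as "latest".
    in_test_dir = any(
        part.lower() in {"test", "tests"} for part in path.parts[:-1]
    )
    name = path.name.lower()
    is_test_file = name.startswith("test_") or name.endswith("_test.py")
    if in_test_dir or is_test_file:
        return False, "Skipping test files"

    return True, ""
```

A function like this could be passed as `process_file_callback` in place of the one above.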

Custom Parsers and Custom Folder for Temporary Files

from llama_index.core.readers.base import BaseReader


# Custom parser for specific file types
class CustomMarkdownParser(BaseReader):
    def load_data(self, file_path, extra_info=None):
        # Custom parsing logic here
        pass


reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    use_parser=True,
    custom_parsers={".md": CustomMarkdownParser()},
    custom_folder="/tmp/github_processing",  # Custom temp directory
)

Event System Integration

The reader integrates with LlamaIndex's instrumentation system to provide detailed events during processing:

from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.readers.github.repository.event import (
    GitHubFileProcessedEvent,
    GitHubFileSkippedEvent,
    GitHubFileFailedEvent,
    GitHubRepositoryProcessingStartedEvent,
    GitHubRepositoryProcessingCompletedEvent,
)


class GitHubEventHandler(BaseEventHandler):
    def handle(self, event):
        if isinstance(event, GitHubRepositoryProcessingStartedEvent):
            print(f"Started processing repository: {event.repository_name}")
        elif isinstance(event, GitHubFileProcessedEvent):
            print(
                f"Processed file: {event.file_path} ({event.file_size} bytes)"
            )
        elif isinstance(event, GitHubFileSkippedEvent):
            print(f"Skipped file: {event.file_path} - {event.reason}")
        elif isinstance(event, GitHubFileFailedEvent):
            print(f"Failed to process file: {event.file_path} - {event.error}")
        elif isinstance(event, GitHubRepositoryProcessingCompletedEvent):
            print(
                f"Completed processing. Total documents: {event.total_documents}"
            )


# Register the event handler
dispatcher = get_dispatcher()
handler = GitHubEventHandler()
dispatcher.add_event_handler(handler)

# Use the reader - events will be automatically dispatched
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
)
documents = reader.load_data(branch="main")

Available Events

The following events are dispatched during repository processing:

  • GitHubRepositoryProcessingStartedEvent: Fired when repository processing begins

    • repository_name: Name of the repository (owner/repo)
    • branch_or_commit: Branch name or commit SHA being processed
  • GitHubRepositoryProcessingCompletedEvent: Fired when repository processing completes

    • repository_name: Name of the repository
    • branch_or_commit: Branch name or commit SHA
    • total_documents: Number of documents created
  • GitHubTotalFilesToProcessEvent: Fired with the total count of files to be processed

    • repository_name: Name of the repository
    • branch_or_commit: Branch name or commit SHA
    • total_files: Total number of files found
  • GitHubFileProcessingStartedEvent: Fired when individual file processing starts

    • file_path: Path to the file being processed
    • file_type: File extension
  • GitHubFileProcessedEvent: Fired when a file is successfully processed

    • file_path: Path to the processed file
    • file_type: File extension
    • file_size: Size of the file in bytes
    • document: The created Document object
  • GitHubFileSkippedEvent: Fired when a file is skipped

    • file_path: Path to the skipped file
    • file_type: File extension
    • reason: Reason why the file was skipped
  • GitHubFileFailedEvent: Fired when file processing fails

    • file_path: Path to the failed file
    • file_type: File extension
    • error: Error message describing the failure
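
Beyond printing, a handler can aggregate these events into a summary. The sketch below counts events by class name and tallies skip reasons via the `reason` field listed above. `EventTally` and the stub event classes are hypothetical stand-ins so the example runs without the package installed; wiring the same logic into a `BaseEventHandler` subclass is left as an exercise.

```python
from collections import Counter
from dataclasses import dataclass


class EventTally:
    """Counts events by class name; hypothetical helper, not part of the package."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.skip_reasons: Counter = Counter()

    def handle(self, event: object) -> None:
        self.counts[type(event).__name__] += 1
        # Only skip events carry a `reason` attribute.
        reason = getattr(event, "reason", None)
        if reason:
            self.skip_reasons[reason] += 1


# Stand-in event types (assumptions) mirroring the documented fields.
@dataclass
class StubFileProcessed:
    file_path: str


@dataclass
class StubFileSkipped:
    file_path: str
    reason: str


tally = EventTally()
for event in [
    StubFileProcessed("docs/index.md"),
    StubFileSkipped("logo.png", "excluded extension"),
    StubFileSkipped("icon.svg", "excluded extension"),
]:
    tally.handle(event)
```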

Issues Reader

from llama_index.readers.github import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)

github_client = GitHubIssuesClient(github_token=github_token, verbose=True)

reader = GitHubRepositoryIssuesReader(
    github_client=github_client,
    owner="moncho",
    repo="dry",
    verbose=True,
)

documents = reader.load_data(
    state=GitHubRepositoryIssuesReader.IssueState.ALL,
    labelFilters=[("bug", GitHubRepositoryIssuesReader.FilterType.INCLUDE)],
)

Collaborators Reader

from llama_index.readers.github import (
    GitHubRepositoryCollaboratorsReader,
    GitHubCollaboratorsClient,
)

github_client = GitHubCollaboratorsClient(
    github_token=github_token, verbose=True
)

reader = GitHubRepositoryCollaboratorsReader(
    github_client=github_client,
    owner="moncho",
    repo="dry",
    verbose=True,
)

documents = reader.load_data()

GitHub App Setup Guide

To create and configure a GitHub App for authentication:

1. Create a GitHub App

  1. Go to your GitHub account settings → Developer settings → GitHub Apps → New GitHub App
  2. Fill in the required information:
    • GitHub App name: Choose a unique name (e.g., "My LlamaIndex Reader")
    • Homepage URL: Your application or organization URL
    • Webhook: Uncheck "Active" (not needed for this use case)

2. Set Permissions

Under Repository permissions, set:

  • Contents: Read-only (to read repository files)
  • Metadata: Read-only (required automatically)
  • Issues: Read-only (if using Issues reader)
  • Pull requests: Read-only (issues endpoint includes PRs)

3. Install the App

  1. After creating the app, note your App ID (shown at the top)
  2. Generate a private key:
    • Scroll down to "Private keys"
    • Click "Generate a private key"
    • Save the downloaded .pem file securely
  3. Install the app:
    • Click "Install App" in the left sidebar
    • Choose the account/organization
    • Select specific repositories or all repositories
    • Complete installation

4. Get Installation ID

After installation, you'll be redirected to a URL like:

https://github.com/settings/installations/12345678

The number 12345678 is your installation ID. You can also find it via the API:

curl -H "Authorization: Bearer YOUR_JWT_TOKEN" \
     https://api.github.com/app/installations
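
The `YOUR_JWT_TOKEN` in the request above is a short-lived JWT signed with the app's private key. The sketch below shows one way to build it, assuming PyJWT (with its RSA support) is installed. The function names are hypothetical; the claim values follow GitHub's guidance of backdating `iat` by 60 seconds and keeping `exp` within 10 minutes.

```python
import time
from typing import Dict, Optional, Union


def build_app_jwt_claims(
    app_id: str, now: Optional[int] = None
) -> Dict[str, Union[int, str]]:
    """Build the JWT claims for authenticating as a GitHub App."""
    issued_at = int(time.time()) if now is None else now
    return {
        "iat": issued_at - 60,  # backdated to tolerate clock drift
        "exp": issued_at + 540,  # 9 minutes ahead, under the 10-minute cap
        "iss": app_id,  # the GitHub App ID
    }


def sign_app_jwt(app_id: str, private_key_pem: str) -> str:
    """Sign the claims with RS256; requires PyJWT and an RSA private key."""
    import jwt  # imported lazily so the claims helper works without PyJWT

    return jwt.encode(
        build_app_jwt_claims(app_id), private_key_pem, algorithm="RS256"
    )
```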

5. Use in Code

from llama_index.readers.github import GithubClient, GitHubAppAuth

# Load private key
with open("path/to/your-app-private-key.pem", "r") as f:
    private_key = f.read()

# Create auth handler
app_auth = GitHubAppAuth(
    app_id="YOUR_APP_ID",
    private_key=private_key,
    installation_id="YOUR_INSTALLATION_ID",
)

# Use with any client
client = GithubClient(github_app_auth=app_auth)

Token Management

The GitHubAppAuth class automatically:

  • Generates JWTs for app authentication
  • Obtains installation access tokens
  • Caches tokens (valid for 1 hour)
  • Refreshes tokens automatically when they expire or are within 5 minutes of expiry

You can manually invalidate a token if needed:

app_auth.invalidate_token()  # Forces refresh on next request
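
The caching behaviour described above can be sketched in a few lines. This is not the `GitHubAppAuth` implementation: `CachedToken` is a hypothetical class, and the token fetch is injected as a callable, so the refresh logic (reuse until 5 minutes before expiry, refetch after invalidation) can be seen in isolation.

```python
import time
from typing import Callable, Optional, Tuple


class CachedToken:
    """Caches a token and refreshes it shortly before expiry (sketch only)."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # returns (token, expires_at)
        refresh_margin: float = 300.0,  # 5 minutes, as described above
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if needed."""
        stale = self._clock() >= self._expires_at - self._margin
        if self._token is None or stale:
            self._token, self._expires_at = self._fetch()
        return self._token

    def invalidate(self) -> None:
        """Drop the cached token so the next get() fetches a fresh one."""
        self._token = None
```

Injecting the clock keeps the refresh window testable without waiting for a real token to expire.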

Troubleshooting

"Failed to get installation token: 401"

  • Verify your App ID is correct
  • Ensure the private key matches your GitHub App
  • Check that the app is installed for the target repository
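
Some of these causes can be caught before any request is made. The pre-flight check below is a sketch with a hypothetical helper name; it assumes the App ID and installation ID are numeric strings and that the key is PEM text. It cannot confirm that the key actually belongs to the app.

```python
from typing import List


def check_app_credentials(
    app_id: str, private_key: str, installation_id: str
) -> List[str]:
    """Return a list of obvious problems with GitHub App credentials."""
    problems = []
    if not app_id.isdigit():
        problems.append("app_id should be numeric")
    if not installation_id.isdigit():
        problems.append("installation_id should be numeric")
    key = private_key.strip()
    # A PEM private key starts with a BEGIN line and ends with an END line.
    looks_like_pem = (
        key.startswith("-----BEGIN ")
        and "PRIVATE KEY-----" in key
        and key.endswith("-----")
    )
    if not looks_like_pem:
        problems.append("private_key does not look like a PEM private key")
    return problems
```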

"Failed to get installation token: 404"

  • Verify the installation ID is correct
  • Ensure the app installation wasn't uninstalled

"Import PyJWT failed"

  • Install GitHub App support: pip install "llama-index-readers-github[github-app]"
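
To check for the dependency before constructing `GitHubAppAuth`, a small standard-library probe is enough. The helper name `module_available` is hypothetical; `jwt` is the import name of the PyJWT distribution.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if not module_available("jwt"):
    print("PyJWT is missing; install the github-app extra to enable GitHub App auth")
```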
