LlamaIndex Readers Integration: Github
pip install llama-index-readers-github
The github readers package consists of three separate readers:
- Repository Reader
- Issues Reader
- Collaborators Reader
Authentication
The readers support two authentication methods:
1. Personal Access Token (PAT)
Generate a token under your account settings at https://github.com/settings/tokens
from llama_index.readers.github import GithubClient
# Direct token
client = GithubClient(github_token="ghp_your_token_here")
# Or via environment variable
import os
os.environ["GITHUB_TOKEN"] = "ghp_your_token_here"
client = GithubClient() # Automatically uses GITHUB_TOKEN
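A minimal sketch of resolving the token before constructing the client, so that a missing token fails early with a clear message. The helper name `resolve_github_token` is hypothetical and not part of the package; only the `GITHUB_TOKEN` variable name comes from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_github_token(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit token if given, else the GITHUB_TOKEN variable."""
    env = os.environ if env is None else env
    token = (explicit or env.get("GITHUB_TOKEN", "")).strip()
    if not token:
        raise ValueError(
            "No GitHub token: pass one explicitly or set GITHUB_TOKEN"
        )
    return token
```

The result could then be passed as `GithubClient(github_token=resolve_github_token())`.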
2. GitHub App Authentication
For better security, rate limits, and organization-level access, use GitHub App authentication:
from llama_index.readers.github import GithubClient, GitHubAppAuth
# Load your GitHub App private key
with open("path/to/private-key.pem", "r") as f:
    private_key = f.read()
# Create GitHub App auth handler
app_auth = GitHubAppAuth(
    app_id="123456",  # Your GitHub App ID
    private_key=private_key,  # Private key content (PEM format)
    installation_id="789012",  # Installation ID for the target org/repo
)
# Use with any client
client = GithubClient(github_app_auth=app_auth)
Installation for GitHub App support:
pip install "llama-index-readers-github[github-app]"
Benefits of GitHub App authentication:
- Dedicated rate limits: 5,000 requests/hour per installation, counted separately from the 5,000/hour quota attached to a user's PAT
- Fine-grained permissions: Repository-specific access control
- Better security: Tokens auto-expire after 1 hour
- Organization-level: Can be installed across multiple repositories
- Auditability: Actions attributed to the app, not individual users
Repository Reader
This reader walks a repository, with options to filter by directory, file extension, and file path, and to apply custom processing logic.
Basic Usage
import os
from llama_index.readers.github import GithubRepositoryReader, GithubClient
# Read the token from the environment (see Authentication above)
github_client = GithubClient(
    github_token=os.environ["GITHUB_TOKEN"], verbose=False
)
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    use_parser=False,
    verbose=True,
    filter_directories=(
        ["docs"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
    filter_file_extensions=(
        [
            ".png",
            ".jpg",
            ".jpeg",
            ".gif",
            ".svg",
            ".ico",
            ".json",
            ".ipynb",
        ],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
)
documents = reader.load_data(branch="main")
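To illustrate the include/exclude idea behind `filter_file_extensions`, here is a standalone sketch. It assumes extensions are compared against the file suffix; the reader's actual matching rule is not stated in this README, and `FilterType` and `filter_by_extension` below are hypothetical stand-ins rather than the package's own code.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Iterable, List


class FilterType(Enum):  # hypothetical stand-in, not the reader's own enum
    INCLUDE = "include"
    EXCLUDE = "exclude"


def filter_by_extension(
    paths: Iterable[str], extensions: Iterable[str], mode: FilterType
) -> List[str]:
    """Keep (INCLUDE) or drop (EXCLUDE) paths whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    kept = []
    for path in paths:
        matches = PurePosixPath(path).suffix.lower() in wanted
        # INCLUDE keeps matching paths; EXCLUDE keeps non-matching ones.
        if matches == (mode is FilterType.INCLUDE):
            kept.append(path)
    return kept


paths = ["docs/index.md", "docs/logo.png", "docs/data.json"]
print(filter_by_extension(paths, [".png", "json"], FilterType.EXCLUDE))
# ['docs/index.md', 'docs/data.json']
```

Under suffix comparison, an entry written without its leading dot (such as `json`) never matches, which is why `docs/data.json` survives the exclusion in this run.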
Advanced Filtering Options
Filter Specific File Paths
# Include only specific files
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_paths=(
        ["README.md", "src/main.py", "docs/guide.md"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
)
# Exclude specific files
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_paths=(
        ["tests/test_file.py", "temp/cache.txt"],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
)
Custom File Processing Callback
def process_file_callback(file_path: str, file_size: int) -> tuple[bool, str]:
    """Custom logic to determine if a file should be processed.
    Args:
        file_path: The full path to the file
        file_size: The size of the file in bytes
    Returns:
        Tuple of (should_process: bool, reason: str)
    """
    # Skip large files
    if file_size > 1024 * 1024:  # 1MB
        return False, f"File too large: {file_size} bytes"
    # Skip test files
    if "test" in file_path.lower():
        return False, "Skipping test files"
    # Skip binary files by extension
    binary_extensions = [".exe", ".bin", ".so", ".dylib"]
    if any(file_path.endswith(ext) for ext in binary_extensions):
        return False, "Skipping binary files"
    return True, ""
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    process_file_callback=process_file_callback,
    fail_on_error=False,  # Continue processing if callback fails
)
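When several repositories need slightly different rules, a small factory can build callbacks with the same `(file_path, file_size) -> (should_process, reason)` shape. This is a sketch only; `make_file_callback` and its parameters are hypothetical names, not part of the package. It also compares whole directory names rather than substrings, so a file such as `latest.py` is not mistaken for a test.

```python
from pathlib import PurePosixPath
from typing import Callable, Sequence, Tuple

FileCallback = Callable[[str, int], Tuple[bool, str]]


def make_file_callback(
    max_bytes: int = 1024 * 1024,
    skip_dirs: Sequence[str] = ("tests",),
    skip_suffixes: Sequence[str] = (".exe", ".bin", ".so", ".dylib"),
) -> FileCallback:
    """Build a (file_path, file_size) -> (should_process, reason) callback."""
    skip_dir_set = {d.lower() for d in skip_dirs}
    suffix_set = {s.lower() for s in skip_suffixes}

    def callback(file_path: str, file_size: int) -> Tuple[bool, str]:
        path = PurePosixPath(file_path)
        if file_size > max_bytes:
            return False, f"File too large: {file_size} bytes"
        # Match directory names exactly instead of searching for a substring.
        if any(part.lower() in skip_dir_set for part in path.parts[:-1]):
            return False, "Inside a skipped directory"
        if path.suffix.lower() in suffix_set:
            return False, "Skipping binary files"
        return True, ""

    return callback
```

The returned function could be passed as `process_file_callback=make_file_callback(max_bytes=512 * 1024)`, and it can be unit-tested without any network access.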
Custom Parsers and Folder for Temporary Files
from llama_index.core.readers.base import BaseReader
# Custom parser for specific file types
class CustomMarkdownParser(BaseReader):
    def load_data(self, file_path, extra_info=None):
        # Custom parsing logic here; return a list of Document objects
        return []
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    use_parser=True,
    custom_parsers={".md": CustomMarkdownParser()},
    custom_folder="/tmp/github_processing",  # Custom temp directory
)
Event System Integration
The reader integrates with LlamaIndex's instrumentation system to provide detailed events during processing:
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.readers.github.repository.event import (
GitHubFileProcessedEvent,
GitHubFileSkippedEvent,
GitHubFileFailedEvent,
GitHubRepositoryProcessingStartedEvent,
GitHubRepositoryProcessingCompletedEvent,
)
class GitHubEventHandler(BaseEventHandler):
    def handle(self, event, **kwargs):
        if isinstance(event, GitHubRepositoryProcessingStartedEvent):
            print(f"Started processing repository: {event.repository_name}")
        elif isinstance(event, GitHubFileProcessedEvent):
            print(
                f"Processed file: {event.file_path} ({event.file_size} bytes)"
            )
        elif isinstance(event, GitHubFileSkippedEvent):
            print(f"Skipped file: {event.file_path} - {event.reason}")
        elif isinstance(event, GitHubFileFailedEvent):
            print(f"Failed to process file: {event.file_path} - {event.error}")
        elif isinstance(event, GitHubRepositoryProcessingCompletedEvent):
            print(
                f"Completed processing. Total documents: {event.total_documents}"
            )
# Register the event handler
dispatcher = get_dispatcher()
handler = GitHubEventHandler()
dispatcher.add_event_handler(handler)
# Use the reader - events will be automatically dispatched
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
)
documents = reader.load_data(branch="main")
Available Events
The following events are dispatched during repository processing:
- GitHubRepositoryProcessingStartedEvent: Fired when repository processing begins
  - repository_name: Name of the repository (owner/repo)
  - branch_or_commit: Branch name or commit SHA being processed
- GitHubRepositoryProcessingCompletedEvent: Fired when repository processing completes
  - repository_name: Name of the repository
  - branch_or_commit: Branch name or commit SHA
  - total_documents: Number of documents created
- GitHubTotalFilesToProcessEvent: Fired with the total count of files to be processed
  - repository_name: Name of the repository
  - branch_or_commit: Branch name or commit SHA
  - total_files: Total number of files found
- GitHubFileProcessingStartedEvent: Fired when individual file processing starts
  - file_path: Path to the file being processed
  - file_type: File extension
- GitHubFileProcessedEvent: Fired when a file is successfully processed
  - file_path: Path to the processed file
  - file_type: File extension
  - file_size: Size of the file in bytes
  - document: The created Document object
- GitHubFileSkippedEvent: Fired when a file is skipped
  - file_path: Path to the skipped file
  - file_type: File extension
  - reason: Reason why the file was skipped
- GitHubFileFailedEvent: Fired when file processing fails
  - file_path: Path to the failed file
  - file_type: File extension
  - error: Error message describing the failure
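As an alternative to printing each event, a run can be summarised by counting events per type. The sketch below is duck-typed and uses only the event class names and the `reason` field listed above; `EventTally` is a hypothetical helper, not part of the package, and its `record` method is meant to be called from a handler's `handle` method.

```python
from collections import Counter
from typing import Any


class EventTally:
    """Count events by class name; duck-typed, so no library import is needed."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.skip_reasons: Counter = Counter()

    def record(self, event: Any) -> None:
        name = type(event).__name__
        self.counts[name] += 1
        if name == "GitHubFileSkippedEvent":
            # `reason` is the field documented for skipped-file events.
            self.skip_reasons[getattr(event, "reason", "unknown")] += 1

    def summary(self) -> str:
        return ", ".join(f"{n}={c}" for n, c in sorted(self.counts.items()))
```

After `load_data` returns, `summary()` gives a one-line overview and `skip_reasons` shows which skip rules fired most often.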
Issues Reader
import os
from llama_index.readers.github import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)
# Read the token from the environment (see Authentication above)
github_client = GitHubIssuesClient(
    github_token=os.environ["GITHUB_TOKEN"], verbose=True
)
reader = GitHubRepositoryIssuesReader(
    github_client=github_client,
    owner="moncho",
    repo="dry",
    verbose=True,
)
documents = reader.load_data(
    state=GitHubRepositoryIssuesReader.IssueState.ALL,
    labelFilters=[("bug", GitHubRepositoryIssuesReader.FilterType.INCLUDE)],
)
Collaborators Reader
import os
from llama_index.readers.github import (
    GitHubRepositoryCollaboratorsReader,
    GitHubCollaboratorsClient,
)
# Read the token from the environment (see Authentication above)
github_client = GitHubCollaboratorsClient(
    github_token=os.environ["GITHUB_TOKEN"], verbose=True
)
reader = GitHubRepositoryCollaboratorsReader(
    github_client=github_client,
    owner="moncho",
    repo="dry",
    verbose=True,
)
documents = reader.load_data()
GitHub App Setup Guide
To create and configure a GitHub App for authentication:
1. Create a GitHub App
- Go to your GitHub account settings → Developer settings → GitHub Apps → New GitHub App
- Fill in the required information:
  - GitHub App name: Choose a unique name (e.g., "My LlamaIndex Reader")
  - Homepage URL: Your application or organization URL
  - Webhook: Uncheck "Active" (not needed for this use case)
2. Set Permissions
Under Repository permissions, set:
- Contents: Read-only (to read repository files)
- Metadata: Read-only (required automatically)
- Issues: Read-only (if using Issues reader)
- Pull requests: Read-only (issues endpoint includes PRs)
3. Install the App
- After creating the app, note your App ID (shown at the top)
- Generate a private key:
  - Scroll down to "Private keys"
  - Click "Generate a private key"
  - Save the downloaded .pem file securely
- Install the app:
  - Click "Install App" in the left sidebar
  - Choose the account/organization
  - Select specific repositories or all repositories
  - Complete installation
4. Get Installation ID
After installation, you'll be redirected to a URL like:
https://github.com/settings/installations/12345678
The number 12345678 is your installation ID. You can also find it via the API:
curl -H "Authorization: Bearer YOUR_JWT_TOKEN" \
https://api.github.com/app/installations
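The `YOUR_JWT_TOKEN` placeholder in the request above is a short-lived JWT signed with the app's private key. The sketch below shows one way to build it, assuming PyJWT with RSA support is installed. The function names are hypothetical, and the timing values follow GitHub's documented limits (issued-at slightly in the past, expiry at most 10 minutes ahead) rather than anything this package prescribes.

```python
import time
from typing import Dict, Optional


def build_app_jwt_claims(
    app_id: str, now: Optional[int] = None
) -> Dict[str, object]:
    """Return the claims for a GitHub App JWT."""
    now = int(time.time()) if now is None else now
    return {
        "iat": now - 60,   # backdated 60 s to tolerate clock drift
        "exp": now + 540,  # 9 minutes ahead; GitHub allows at most 10
        "iss": app_id,     # the App ID noted during setup
    }


def sign_app_jwt(app_id: str, private_key_pem: str) -> str:
    """Sign the claims with RS256 (needs PyJWT plus an RSA backend)."""
    import jwt  # PyJWT; imported lazily so the claims helper works without it

    return jwt.encode(
        build_app_jwt_claims(app_id), private_key_pem, algorithm="RS256"
    )
```

The string returned by `sign_app_jwt` is what goes after `Bearer` in the `Authorization` header.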
5. Use in Code
from llama_index.readers.github import GithubClient, GitHubAppAuth
# Load private key
with open("path/to/your-app-private-key.pem", "r") as f:
    private_key = f.read()
# Create auth handler
app_auth = GitHubAppAuth(
    app_id="YOUR_APP_ID",
    private_key=private_key,
    installation_id="YOUR_INSTALLATION_ID",
)
# Use with any client
client = GithubClient(github_app_auth=app_auth)
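A mismatched or truncated key file is a common cause of authentication failures, so it can help to check the file before handing it to `GitHubAppAuth`. This is a minimal sketch; `load_private_key` is a hypothetical helper and only checks the PEM header line, not the key's validity.

```python
from pathlib import Path


def load_private_key(path: str) -> str:
    """Read a PEM private key and fail early if it does not look like one."""
    raw = Path(path).read_text(encoding="utf-8")
    stripped = raw.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    looks_like_pem = first_line.startswith("-----BEGIN ") and first_line.endswith(
        "PRIVATE KEY-----"
    )
    if not looks_like_pem:
        raise ValueError(f"{path} does not look like a PEM private key")
    return raw
```

The returned text can be passed directly as the `private_key` argument shown above.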
Token Management
The GitHubAppAuth class automatically:
- Generates JWTs for app authentication
- Obtains installation access tokens
- Caches tokens (valid for 1 hour)
- Refreshes tokens automatically when they expire or are within 5 minutes of expiry
You can manually invalidate a token if needed:
app_auth.invalidate_token() # Forces refresh on next request
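The caching behaviour described above (one-hour tokens, refreshed when within five minutes of expiry) can be sketched as follows. This illustrates the idea only and is not the implementation used by `GitHubAppAuth`; `CachedToken`, `fetch`, and `clock` are hypothetical names.

```python
import time
from typing import Callable, Optional

TOKEN_LIFETIME_S = 3600  # installation tokens are valid for 1 hour
REFRESH_MARGIN_S = 300   # refresh when within 5 minutes of expiry


class CachedToken:
    """Cache a token and refetch it shortly before it expires."""

    def __init__(
        self, fetch: Callable[[], str], clock: Callable[[], float] = time.time
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        near_expiry = self._clock() >= self._expires_at - REFRESH_MARGIN_S
        if self._token is None or near_expiry:
            self._token = self._fetch()
            self._expires_at = self._clock() + TOKEN_LIFETIME_S
        return self._token

    def invalidate(self) -> None:
        """Drop the cached token so the next get() fetches a fresh one."""
        self._token = None
```

Injecting the clock keeps the refresh rule testable without waiting for a real token to age.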
Troubleshooting
"Failed to get installation token: 401"
- Verify your App ID is correct
- Ensure the private key matches your GitHub App
- Check that the app is installed for the target repository
"Failed to get installation token: 404"
- Verify the installation ID is correct
- Ensure the app installation wasn't uninstalled
"Import PyJWT failed"
- Install GitHub App support:
pip install "llama-index-readers-github[github-app]"
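To confirm whether PyJWT is available in the active environment before retrying, a quick standard-library check can be used. This is a sketch; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


if not has_module("jwt"):  # PyJWT installs under the import name "jwt"
    print("PyJWT is missing; install the github-app extra shown above")
```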
Project details
File details
Details for the file llama_index_readers_github-0.10.0.tar.gz.
File metadata
- Download URL: llama_index_readers_github-0.10.0.tar.gz
- Upload date:
- Size: 25.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 99e54de665429e8024984610553ce043750c56b1f55b88e57aa404cf34ef40d2 |
| MD5 | ede8decb204facdf8d2d851e5e210555 |
| BLAKE2b-256 | 18c0c685d50da2639d1981e10b7447de28f71213e1b5da62860f2657f809da00 |
File details
Details for the file llama_index_readers_github-0.10.0-py3-none-any.whl.
File metadata
- Download URL: llama_index_readers_github-0.10.0-py3-none-any.whl
- Upload date:
- Size: 33.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0c33a028d2a534bbb911744b0ccb676b8e1ea93d19f98a7f3c9e9ed4bfb52162 |
| MD5 | 06dbfabd5a56050b52cd3c2f0d71d8f9 |
| BLAKE2b-256 | d0efdb850e3294f423241db83737d071e351f28ed4af2424db36a9a09de0b4b4 |