
A tool to extract GitHub repositories into a single file


GitHub Repository Extractor

This Python script extracts the contents of a GitHub repository into a single text file. Packing an entire codebase into one file makes it easy to pass to Large Language Models (LLMs) that have large context windows.

Features

  • Support for both local and remote GitHub repositories
  • Flexible ignore and include lists for files, folders, and extensions
  • Progress bar to track the extraction process
  • Handles binary files
  • Option to clone remote repositories temporarily
  • Ideal for preparing codebases for analysis by LLMs
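How the script detects binary files is not documented here. As a rough illustration only, one common heuristic treats content as binary when its first bytes contain a NUL byte. The function name `looks_binary` and the sample size below are hypothetical and are not part of this package.

```python
def looks_binary(sample: bytes, sample_size: int = 8192) -> bool:
    """Heuristic sketch (an assumption, not this package's logic):
    text files almost never contain a NUL byte, so finding one in the
    first `sample_size` bytes suggests binary content."""
    return b"\x00" in sample[:sample_size]


def file_looks_binary(path: str, sample_size: int = 8192) -> bool:
    """Apply the same heuristic to the start of a file on disk."""
    with open(path, "rb") as handle:
        return looks_binary(handle.read(sample_size), sample_size)
```

A heuristic like this can misclassify unusual encodings such as UTF-16, so it is a starting point rather than a guarantee.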

Dependencies

This project requires the following Python packages:

  • pygithub: For interacting with the GitHub API
  • tqdm: For displaying progress bars
  • gitpython: For handling Git operations

You can install these dependencies using pip:

pip install pygithub tqdm gitpython
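Note that the pip names differ from the import names for two of these packages: pygithub is imported as `github` and gitpython as `git`. A small sketch for checking that all three are importable, using only the standard library (the helper name `missing_packages` is hypothetical):

```python
from importlib.util import find_spec

# pip package name -> module name used in an import statement
REQUIRED = {"pygithub": "github", "tqdm": "tqdm", "gitpython": "git"}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```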

Installation

  1. Clone this repository:
    git clone https://github.com/yourusername/github-repo-extractor.git
    
  2. From the cloned directory, install the required dependencies:
    pip install -r requirements.txt
    

Usage

  1. Import the GitHubRepoExtractor class from the github_repo_extractor module.
  2. Create an instance of GitHubRepoExtractor with your repository details.
  3. Set ignore and include lists as needed.
  4. Call the extract_to_file() method to start the extraction process.

Example:

from github_repo_extractor import GitHubRepoExtractor

extractor = GitHubRepoExtractor(
    repo_input='https://github.com/username/repo.git',
    access_token='your_github_token'
)

extractor.set_ignore_list(
    files=['.gitignore'],
    folders=['tests', '.github'],
    extensions=['.log']
)

extractor.set_include_list(
    files=['README.md'],
    extensions=['.py'],
    exclusive=True
)

extractor.extract_to_file('output.txt')
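The exact precedence between the ignore and include lists is not spelled out above. The sketch below shows one plausible interpretation, assuming that ignore rules are checked first and that `exclusive=True` keeps only files matching the include list. The function `should_extract` is hypothetical and is not the package's implementation.

```python
from pathlib import PurePosixPath


def should_extract(path, *, ignore_files=(), ignore_folders=(),
                   ignore_extensions=(), include_files=(),
                   include_extensions=(), exclusive=False):
    """Decide whether a repository-relative path would be extracted.

    Assumed semantics (not confirmed by the package):
    1. A path matching any ignore rule is always skipped.
    2. With exclusive=True, only paths matching the include list pass.
    3. With exclusive=False, everything not ignored passes.
    """
    p = PurePosixPath(path)
    ignored = (
        p.name in ignore_files
        or p.suffix in ignore_extensions
        or any(part in ignore_folders for part in p.parts[:-1])
    )
    if ignored:
        return False
    included = p.name in include_files or p.suffix in include_extensions
    return included if exclusive else True


# Same rules as the example above
rules = dict(
    ignore_files=[".gitignore"], ignore_folders=["tests", ".github"],
    ignore_extensions=[".log"], include_files=["README.md"],
    include_extensions=[".py"], exclusive=True,
)
for candidate in ["src/app.py", "tests/test_app.py", "README.md", "docs/guide.md"]:
    print(candidate, should_extract(candidate, **rules))
```

Under these assumptions, only `src/app.py` and `README.md` from the sample paths would be kept.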

Authentication

Rather than entering your GitHub authentication token on every run, you can store it in a centralized secrets manager that is easy to integrate. We recommend the keyvault library, available at https://github.com/ltoscano/keyvault.

This gives you a single, more secure place to manage your GitHub token.
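The keyvault API is not described here, so the sketch below uses a different, simpler stand-in: reading the token from an environment variable. The variable name `GITHUB_TOKEN` and the helper `load_github_token` are assumptions for illustration. The point is only that the token lives outside the script and is looked up at run time.

```python
import os


def load_github_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Read the token from the environment instead of hard-coding it.

    `env_var` is an assumed name; use whatever your setup defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a GitHub access token."
        )
    return token


# Usage with the extractor from the example above (commented out so this
# sketch runs without the package installed):
# extractor = GitHubRepoExtractor(
#     repo_input='https://github.com/username/repo.git',
#     access_token=load_github_token(),
# )
```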

Use Case: Preparing Codebases for LLMs

This tool is particularly valuable when working with Large Language Models (LLMs) that have large context windows. By encapsulating an entire codebase into a single file, you can:

  1. Easily feed the entire codebase into an LLM for analysis, code review, or understanding.
  2. Maintain context across multiple files and directories when discussing code with an LLM.
  3. Simplify the process of asking LLMs to perform tasks that require understanding of the entire project structure.

This approach allows for more comprehensive and context-aware interactions with LLMs when working with large software projects.
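Before sending the extracted file to a model, it can help to check roughly whether it fits the model's context window. The sketch below uses the common rule of thumb of about four characters per token, which is an approximation that varies by tokenizer and language. The function names and the default ratio are assumptions, not part of this package.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; real counts depend on the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, context_tokens: int, reserve_tokens: int = 0) -> bool:
    """Check the estimate against a context size, leaving room for
    `reserve_tokens` of prompt and response."""
    return estimate_tokens(text) <= context_tokens - reserve_tokens


# Example: check a file produced by extract_to_file('output.txt')
# with open('output.txt', encoding='utf-8') as handle:
#     print(fits_context(handle.read(), context_tokens=128_000, reserve_tokens=8_000))
```

For an accurate count, use the tokenizer that matches the model you plan to call.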

Contributing

Contributions are welcome! Please see the CONTRIBUTING.md file for guidelines on how to contribute to this project.

License

This project is licensed under the MIT License. See the LICENSE file for details.
