A tool to extract GitHub repositories into a single file

GitHub Repository Extractor

This Python script extracts the contents of a GitHub repository into a single text file. It is particularly useful for packing an entire codebase into one file so it can be passed to Large Language Models (LLMs) with large context windows.

Features

  • Support for both local and remote GitHub repositories
  • Flexible ignore and include lists for files, folders, and extensions
  • Progress bar to track extraction process
  • Handles binary files
  • Option to clone remote repositories temporarily
  • Ideal for preparing codebases for analysis by LLMs
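
As an illustration of what handling binary files can involve, here is a minimal sketch of one common heuristic: sample the start of a file and treat it as binary if it contains a NUL byte or is not valid UTF-8. The function name `looks_binary` and the sample size are assumptions made for this example; the package's actual check may differ.

```python
import codecs
from pathlib import Path


def looks_binary(path: Path, sample_size: int = 8192) -> bool:
    """Heuristic binary check (hypothetical helper, not the package's API).

    A file is treated as binary if the sampled bytes contain a NUL byte
    or cannot be decoded as UTF-8.
    """
    with path.open("rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return True
    # An incremental decoder tolerates a multi-byte character that is
    # cut off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False
```

A file flagged this way could be skipped or replaced by a short placeholder line in the output, so that raw bytes never end up in the text sent to an LLM.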

Dependencies

This project requires the following Python packages:

  • pygithub: For interacting with the GitHub API
  • tqdm: For displaying progress bars
  • gitpython: For handling Git operations

You can install these dependencies using pip:

pip install pygithub tqdm gitpython

Installation

  1. Clone this repository:
    git clone https://github.com/yourusername/github-repo-extractor.git
    
  2. From the cloned directory, install the required dependencies:
    pip install -r requirements.txt
    

Usage

  1. Import the GitHubRepoExtractor class from the script.
  2. Create an instance of GitHubRepoExtractor with your repository details.
  3. Set ignore and include lists as needed.
  4. Call the extract_to_file() method to start the extraction process.

Example:

from github_repo_extractor import GitHubRepoExtractor

extractor = GitHubRepoExtractor(
    repo_input='https://github.com/username/repo.git',
    access_token='your_github_token'
)

extractor.set_ignore_list(
    files=['.gitignore'],
    folders=['tests', '.github'],
    extensions=['.log']
)

extractor.set_include_list(
    files=['README.md'],
    extensions=['.py'],
    exclusive=True
)

extractor.extract_to_file('output.txt')
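
This README does not spell out how the ignore and include lists interact. The sketch below shows one plausible interpretation, which is an assumption rather than the package's documented behavior: with exclusive=True a path is kept only if it matches the include list and is not ignored, and with exclusive=False the include list rescues paths that would otherwise be ignored. The function `should_extract` and its dict-based arguments are hypothetical.

```python
from pathlib import PurePosixPath


def should_extract(path, ignore, include, exclusive=False):
    """Decide whether a repository-relative path would be extracted.

    `ignore` and `include` are dicts with optional 'files', 'folders'
    and 'extensions' keys, mirroring the keyword arguments in the
    example above. The precedence rules here are an assumption.
    """
    p = PurePosixPath(path)

    def matches(rules):
        return (
            p.name in rules.get("files", ())
            or p.suffix in rules.get("extensions", ())
            # Any parent directory named in 'folders' counts as a match.
            or any(part in rules.get("folders", ()) for part in p.parts[:-1])
        )

    if exclusive:
        # Only explicitly included paths survive, and ignore still applies.
        return matches(include) and not matches(ignore)
    # Non-exclusive: included paths are always kept, the rest are
    # kept unless an ignore rule matches.
    return matches(include) or not matches(ignore)
```

Under these assumed rules, and with the lists from the example, `src/main.py` and `README.md` would be extracted, while `tests/test_main.py` and `debug.log` would not.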

Authentication

To avoid being prompted for the GitHub authentication token on every run, you can keep the token in a centralized secret store. We recommend the keyvault library, available at https://github.com/ltoscano/keyvault.

This approach provides a more secure and centralized way to manage your GitHub token.
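
The keyvault API is not documented here, so the sketch below does not use it. It shows a simpler fallback under stated assumptions: the token is read from an environment variable, with the name `GITHUB_TOKEN` and the helper `load_github_token` both chosen for this example rather than taken from the project.

```python
import os


def load_github_token(env_var="GITHUB_TOKEN", environ=None):
    """Return the token stored in `env_var`, or None if unset or blank.

    Hypothetical helper: neither the function nor the variable name
    is defined by this package.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    return token or None
```

The returned value could then be passed as the access_token argument shown in the Usage example, which keeps the token out of the source code.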

Use Case: Preparing Codebases for LLMs

This tool is particularly valuable when working with Large Language Models (LLMs) that have large context windows. By packing an entire codebase into a single file, you can:

  1. Easily feed the entire codebase into an LLM for analysis, code review, or understanding.
  2. Maintain context across multiple files and directories when discussing code with an LLM.
  3. Simplify the process of asking LLMs to perform tasks that require understanding of the entire project structure.

This approach allows for more comprehensive and context-aware interactions with LLMs when working with large software projects.
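
To make the single-file idea concrete, here is a minimal standard-library sketch of the underlying technique: walk a local directory and write each matching file into one output file under a header naming its path. It is not the package's implementation. The function `bundle_directory`, the header format, and the default extension filter are all assumptions for illustration.

```python
from pathlib import Path


def bundle_directory(root, output_path, extensions=(".py",)):
    """Concatenate matching files under `root` into `output_path`.

    Hypothetical helper. Each file is preceded by a header line with
    its path relative to `root`. Returns the relative paths written.
    """
    root = Path(root)
    written = []
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in extensions:
                continue
            relative = path.relative_to(root).as_posix()
            out.write(f"===== {relative} =====\n")
            # Undecodable bytes are replaced rather than raising an error.
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n\n")
            written.append(relative)
    return written
```

The per-file headers are what let an LLM keep track of which file a given piece of code came from. The output file is best placed outside `root`, or given an extension that is not in `extensions`, so that it is not picked up on a later run.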

Contributing

Contributions are welcome! Please see the CONTRIBUTING.md file for guidelines on how to contribute to this project.

License

This project is licensed under the MIT License. See the LICENSE file for details.
