CLI tool for extracting text from Git repositories

📝 Gittxt: Extract Text from Git Repositories

Gittxt is a lightweight CLI tool that scans Git repositories (local or remote) and extracts text content into a consolidated file (.txt or .json).
It is designed for code summarization, AI preprocessing, offline reading, and documentation generation.

🚀 Features

  • Scan Local or Remote Repositories (git clone support)
  • Include & Exclude File Patterns (--include .py, --exclude node_modules)
  • Multi-threaded Scanning (Optimized for large repositories)
  • Supports JSON & TXT Output Formats (--format json)
  • Incremental Caching for Faster Scans (Skips unchanged files)
  • Force Full Rescan When Needed (--force-rescan)

📌 Installation

1️⃣ Clone the Repository

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt

2️⃣ Create & Activate Virtual Environment

python3 -m venv venv
source venv/bin/activate  # For Linux/macOS
venv\Scripts\activate      # For Windows

3️⃣ Install Dependencies

pip install -r requirements.txt

4️⃣ Install in Editable Mode (For Development)

pip install -e src/

📌 Usage

1️⃣ Scan a Local Folder

PYTHONPATH=src python src/gittxt/cli.py .

📌 Result: Writes the extracted text to gittxt_output.txt.


2️⃣ Scan a Remote GitHub Repository

PYTHONPATH=src python src/gittxt/cli.py https://github.com/torvalds/linux

📌 This will:

  • Clone the Linux kernel repository to a temporary directory.
  • Extract all readable text.
  • Save it in gittxt_output.txt.
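The clone step above can be pictured with a short Python sketch. This is an illustration only, not Gittxt's actual code: the helper names `is_remote`, `build_clone_command`, and `clone_to_temp` are hypothetical, and the shallow clone (`--depth 1`) is an assumption made here to keep large repositories manageable.

```python
import subprocess
import tempfile
from pathlib import Path


def is_remote(source: str) -> bool:
    """Treat anything that looks like a URL or SSH address as remote (assumed rule)."""
    return source.startswith(("http://", "https://", "git@"))


def build_clone_command(url: str, dest: Path) -> list[str]:
    """Build the git command line for cloning `url` into `dest`."""
    # --depth 1 fetches only the latest commit. Whether Gittxt clones
    # shallowly is not stated in this README; it is an assumption here.
    return ["git", "clone", "--depth", "1", url, str(dest)]


def clone_to_temp(url: str) -> tempfile.TemporaryDirectory:
    """Clone `url` into a fresh temporary directory and return its handle.

    The caller should call .cleanup() on the result (or use it as a
    context manager) once the scan is finished.
    """
    tmp = tempfile.TemporaryDirectory(prefix="gittxt_")
    subprocess.run(build_clone_command(url, Path(tmp.name)), check=True)
    return tmp
```

Returning the `TemporaryDirectory` handle instead of a bare path keeps cleanup in the caller's hands, so the clone is removed as soon as extraction is done.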

3️⃣ Customize Output (JSON & TXT)

Save as JSON (Structured Output)

PYTHONPATH=src python src/gittxt/cli.py . --format json --output repo_dump.json

Save as TXT (Default)

PYTHONPATH=src python src/gittxt/cli.py . --format txt --output repo_dump.txt
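To make the difference between the two formats concrete, here is a minimal Python sketch of how extracted files might be rendered. The `render` function, the `path`/`content` keys, and the `=== path ===` separator are all assumptions for illustration; the README does not document Gittxt's actual output layout.

```python
import json


def render(files: dict[str, str], fmt: str = "txt") -> str:
    """Render a mapping of file path -> text content as JSON or plain text."""
    if fmt == "json":
        # Structured output: one record per file (hypothetical schema).
        records = [{"path": p, "content": c} for p, c in sorted(files.items())]
        return json.dumps(records, indent=2)
    if fmt == "txt":
        # Flat output: each file's text under a header line (hypothetical layout).
        parts = [f"=== {p} ===\n{c}" for p, c in sorted(files.items())]
        return "\n\n".join(parts)
    raise ValueError(f"unsupported format: {fmt}")
```

A structured format like the JSON variant is easier to post-process programmatically, while the flat text variant is more convenient for reading or for pasting into a prompt.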

4️⃣ Include & Exclude Specific Files

Scan Only Python Files

PYTHONPATH=src python src/gittxt/cli.py . --include .py

Exclude node_modules, .log Files

PYTHONPATH=src python src/gittxt/cli.py . --exclude node_modules --exclude .log
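One plausible way to combine the two options is sketched below in Python. The function `should_scan` and its matching rules (exclusion wins, a pattern matches either a path component or a suffix) are assumptions chosen for illustration; Gittxt's real matching logic may differ.

```python
from pathlib import PurePosixPath


def should_scan(path: str, include: list[str], exclude: list[str]) -> bool:
    """Decide whether a repository-relative path should be scanned."""
    parts = PurePosixPath(path).parts
    # Exclusion wins: skip the file if a pattern names one of its
    # directories (e.g. node_modules) or matches its suffix (e.g. .log).
    for pattern in exclude:
        if pattern in parts or path.endswith(pattern):
            return False
    # With no include patterns, everything not excluded is scanned.
    if not include:
        return True
    return any(path.endswith(pattern) for pattern in include)
```

Under these assumed rules, `--include .py --exclude node_modules` would keep `src/app.py` but skip `node_modules/pkg/setup.py`, because exclusion is checked first.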

5️⃣ Improve Performance (Multi-threading)

Gittxt automatically optimizes scanning based on repository size.

📌 To set the number of workers manually, use --workers:

PYTHONPATH=src python src/gittxt/cli.py . --workers 8
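The effect of a worker count can be illustrated with Python's standard thread pool. This is a sketch under the assumption that scanning is dominated by file reads; the names `read_text` and `scan` are hypothetical and not part of Gittxt's API.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_text(path: Path) -> tuple[str, str]:
    """Read one file, replacing undecodable bytes instead of failing."""
    return str(path), path.read_text(encoding="utf-8", errors="replace")


def scan(paths: list[Path], workers: int = 8) -> dict[str, str]:
    """Read all files concurrently with `workers` threads."""
    # File reads are I/O-bound, so threads can overlap the waiting time
    # even though Python runs only one thread's bytecode at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as `paths`.
        return dict(pool.map(read_text, paths))
```

More workers help mostly when many files are read from slow storage; beyond a certain point, extra threads add overhead without further speedup.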

6️⃣ Caching: Skip Unchanged Files for Faster Scans

Gittxt remembers previously scanned files to avoid redundant processing.

First Scan (Full Processing)

PYTHONPATH=src python src/gittxt/cli.py .

Second Scan (Uses Cache for Faster Results)

PYTHONPATH=src python src/gittxt/cli.py .

🚀 The second scan is faster because unchanged files are skipped automatically.


7️⃣ Force a Full Rescan (Ignore Cache)

PYTHONPATH=src python src/gittxt/cli.py . --force-rescan

📌 Deletes .gittxt_cache.json and scans everything from scratch.
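The caching and force-rescan behavior described in steps 6 and 7 can be sketched in Python as follows. Only the cache file name `.gittxt_cache.json` comes from this README; the cache contents (a mapping of relative path to SHA-256 digest) and the function names are assumptions for illustration, and Gittxt may detect changes differently.

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = ".gittxt_cache.json"  # name from the README; the format below is assumed


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cache(root: Path) -> dict[str, str]:
    """Load the path -> digest mapping, or an empty one if no cache exists."""
    cache_path = root / CACHE_FILE
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    return {}


def changed_files(root: Path, paths: list[Path], force_rescan: bool = False) -> list[Path]:
    """Return the files that are new or modified since the last scan."""
    cache_path = root / CACHE_FILE
    if force_rescan:
        # Mirror --force-rescan: drop the cache so every file counts as changed.
        cache_path.unlink(missing_ok=True)
    cache = load_cache(root)
    changed = []
    for path in paths:
        digest = file_digest(path)
        key = str(path.relative_to(root))
        if cache.get(key) != digest:
            changed.append(path)
        cache[key] = digest
    # Persist the updated digests for the next run.
    cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return changed
```

Hashing file contents is more robust than comparing modification times, at the cost of reading every file once per scan; a real implementation might combine both.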


📌 Development & Contribution

Want to contribute? Follow these steps:

1️⃣ Run Tests

pytest tests/

2️⃣ Formatting & Linting

black src/

3️⃣ Open a Pull Request

  1. Fork the repo
  2. Create a new branch (feature/my-change)
  3. Push changes
  4. Submit a PR! 🚀

📌 License

This project is licensed under the MIT License.


🚀 Next Steps

  • [ ] Improve error handling for edge cases.
  • [ ] Add support for Markdown (.md) output.
  • [ ] Implement a Web UI for visualization.

📌 Made by Sandeep Paidipati


Download files

Source Distribution

gittxt-0.1.0.tar.gz (6.2 kB view details)

Uploaded Source

Built Distribution

gittxt-0.1.0-py3-none-any.whl (8.4 kB view details)

Uploaded Python 3

File details

Details for the file gittxt-0.1.0.tar.gz.

File metadata

  • Download URL: gittxt-0.1.0.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0f8efb24348e7897bcc7be55bc239b32565856b9324cbc3de8cbf3a87921e2d4
MD5 4b1d1ac7fa80017dfd6d02227e90fbc5
BLAKE2b-256 c53a1ec33fe6ceb2a4d37186881b31fc1e3bd0f4ba4e3540442d181e6f9396e5

File details

Details for the file gittxt-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: gittxt-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e7e258d9c67453daa6729bda68a3e111af7d9cc5a3bf962165936d3eabffd8ce
MD5 4aad0806335c1bf8be86badb8a4bb414
BLAKE2b-256 cfc08cb3b1509162967a068145e1cca9a0b29c4146c79726e3fcfc722810aa07
