
Gittxt: Get text from Git repositories in AI-ready formats


🚀 LLM Dataset Extractor from GitHub Repos | AI & NLP-ready text pipelines

๐Ÿ“ Gittxt: Get text from Git repositories in AI-ready formats.



✨ What is Gittxt?

Gittxt is a developer-focused CLI tool that extracts AI-ready text from Git repositories. Whether you're preparing datasets for AI models, NLP pipelines, or LLM fine-tuning, Gittxt automates the tedious task of repository scanning and text conversion.

Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:

  • Preparing training data for LLMs (e.g., ChatGPT, Claude, Mistral)
  • Documentation extraction for knowledge bases
  • Code summarization pipelines
  • Repository analysis for machine learning workflows

🚀 Features

  • ✅ Dynamic File-Type Filtering (--file-types=code,docs,images,csv,media,all)
  • ✅ Automatic Tree Generation with clean filtering (excludes .git/, __pycache__, etc.)
  • ✅ Multiple Output Formats: TXT, JSON, Markdown
  • ✅ Optional ZIP Packaging for non-text assets
  • ✅ CLI-friendly Progress Bars
  • ✅ Built-in Summary Reports (--summary)
  • ✅ Interactive & CI-ready Modes (--non-interactive)

๐Ÿ—๏ธ Installation

๐Ÿ“ฆ Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install

๐Ÿ Using pip (stable)

pip install gittxt

โš™๏ธ Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --file-types code,docs --summary

👉 This will:

  • Scan a GitHub repository
  • Extract code & docs files
  • Write .txt and .json outputs
  • Show a summary report
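
For scripted use, the same scan can be assembled programmatically. The sketch below is only an illustration: `build_scan_command` is a hypothetical helper, not part of Gittxt, and it uses only the flags shown in this README. It builds the argument list and leaves the actual invocation commented out, since running it assumes `gittxt` is installed and on the PATH.

```python
import shlex


def build_scan_command(repo, output_formats=("txt", "json"),
                       file_types=("code", "docs"),
                       summary=True, non_interactive=False):
    """Assemble an argv list for `gittxt scan` (hypothetical helper)."""
    argv = ["gittxt", "scan", repo]
    # Both flags take comma-separated lists, as in the quickstart command.
    argv += ["--output-format", ",".join(output_formats)]
    argv += ["--file-types", ",".join(file_types)]
    if summary:
        argv.append("--summary")
    if non_interactive:
        # Skips prompts, which is what a CI job usually wants.
        argv.append("--non-interactive")
    return argv


cmd = build_scan_command("https://github.com/sandy-sp/gittxt.git",
                         non_interactive=True)
print(shlex.join(cmd))
# To execute it (assumes gittxt is installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with repository URLs or paths that contain spaces.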

๐Ÿ–ฅ๏ธ CLI Usage

gittxt scan [REPOS]... [OPTIONS]

Options:
  --include TEXT        Include patterns (e.g., *.py)
  --exclude TEXT        Exclude patterns (e.g., tests/, node_modules)
  --size-limit INTEGER  Max file size in bytes
  --branch TEXT         Specify branch (for GitHub URLs)
  --file-types TEXT     code, docs, images, csv, media, all
  --output-format TEXT  txt, json, md, or comma-separated list
  --output-dir PATH     Custom output directory
  --summary             Show post-scan summary
  --non-interactive     Skip prompts for CI/CD workflows
  --progress            Enable scan progress bars
  --debug               Enable debug logs
  --help                Show this message and exit
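
To get a feel for how `--include`, `--exclude`, and `--size-limit` might combine, here is a minimal sketch using Python's standard `fnmatch` module. This README does not describe Gittxt's actual matching rules, so the precedence used here (exclude first, then size, then include) and the helper name `is_selected` are assumptions made for illustration only.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_selected(path, include=(), exclude=(), size=0, size_limit=None):
    """Decide whether a repo-relative path passes the filters (assumed rules)."""
    p = PurePosixPath(path)
    for pattern in exclude:
        pat = pattern.rstrip("/")
        # Assumption: an exclude pattern may match the whole path, the file
        # name, or any single directory component (e.g. "tests/").
        if fnmatch(path, pattern) or fnmatch(p.name, pat) or pat in p.parts:
            return False
    if size_limit is not None and size > size_limit:
        return False
    if include:
        return any(fnmatch(p.name, pat) or fnmatch(path, pat) for pat in include)
    return True


print(is_selected("src/app.py", include=["*.py"]))
print(is_selected("tests/test_app.py", include=["*.py"], exclude=["tests/"]))
```

Under these assumed rules, the first call keeps `src/app.py` and the second drops the test file because its directory is excluded.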

📂 Output Structure

<output_dir>/
├── text/
│   └── repo-name.txt
├── json/
│   └── repo-name.json
├── md/
│   └── repo-name.md
└── zips/
    └── repo-name_bundle.zip  # Optional ZIP for assets (images, csv, etc.)
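
A downstream script can predict where to look for results from the layout above. The following sketch derives the expected paths from the output directory, repository name, and requested formats. The helper `expected_outputs` and the `FORMAT_DIRS` mapping are hypothetical names that simply mirror the tree shown here; they are not a Gittxt API.

```python
from pathlib import Path

# Format name -> subdirectory, as read off the output tree above.
FORMAT_DIRS = {"txt": "text", "json": "json", "md": "md"}


def expected_outputs(output_dir, repo_name, formats, with_zip=False):
    """Return the paths a scan is expected to produce (hypothetical helper)."""
    base = Path(output_dir)
    paths = [base / FORMAT_DIRS[fmt] / f"{repo_name}.{fmt}" for fmt in formats]
    if with_zip:
        # The ZIP bundle is optional and only holds non-text assets.
        paths.append(base / "zips" / f"{repo_name}_bundle.zip")
    return paths


for path in expected_outputs("out", "gittxt", ["txt", "json"], with_zip=True):
    print(path.as_posix())
```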

🛠 How It Works

  1. 🔗 Clone GitHub/local repo (supports branch/subdir URLs)
  2. 🌳 Dynamically generate directory tree (excluding .git, __pycache__, etc.)
  3. 🗂️ Filter files based on type (code, docs, images, csv, media)
  4. 📝 Generate formatted outputs (TXT, JSON, MD)
  5. 📦 Package assets (optional ZIP for non-text)
  6. 🧹 Cleanup temporary files (cache-free design)
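
Steps 2 and 3 can be approximated in a few lines of standard-library Python. This is a rough sketch of the idea, not Gittxt's implementation: the extension-to-type table `TYPE_BY_EXT` is a made-up, deliberately tiny mapping, and `scan_tree` and `classify` are hypothetical names.

```python
import os
from pathlib import Path

SKIP_DIRS = {".git", "__pycache__"}

# Hypothetical mapping for illustration; the real tool's rules may differ.
TYPE_BY_EXT = {
    ".py": "code", ".js": "code",
    ".md": "docs", ".rst": "docs", ".txt": "docs",
    ".csv": "csv",
    ".png": "images", ".jpg": "images",
    ".mp4": "media",
}


def classify(path):
    """Map a file path to a type label by extension."""
    return TYPE_BY_EXT.get(Path(path).suffix.lower(), "other")


def scan_tree(root, wanted=("code", "docs")):
    """Walk root, skip unwanted directories, and keep files of wanted types."""
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            rel = Path(dirpath, name).relative_to(root).as_posix()
            kind = classify(rel)
            if kind in wanted:
                selected.append((rel, kind))
    return selected
```

Calling `scan_tree(".")` in a checked-out repository returns `(relative_path, type)` pairs for the code and docs files it recognizes.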

📊 Example Summary Output

📊 Summary Report:
 - Total files processed: 45
 - Output formats: txt, json
 - File type breakdown: {'code': 31, 'docs': 14}
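
The breakdown in a report like this is a per-type count. As a sketch under that assumption, the snippet below rebuilds the same three lines from a list of `(path, type)` pairs with `collections.Counter`. The function `format_summary` is hypothetical, the report text is copied from the example above (without the emoji), and the file names are placeholders.

```python
from collections import Counter


def format_summary(files, output_formats):
    """Render a summary like the example above from (path, type) pairs."""
    breakdown = Counter(kind for _, kind in files)
    return "\n".join([
        "Summary Report:",
        f" - Total files processed: {len(files)}",
        f" - Output formats: {', '.join(output_formats)}",
        f" - File type breakdown: {dict(breakdown)}",
    ])


# Placeholder input matching the counts in the example report.
files = [("a.py", "code")] * 31 + [("b.md", "docs")] * 14
print(format_summary(files, ["txt", "json"]))
```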

๐Ÿ” Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
View Security Policy


๐Ÿค Contributing

We welcome community contributions!


๐Ÿ›ฃ๏ธ Roadmap

  • FastAPI-powered web UI
  • AI-powered summaries (GPT/OpenAI integration)
  • Support YAML/CSV as additional output formats
  • Async file scanning (speed boost)

📄 License

MIT License © Sandeep Paidipati






Download files


Source Distribution

gittxt-1.5.0.tar.gz (16.9 kB)

Uploaded Source

Built Distribution


gittxt-1.5.0-py3-none-any.whl (20.5 kB)

Uploaded Python 3

File details

Details for the file gittxt-1.5.0.tar.gz.

File metadata

  • Size: 16.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-1.5.0.tar.gz
Algorithm Hash digest
SHA256 4dd6f7b97c73416a00fd4aa274f9c6714d87af043c69228b1680a2e040dbe3a6
MD5 79597cf0d6c92fa5b181663e93bede44
BLAKE2b-256 f7a2051e830d42d68d98c3db8cb6523ce8c23280e6607ce8c44499278da4d68f


File details

Details for the file gittxt-1.5.0-py3-none-any.whl.

File metadata

  • Size: 20.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-1.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5b6be8bc54555a4acf130616cf9674edf6ad1928e5c099633d6209ffffffe883
MD5 476c8a8de8c5901ce4fee92a3302d55c
BLAKE2b-256 95005e355ab4c8c26a6f3f61b523bbf891566a9ad5ff343d51184555559fc879

