
Gittxt: Get text from Git repositories in AI-ready formats


🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling



✨ What is Gittxt?

Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.

With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:

  • 🔍 Curating LLM training data from source code
  • 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
  • 📑 Extracting docs, comments, and Markdown files from large monorepos
  • 🧠 Analyzing repositories by token counts, file size, and content types
  • 📦 Bundling outputs for reproducibility and downstream pipelines

It supports both local folders and GitHub URLs with branch/subdir targeting.


🚀 Features

  • ✅ Dynamic File-Type Filtering (extension + MIME + content heuristics)
  • ✅ Smart Directory Tree Summaries with depth and exclude support
  • ✅ Multiple Output Formats: .txt, .json, .md, .zip
  • ✅ Lite Mode (--lite) for fast, minimal reports
  • ✅ ZIP Bundling with --zip, including summary.json, manifest.json, and assets
  • ✅ Rich Summary Tables with size, token, and type breakdowns
  • ✅ .gittxtignore support for repo-specific exclusions
  • ✅ Async File I/O for efficient scanning

🏗️ Installation

🐍 Using pip (stable)

pip install gittxt

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt install

⚙️ Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

👉 This will:

  • Scan the repository root
  • Output .txt and .json summary files
  • Bundle outputs in a ZIP with manifest and summary

More examples → Usage Examples


🖥️ CLI Usage

gittxt scan [OPTIONS] [REPOS]...

📦 Scan directories or GitHub repos (textual only).

Options

Option                                  Description
-x, --exclude-dir                       Exclude folder paths
-o, --output-dir PATH                   Custom output directory
-f, --output-format TEXT                Comma-separated: txt, json, md
-i, --include-patterns TEXT             Glob to include (only textual)
-e, --exclude-patterns TEXT             Glob to exclude
--zip                                   Create a ZIP bundle
--lite                                  Generate minimal output instead of full content
--sync                                  Opt-in to .gitignore usage
--size-limit INTEGER                    Max file size in bytes
--branch TEXT                           Git branch for remote repos
--tree-depth INTEGER                    Limit tree output to N levels
--log-level [debug|info|warning|error]  Set log verbosity level
--help                                  Show CLI help and exit

Run gittxt scan --help for the full reference.
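
When scans are driven from a script, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of Gittxt itself: `build_scan_command` is a hypothetical helper that only uses flags listed in the table above and places options before the repositories, as in the usage line.

```python
from __future__ import annotations

import shlex
from typing import Optional, Sequence


def build_scan_command(
    repos: Sequence[str],
    output_dir: Optional[str] = None,
    output_formats: Sequence[str] = (),
    branch: Optional[str] = None,
    tree_depth: Optional[int] = None,
    zip_bundle: bool = False,
    lite: bool = False,
) -> list[str]:
    """Build an argv list for `gittxt scan` (hypothetical helper)."""
    cmd = ["gittxt", "scan"]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if output_formats:
        # The option takes a single comma-separated value, e.g. "txt,json".
        cmd += ["--output-format", ",".join(output_formats)]
    if branch:
        cmd += ["--branch", branch]
    if tree_depth is not None:
        cmd += ["--tree-depth", str(tree_depth)]
    if zip_bundle:
        cmd.append("--zip")
    if lite:
        cmd.append("--lite")
    # Repositories (local paths or GitHub URLs) go last.
    cmd += list(repos)
    return cmd


if __name__ == "__main__":
    cmd = build_scan_command(
        ["https://github.com/sandy-sp/gittxt.git"],
        output_formats=["txt", "json"],
        zip_bundle=True,
        lite=True,
    )
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

With Gittxt installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.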


📦 Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
├── zips/              # .zip (optional)
│   └── manifest.json, summary.json, outputs/, assets/

See Formats Guide
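
Before handing a ZIP bundle to a downstream pipeline, a quick sanity check can confirm that the files named in the layout above are present. This is a rough sketch using only the Python standard library; `bundle_report` is a hypothetical helper, and it matches on file names only because the exact paths inside the archive are an assumption here.

```python
import zipfile
from pathlib import PurePosixPath

# File names taken from the layout above; their location inside the
# archive is not assumed, so only base names are compared.
EXPECTED = ("manifest.json", "summary.json")


def bundle_report(zip_path):
    """Return the entry count and any expected files missing from the bundle."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    basenames = {PurePosixPath(name).name for name in names}
    return {
        "entries": len(names),
        "missing": [name for name in EXPECTED if name not in basenames],
    }
```

An empty `missing` list means both files were found somewhere in the archive.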


🛠 How It Works

  1. 🔗 Load the repo: use a local folder or clone from GitHub (with branch/subdir support)
  2. 🌲 Walk repo with filtering and MIME rules
  3. 📑 Classify TEXTUAL vs NON-TEXTUAL
  4. 📝 Format output to .txt, .json, .md
  5. 📦 Bundle ZIP with summary + manifest (optional)
  6. 🧹 Clean temp state after scan
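
To illustrate the classification step, here is a simplified sketch of an extension-plus-content heuristic. It is not Gittxt's actual implementation: the extension sets, the null-byte check, and the `classify` function are all assumptions chosen for illustration, and MIME detection is left out.

```python
import codecs
from pathlib import Path

# Illustrative extension sets (assumed, not Gittxt's defaults).
TEXT_EXTS = {".py", ".md", ".txt", ".json"}
BINARY_EXTS = {".png", ".zip"}


def classify(path, sample_size=4096):
    """Return "TEXTUAL" or "NON-TEXTUAL" for a file path."""
    p = Path(path)
    ext = p.suffix.lower()
    # 1. Known extensions decide immediately, without reading the file.
    if ext in BINARY_EXTS:
        return "NON-TEXTUAL"
    if ext in TEXT_EXTS:
        return "TEXTUAL"
    # 2. Otherwise sniff the first bytes of the file.
    with open(p, "rb") as fh:
        sample = fh.read(sample_size)
    if b"\x00" in sample:
        return "NON-TEXTUAL"  # null bytes are a common sign of binary data
    try:
        # The incremental decoder tolerates a multi-byte character that is
        # cut off at the end of the sample.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "NON-TEXTUAL"
    return "TEXTUAL"
```

Checking extensions first keeps the common case cheap; the content check only runs for files the extension sets do not cover.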

🧰 Gittxt Installer

Run the interactive installer to configure Gittxt preferences:

gittxt config install

This command lets you:

  • Set default output directory and formats (txt/json/md)
  • Configure log level (DEBUG, INFO, WARNING, ERROR)
  • Enable or disable automatic ZIP bundling
  • Define or override:
    • Textual extensions (e.g. .py, .md)
    • Non-textual extensions (e.g. .png, .zip)
    • Excluded directories (e.g. .git, node_modules)

The config is saved to gittxt-config.json and used as default for all scans.


📄 Configuration

Gittxt can be configured through:

  • CLI flags (e.g., --output-dir, --size-limit)
  • Environment variables (e.g., GITTXT_OUTPUT_DIR)
  • A .gittxtignore file for exclusions

Config details → docs/CONFIGURATION.md
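
One way to think about how these sources might combine is a simple fallback chain. The sketch below assumes a precedence of CLI flag, then environment variable, then gittxt-config.json, then a built-in default; that ordering, the `output_dir` JSON key, and the `gittxt-output` default are assumptions made for illustration and are not confirmed by this README. Only `GITTXT_OUTPUT_DIR` and the config file name come from the text above.

```python
import json
import os
from pathlib import Path


def resolve_output_dir(
    cli_value=None,
    env=None,
    config_path="gittxt-config.json",
    default="gittxt-output",  # hypothetical default
):
    """Pick an output directory using an assumed precedence order."""
    env = os.environ if env is None else env
    # 1. An explicit CLI value wins.
    if cli_value:
        return cli_value
    # 2. Then the environment variable named in this README.
    if env.get("GITTXT_OUTPUT_DIR"):
        return env["GITTXT_OUTPUT_DIR"]
    # 3. Then the saved config file ("output_dir" is a hypothetical key).
    cfg = Path(config_path)
    if cfg.is_file():
        data = json.loads(cfg.read_text(encoding="utf-8"))
        if isinstance(data, dict) and data.get("output_dir"):
            return data["output_dir"]
    # 4. Finally the built-in fallback.
    return default
```

See docs/CONFIGURATION.md for the behavior Gittxt actually implements.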


🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines


🤝 Contributing

We welcome contributions from the community!


🛣️ Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

📄 License

MIT License © Sandeep Paidipati




Download files


Source Distribution

gittxt-1.7.0.tar.gz (31.0 kB)

Uploaded Source

Built Distribution


gittxt-1.7.0-py3-none-any.whl (42.1 kB)

Uploaded Python 3

File details

Details for the file gittxt-1.7.0.tar.gz.

File metadata

  • Download URL: gittxt-1.7.0.tar.gz
  • Upload date:
  • Size: 31.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-1.7.0.tar.gz
Algorithm    Hash digest
SHA256       4d761f963db4cb7a91d561fdbc3ba0c3707b6d978d843a660408c89071c90743
MD5          4d2d766b60616cf4caa2d93d9304f5d6
BLAKE2b-256  0b43a40675f5eab3502351292168cdc44e819289ac4d1277ecafbeb3b6b9b3c8


File details

Details for the file gittxt-1.7.0-py3-none-any.whl.

File metadata

  • Download URL: gittxt-1.7.0-py3-none-any.whl
  • Upload date:
  • Size: 42.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-1.7.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       eb3543eb3cb3b71ddccfd31eebc9a194e9ab23d3b98b6fdfb5f6bc3638cfd5dc
MD5          8f1d3972b566eac91088d6e45dfc2396
BLAKE2b-256  61817fbfb195cc9374f384f63aaa70ccc5d5e39822ec3929befc74c34f1ea4e5

