
Gittxt: Get text from Git repositories in AI-ready formats


AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling

๐Ÿ“ Gittxt: Get text from Git repositories in AI-ready formats

Python Version PyPI version Release Tested with Pytest PyPI Downloads GitHub repo size GitHub top language Build Status Made for LLMs License


What is Gittxt?

Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.

With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:

  • Curating LLM training data from source code
  • Converting repos into structured .txt, .json, .md, and .zip outputs
  • Extracting docs, comments, and markdown files from large monorepos
  • Analyzing repositories by token counts, file size, and content types
  • Bundling outputs for reproducibility and downstream pipelines

It supports both local folders and GitHub URLs with branch/subdir targeting.


Features

  • Dynamic File-Type Filtering (extension + MIME + content heuristics)
  • Smart Directory Tree Summaries with depth and exclude support
  • Multiple Output Formats: .txt, .json, .md, .zip
  • Lite Mode (--lite) for fast, minimal reports
  • ZIP Bundling with --zip, including summary.json, manifest.json, and assets
  • Rich Summary Tables with size, token, and type breakdowns
  • .gittxtignore support for repo-specific exclusions
  • Async File I/O for efficient scanning
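To make the "depth and exclude" idea behind the tree summaries concrete, here is a minimal Python sketch. It is not Gittxt's implementation: the function name `render_tree`, its parameters, and the indentation style are all assumptions chosen for illustration.

```python
import os
import tempfile


def render_tree(root, max_depth=2, exclude_dirs=(".git", "node_modules")):
    """Return an indented listing of `root`, limited to `max_depth` levels.

    Hypothetical helper: it only illustrates depth limiting and directory
    exclusion, and does not reproduce Gittxt's actual tree output.
    """
    lines = []

    def walk(path, depth):
        if depth > max_depth:
            return  # stop descending once the depth limit is reached
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            indent = "  " * (depth - 1)
            if os.path.isdir(full):
                if name in exclude_dirs:
                    continue  # skip excluded folders and everything below them
                lines.append(indent + name + "/")
                walk(full, depth + 1)
            else:
                lines.append(indent + name)

    walk(root, 1)
    return "\n".join(lines)


if __name__ == "__main__":
    # Build a small throwaway directory so the sketch can run anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "src", "pkg"))
        os.makedirs(os.path.join(tmp, "node_modules"))
        open(os.path.join(tmp, "README.md"), "w").close()
        open(os.path.join(tmp, "src", "main.py"), "w").close()
        print(render_tree(tmp, max_depth=2))
```

With `max_depth=2`, files inside `src/pkg/` are omitted while `src/pkg/` itself is still listed, which is roughly what a depth-limited summary is meant to convey.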

๐Ÿ—๏ธ Installation

๐Ÿ“ฆ Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional setup
poetry run gittxt install

๐Ÿ Using pip (stable)

pip install gittxt

Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

This will:

  • Scan the repository root
  • Output .txt and .json summary files
  • Bundle outputs in a ZIP with manifest and summary

More examples → Usage Examples


CLI Usage

gittxt scan [OPTIONS] [REPOS]...

Scan directories or GitHub repos (textual files only).

Options

Option                                    Description
-x, --exclude-dir                         Exclude folder paths
-o, --output-dir PATH                     Custom output directory
-f, --output-format TEXT                  Comma-separated: txt, json, md
-i, --include-patterns TEXT               Glob to include (only textual)
-e, --exclude-patterns TEXT               Glob to exclude
--zip                                     Create a ZIP bundle
--lite                                    Generate minimal output instead of full content
--sync                                    Opt-in to .gitignore usage
--size-limit INTEGER                      Max file size in bytes
--branch TEXT                             Git branch for remote repos
--tree-depth INTEGER                      Limit tree output to N levels
--log-level [debug|info|warning|error]    Set log verbosity level
--help                                    Show CLI help and exit

Run gittxt scan --help for the full reference.
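When scans are launched from a script, it can help to assemble the command line programmatically. The sketch below builds an argument list using only flags listed in the options table; the helper name `build_scan_command` and its parameters are hypothetical and not part of Gittxt.

```python
import shlex


def build_scan_command(repos, output_dir=None, formats=None, zip_bundle=False,
                       lite=False, branch=None, tree_depth=None, size_limit=None):
    """Assemble a `gittxt scan` argument list (hypothetical helper).

    Only flags documented in the options table are emitted.
    """
    cmd = ["gittxt", "scan", *repos]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if formats:
        # --output-format takes a comma-separated list, e.g. "txt,json"
        cmd += ["--output-format", ",".join(formats)]
    if zip_bundle:
        cmd.append("--zip")
    if lite:
        cmd.append("--lite")
    if branch:
        cmd += ["--branch", branch]
    if tree_depth is not None:
        cmd += ["--tree-depth", str(tree_depth)]
    if size_limit is not None:
        cmd += ["--size-limit", str(size_limit)]
    return cmd


if __name__ == "__main__":
    cmd = build_scan_command(
        ["https://github.com/sandy-sp/gittxt.git"],
        formats=["txt", "json"],
        zip_bundle=True,
        lite=True,
    )
    # Show the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the scan, assuming `gittxt` is installed and on the PATH.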


Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
├── zips/              # .zip (optional)
│   └── manifest.json, summary.json, outputs/, assets/

See Formats Guide
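A downstream pipeline may want to confirm that a bundle contains the files listed above before consuming it. The following sketch checks a ZIP for `manifest.json` and `summary.json` by base name. The helper `missing_bundle_entries` is hypothetical, and because the exact internal layout of the archive is not specified here, the check deliberately ignores folder prefixes.

```python
import zipfile

# File names taken from the layout above; their location inside the
# archive is an assumption, so only base names are compared.
EXPECTED_ENTRIES = ("manifest.json", "summary.json")


def missing_bundle_entries(zip_path, expected=EXPECTED_ENTRIES):
    """Return the expected file names that are absent from the ZIP."""
    with zipfile.ZipFile(zip_path) as zf:
        basenames = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return [name for name in expected if name not in basenames]
```

An empty list means both files were found; anything else names what is missing.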


How It Works

  1. Load the repo (local folder or GitHub clone, with branch/subdir support)
  2. Walk the repo, applying filters and MIME rules
  3. Classify files as TEXTUAL or NON-TEXTUAL
  4. Format output to .txt, .json, .md
  5. Bundle a ZIP with summary + manifest (optional)
  6. Clean up temp state after the scan
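The classification step can be pictured as a chain of checks: extension first, then MIME type, then a look at the file's bytes. The sketch below is one plausible way to do this with the Python standard library. It is not Gittxt's actual logic; the extension sets, the sample size, and the function name `classify` are assumptions.

```python
import codecs
import mimetypes
from pathlib import Path

# Example extension sets (assumed for illustration only).
TEXT_EXTS = {".py", ".md"}
NON_TEXT_EXTS = {".png", ".zip"}


def classify(path, sample_size=1024):
    """Label a file TEXTUAL or NON-TEXTUAL (illustrative heuristic)."""
    path = Path(path)
    ext = path.suffix.lower()

    # 1. Extension rules decide known cases without touching the file.
    if ext in NON_TEXT_EXTS:
        return "NON-TEXTUAL"
    if ext in TEXT_EXTS:
        return "TEXTUAL"

    # 2. A text/* MIME guess is accepted; other guesses fall through.
    mime, _ = mimetypes.guess_type(path.name)
    if mime and mime.startswith("text/"):
        return "TEXTUAL"

    # 3. Content heuristic: NUL bytes or invalid UTF-8 suggest binary data.
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    if b"\x00" in sample:
        return "NON-TEXTUAL"
    try:
        # The incremental decoder tolerates a multi-byte character that is
        # cut off at the end of the sample.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "NON-TEXTUAL"
    return "TEXTUAL"
```

Checking cheap signals first means most files are classified without being opened; only files with unknown extensions and non-text MIME guesses are sampled.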

Gittxt Installer

Run the interactive installer to configure Gittxt preferences:

gittxt install

This command lets you:

  • Set default output directory and formats (txt/json/md)
  • Configure log level (DEBUG, INFO, WARNING, ERROR)
  • Enable or disable automatic ZIP bundling
  • Define or override:
    • Textual extensions (e.g. .py, .md)
    • Non-textual extensions (e.g. .png, .zip)
    • Excluded directories (e.g. .git, node_modules)

The config is saved to gittxt-config.json and used as default for all scans.


Configuration

Gittxt can be configured through:

  • CLI flags (e.g., --output-dir, --size-limit)
  • Environment variables (e.g., GITTXT_OUTPUT_DIR)
  • .gittxtignore file support for exclusions

Config details → docs/CONFIGURATION.md
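When several configuration sources exist, a tool has to pick one value per setting. This README does not state which source wins, so the sketch below simply assumes a common convention: CLI flag first, then the `GITTXT_OUTPUT_DIR` environment variable, then saved defaults. The function `resolve_output_dir` and the `output_dir` key are hypothetical; consult docs/CONFIGURATION.md for the real behavior.

```python
import os


def resolve_output_dir(cli_value=None, env=None, saved_defaults=None, fallback="."):
    """Pick an output directory from layered sources.

    Assumed precedence (not confirmed by the Gittxt docs):
    CLI flag > GITTXT_OUTPUT_DIR > saved defaults > fallback.
    """
    env = os.environ if env is None else env
    saved_defaults = saved_defaults or {}

    if cli_value:
        return cli_value
    if env.get("GITTXT_OUTPUT_DIR"):
        return env["GITTXT_OUTPUT_DIR"]
    # "output_dir" is a hypothetical key name for the saved config.
    return saved_defaults.get("output_dir", fallback)
```

Passing `env` explicitly keeps the function easy to test without modifying the real environment.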


๐Ÿ” Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines


๐Ÿค Contributing

We welcome contributions from the community!


Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

License

MIT License © Sandeep Paidipati


Gittxt: Get text from Git repositories in AI-ready formats.

