Gittxt: Get text from Git repositories in AI-ready formats

🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling


✨ What is Gittxt?

Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.

With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:

  • 🔍 Curating LLM training data from source code
  • 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
  • 📑 Extracting docs, comments, and Markdown files from large monorepos
  • 🧠 Analyzing repositories by token counts, file size, and content types
  • 📦 Bundling outputs for reproducibility and downstream pipelines

It supports both local folders and GitHub URLs with branch/subdir targeting.


🚀 Features

  • ✅ Dynamic File-Type Filtering (extension + MIME + content heuristics)
  • ✅ Smart Directory Tree Summaries with depth and exclude support
  • ✅ Multiple Output Formats: .txt, .json, .md, .zip
  • ✅ Lite Mode (--lite) for fast, minimal reports
  • ✅ ZIP Bundling with --zip, including summary.json, manifest.json, and assets
  • ✅ Rich Summary Tables with size, token, and type breakdowns
  • ✅ .gittxtignore support for repo-specific exclusions
  • ✅ Async File I/O for efficient scanning
  • ✅ Reverse Engineering (gittxt re) to reconstruct repositories from reports

🏗️ Installation

🐍 Using pip (stable)

pip install gittxt

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt install

⚙️ Quickstart Example

# Scan and bundle
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

# Reverse engineer from report
gittxt re exports/gittxt_summary.txt

👉 This will:

  • Scan the repository root
  • Output .txt and .json summary files
  • Bundle outputs in a ZIP with manifest and summary
  • Reconstruct original files and structure from a Gittxt report

More examples → Usage Examples


🖥️ CLI Usage

gittxt scan [OPTIONS] [REPOS]...

📦 Scan directories or GitHub repos (textual only).

Options

Option                                    Description
-x, --exclude-dir                         Exclude folder paths
-o, --output-dir PATH                     Custom output directory
-f, --output-format TEXT                  Comma-separated: txt, json, md
-i, --include-patterns TEXT               Glob to include (only textual)
-e, --exclude-patterns TEXT               Glob to exclude
--zip                                     Create a ZIP bundle
--lite                                    Generate minimal output instead of full content
--sync                                    Opt-in to .gitignore usage
--size-limit INTEGER                      Max file size in bytes
--branch TEXT                             Git branch for remote repos
--tree-depth INTEGER                      Limit tree output to N levels
--log-level [debug|info|warning|error]    Set log verbosity level
--help                                    Show CLI help and exit

Run gittxt scan --help for the full reference.
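
When scans are launched from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of Gittxt itself: build_scan_command is a hypothetical helper, and it only uses flags listed in the options table above. It builds the command without running it.

```python
import shlex
from typing import List, Optional, Sequence


def build_scan_command(
    repos: Sequence[str],
    output_dir: Optional[str] = None,
    output_formats: Sequence[str] = (),
    branch: Optional[str] = None,
    tree_depth: Optional[int] = None,
    size_limit: Optional[int] = None,
    zip_bundle: bool = False,
    lite: bool = False,
) -> List[str]:
    """Assemble an argument list for `gittxt scan` (hypothetical helper)."""
    cmd = ["gittxt", "scan"]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if output_formats:
        # The CLI expects one comma-separated value, e.g. "txt,json".
        cmd += ["--output-format", ",".join(output_formats)]
    if branch:
        cmd += ["--branch", branch]
    if tree_depth is not None:
        cmd += ["--tree-depth", str(tree_depth)]
    if size_limit is not None:
        cmd += ["--size-limit", str(size_limit)]
    if zip_bundle:
        cmd.append("--zip")
    if lite:
        cmd.append("--lite")
    # Repositories (local paths or GitHub URLs) go last, as in the usage line.
    cmd += list(repos)
    return cmd


if __name__ == "__main__":
    command = build_scan_command(
        ["https://github.com/sandy-sp/gittxt.git"],
        output_formats=["txt", "json"],
        zip_bundle=True,
        lite=True,
    )
    # Print a shell-safe rendering instead of executing anything.
    print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once Gittxt is installed.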


Reverse Engineer Command

gittxt re [OPTIONS] REPORT_FILE

🔄 Reconstruct original files and structure from Gittxt .txt, .md, or .json reports. Outputs a ZIP with recovered content.

Options

Option              Description
-o, --output-dir    Custom output directory for reconstructed files

Example Usage

gittxt re path/to/report.txt

This will:

  • Take a Gittxt-generated report (.txt, .md, or .json)
  • Reconstruct the original file structure as a ZIP archive
  • Save the ZIP to the specified output directory or the current directory by default

📘 Learn more → Reverse Engineering Guide


📦 Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
├── zips/              # .zip (optional)
│   └── manifest.json, summary.json, outputs/, assets/

See Formats Guide
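
Before handing a bundle to a downstream pipeline, a quick sanity check of its contents can be useful. The sketch below is an illustration only: inspect_bundle is a hypothetical helper, and it assumes nothing about the bundle beyond the file names shown in the tree above (manifest.json and summary.json). It does not parse either file, since their schemas are not described here.

```python
import zipfile
from pathlib import PurePosixPath

# File names taken from the output tree above.
EXPECTED_FILES = ("manifest.json", "summary.json")


def inspect_bundle(zip_path):
    """List a ZIP bundle's entries and report which expected files are absent.

    `zip_path` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_path) as bundle:
        names = bundle.namelist()
    # Compare by base name so the check works wherever the files sit in the archive.
    basenames = {PurePosixPath(name).name for name in names}
    return {
        "entries": sorted(names),
        "missing": [name for name in EXPECTED_FILES if name not in basenames],
    }
```

An empty "missing" list means both expected files were found somewhere in the archive.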


🛠 How It Works

  1. 🔗 Resolve the repo (use a local folder, or clone from GitHub with branch/subdir support)
  2. 🌲 Walk the repo, applying exclusion filters and MIME rules
  3. 📑 Classify each file as TEXTUAL or NON-TEXTUAL
  4. 📝 Format output to .txt, .json, .md
  5. 📦 Bundle a ZIP with summary + manifest (optional)
  6. 🧹 Clean up temp state after the scan
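
Steps 2 and 3 can be pictured with a short sketch. This is not Gittxt's actual implementation: the function names, the extension sets, and the order of checks are all assumptions chosen for illustration. It combines the three signals named in the feature list (extension, MIME type, content heuristics) using only the Python standard library.

```python
import codecs
import mimetypes
from pathlib import Path

# Illustrative sets only; real defaults may differ.
TEXTUAL_EXTENSIONS = {".py", ".md"}
NON_TEXTUAL_EXTENSIONS = {".png", ".zip"}
EXCLUDED_DIRS = {".git", "node_modules"}


def looks_textual(path: Path, sample_size: int = 2048) -> bool:
    """Guess whether a file is text: extension first, then MIME, then content."""
    suffix = path.suffix.lower()
    if suffix in NON_TEXTUAL_EXTENSIONS:
        return False
    if suffix in TEXTUAL_EXTENSIONS:
        return True
    mime, _ = mimetypes.guess_type(path.name)
    if mime is not None and mime.startswith("text/"):
        return True
    # Content heuristic: no NUL bytes and a valid UTF-8 prefix.
    try:
        with path.open("rb") as handle:
            sample = handle.read(sample_size)
    except OSError:
        return False
    if b"\x00" in sample:
        return False
    try:
        # Incremental decoding tolerates a multi-byte character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def classify_tree(root: Path) -> dict:
    """Walk `root`, skip excluded directories, and bucket files by class."""
    result = {"TEXTUAL": [], "NON-TEXTUAL": []}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if any(part in EXCLUDED_DIRS for part in rel.parts):
            continue
        if path.is_file():
            key = "TEXTUAL" if looks_textual(path) else "NON-TEXTUAL"
            result[key].append(rel.as_posix())
    return result
```

Checking the extension first keeps the common case cheap; file contents are read only when neither the extension nor the MIME guess settles the question.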

🧰 Gittxt Installer

Run the interactive installer to configure Gittxt preferences:

gittxt config install

This command lets you:

  • Set default output directory and formats (txt/json/md)
  • Configure log level (DEBUG, INFO, WARNING, ERROR)
  • Enable or disable automatic ZIP bundling
  • Define or override:
    • Textual extensions (e.g. .py, .md)
    • Non-textual extensions (e.g. .png, .zip)
    • Excluded directories (e.g. .git, node_modules)

The config is saved to gittxt-config.json and used as default for all scans.


📄 Configuration

Gittxt can be configured through:

  • CLI flags (e.g., --output-dir, --size-limit)
  • Environment variables (e.g., GITTXT_OUTPUT_DIR)
  • .gittxtignore file support for exclusions

Config details → docs/CONFIGURATION.md
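
When wrapping Gittxt in a script, layered settings like these are usually resolved in a fixed order. The sketch below assumes a CLI value wins over the GITTXT_OUTPUT_DIR environment variable, which in turn wins over a default. That precedence, the resolve_output_dir helper, and the DEFAULT_OUTPUT_DIR value are all assumptions for illustration; check docs/CONFIGURATION.md for the order Gittxt actually applies.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback, used only in this sketch.
DEFAULT_OUTPUT_DIR = "gittxt-output"


def resolve_output_dir(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick an output directory: CLI value, then environment, then default.

    The precedence order is an assumption, not documented Gittxt behavior.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    env_value = env.get("GITTXT_OUTPUT_DIR")
    if env_value:
        return env_value
    return DEFAULT_OUTPUT_DIR
```

Passing env explicitly keeps the function easy to test without touching the real process environment.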


🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines


🤝 Contributing

We welcome contributions from the community!


🛣️ Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

📄 License

MIT License © Sandeep Paidipati


Gittxt: Get text from Git repositories in AI-ready formats.


