Gittxt: Get text from Git repositories in AI-ready formats

🚀 LLM Dataset Extractor from GitHub Repos | AI & NLP-ready text pipelines

✨ What is Gittxt?

Gittxt is a developer-focused CLI tool that extracts AI-ready text from Git repositories. Whether you're preparing datasets for AI models, NLP pipelines, or LLM fine-tuning, Gittxt automates the tedious task of repository scanning and text conversion.

Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:

  • Preparing training data for LLMs (e.g., ChatGPT, Claude, Mistral)
  • Documentation extraction for knowledge bases
  • Code summarization pipelines
  • Repository analysis for machine learning workflows

🚀 Features

  • ✅ Dynamic File-Type Filtering (based on extension + MIME + content heuristics)
  • ✅ Smart Directory Tree Summaries with configurable depth and excludes
  • ✅ Multiple Output Formats: .txt, .json, .md, .zip
  • ✅ Lite Mode (--lite) for fast, minimal reports
  • ✅ ZIP Bundling with --zip including summary.json and assets
  • ✅ Rich Summary Tables with size, tokens, and file breakdowns
  • ✅ .gittxtignore support for per-repo custom exclusion
  • ✅ Async I/O and CLI Progress Bars for performance and UX

🏗️ Installation

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install

🐍 Using pip (stable)

pip install gittxt

⚙️ Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

👉 This will:

  • Scan the repository root
  • Write .txt and .json summary files in lite mode (--lite), which omits full file content
  • Bundle them in a ZIP

For more real-world usage, see Usage Examples.


🖥️ CLI Usage

gittxt scan [REPOS]... [OPTIONS]

Common Flags

Option              Description
--include-patterns  Glob to include (e.g., *.py, docs/**/*.md)
--exclude-patterns  Glob to exclude (e.g., tests/, *.zip)
--size-limit        Skip files larger than N bytes
--branch            Use a specific branch for remote repos
--zip               Create a bundled ZIP archive
--lite              Minimal output without full content
--output-dir        Where to write outputs
--output-format     txt, json, md, or comma-separated list

Run gittxt scan --help for the full CLI reference.
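
When a scan is driven from a script, the flags listed above can be assembled programmatically. The following is a minimal sketch, not part of Gittxt itself: `build_scan_command` and its parameter names are hypothetical, and it only uses flags documented in this README. It builds the argument list without running anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_scan_command(
    repo: str,
    output_formats: Sequence[str] = ("txt",),
    output_dir: Optional[str] = None,
    branch: Optional[str] = None,
    make_zip: bool = False,
    lite: bool = False,
) -> List[str]:
    """Assemble a `gittxt scan` argument list (hypothetical helper)."""
    # --output-format takes a single value or a comma-separated list.
    cmd = ["gittxt", "scan", repo, "--output-format", ",".join(output_formats)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if branch is not None:
        cmd += ["--branch", branch]
    if make_zip:
        cmd.append("--zip")
    if lite:
        cmd.append("--lite")
    return cmd


if __name__ == "__main__":
    cmd = build_scan_command(
        "https://github.com/sandy-sp/gittxt.git",
        output_formats=("txt", "json"),
        make_zip=True,
        lite=True,
    )
    # Only print the command here; to execute it, pass the list to
    # subprocess.run(cmd, check=True) with Gittxt installed.
    print(shlex.join(cmd))
```

Keeping the command as a list (rather than one shell string) avoids quoting problems when a repository path contains spaces.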


๐Ÿ“ฆ Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
└── zips/              # .zip (optional)

See the Formats Guide.
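
To inspect what a scan produced, the layout above can be walked with the standard library. This is a rough sketch under stated assumptions: `collect_outputs` is a hypothetical helper, the subfolder names are copied from the tree above, and the names of the files inside each folder are not specified by this README.

```python
from pathlib import Path
from typing import Dict, List

# Subfolder names taken from the layout shown above.
OUTPUT_SUBDIRS = ("text", "json", "md", "zips")


def collect_outputs(output_dir: str) -> Dict[str, List[str]]:
    """Map each output subfolder to the sorted file names found inside it."""
    root = Path(output_dir)
    found: Dict[str, List[str]] = {}
    for name in OUTPUT_SUBDIRS:
        folder = root / name
        if folder.is_dir():
            found[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            # A folder may be missing, e.g. zips/ is optional.
            found[name] = []
    return found
```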


🛠 How It Works

  1. 🔗 Clone repo (supports GitHub, local, subdirs)
  2. 🌲 Walk files with exclusion rules and MIME checks
  3. 📑 Classify files as TEXTUAL or NON-TEXTUAL
  4. 📄 Format text files to .txt, .json, .md
  5. 📦 Zip outputs and assets (optional)
  6. 🧹 Remove temp files (stateless design)
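
The classification step (3) can be illustrated with a small sketch that combines extension, MIME type, and a content check. This is not Gittxt's actual implementation: `looks_textual`, `TEXT_EXTENSIONS`, and the 4096-byte sample size are assumptions chosen for illustration, and the real heuristics may differ.

```python
import codecs
import mimetypes
from pathlib import Path

# Hypothetical extension allow-list; Gittxt's real rules may differ.
TEXT_EXTENSIONS = {".py", ".md", ".txt", ".json", ".yaml", ".yml", ".toml", ".rst"}


def looks_textual(path: str, sample_size: int = 4096) -> bool:
    """Return True if the file is probably text, False otherwise."""
    p = Path(path)
    with p.open("rb") as fh:
        sample = fh.read(sample_size)

    # Content check first: a NUL byte almost always means binary data.
    if b"\x00" in sample:
        return False

    # Extension check against the allow-list.
    if p.suffix.lower() in TEXT_EXTENSIONS:
        return True

    # MIME check based on the file name.
    mime, _ = mimetypes.guess_type(p.name)
    if mime is not None and mime.startswith("text/"):
        return True

    # Fallback: accept the file if the sample decodes as UTF-8. The incremental
    # decoder tolerates a multi-byte character cut off at the sample boundary.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Running the content check before the extension check means a mislabeled binary (for example, a `.txt` file containing NUL bytes) is still rejected.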

🧪 Running Tests

make test
  • Generates a test repo with multiple edge cases
  • Runs full suite with Pytest
  • Cleans up outputs

Test docs: tests/README.md


📄 Configuration

  • Override via CLI flags
  • Or set env vars like GITTXT_OUTPUT_DIR
  • .gittxtignore works like .gitignore

Advanced setup: docs/CONFIGURATION.md
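
One way to picture how a CLI flag and an environment variable interact is the conventional precedence order: explicit flag first, then environment variable, then a built-in default. The sketch below assumes that order; this README does not spell it out. `resolve_output_dir` and `DEFAULT_OUTPUT_DIR` are hypothetical names, and the default value is invented for illustration.

```python
import os
from typing import Mapping, Optional

# Hypothetical default, not taken from Gittxt.
DEFAULT_OUTPUT_DIR = "gittxt-output"


def resolve_output_dir(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the output directory: CLI flag, then GITTXT_OUTPUT_DIR, then default."""
    env = os.environ if env is None else env
    if cli_value:
        return cli_value  # value passed via --output-dir
    env_value = env.get("GITTXT_OUTPUT_DIR")
    if env_value:
        return env_value
    return DEFAULT_OUTPUT_DIR
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.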


🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
View Security Policy


🤝 Contributing

We welcome community contributions!


🛣️ Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

📄 License

MIT License © Sandeep Paidipati


Download files

Source Distribution

gittxt-1.5.8.tar.gz (29.6 kB)

Uploaded Source

Built Distribution

gittxt-1.5.8-py3-none-any.whl (41.2 kB)

Uploaded Python 3

File details

Details for the file gittxt-1.5.8.tar.gz.

File metadata

  • Download URL: gittxt-1.5.8.tar.gz
  • Upload date:
  • Size: 29.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-1.5.8.tar.gz

Algorithm    Hash digest
SHA256       5bea580246cd37b68e13ccbccc2f49b09b13ac84029ebfc2b2c8a89897b98ca0
MD5          0554d7d02245d67f4db27460d1219f42
BLAKE2b-256  e2c7fc6f4177e0b6c6c6718dae0c55d892dd0eca28f6d4258196c816759610fb
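
A downloaded archive can be checked against the published SHA256 digest with Python's standard library. This is a minimal sketch: `sha256_of` and `matches_sha256` are hypothetical helper names, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches_sha256("gittxt-1.5.8.tar.gz", "5bea580246cd37b68e13ccbccc2f49b09b13ac84029ebfc2b2c8a89897b98ca0")` should return True for an intact copy of the source distribution listed above.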

File details

Details for the file gittxt-1.5.8-py3-none-any.whl.

File metadata

  • Download URL: gittxt-1.5.8-py3-none-any.whl
  • Upload date:
  • Size: 41.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for gittxt-1.5.8-py3-none-any.whl

Algorithm    Hash digest
SHA256       2ccd3245cc1a19e2e83adaccabd2859dbbb6f15b3cf2276d2c857cf7056aac02
MD5          31348d4aab33da96cea01cff6716584f
BLAKE2b-256  9cadb53dbce65b41e4503c34de483fac2ecca1e619dfdda83a01b17db5a26cf3
