Gittxt: Get text from Git repositories in AI-ready formats

🚀 LLM Dataset Extractor from GitHub Repos | AI & NLP-ready text pipelines

๐Ÿ“ Gittxt: Get text from Git repositories in AI-ready formats

Python Version PyPI version Release Tested with Pytest PyPI Downloads GitHub repo size GitHub top language Build Status Made for LLMs Linted with Ruff License


✨ What is Gittxt?

Gittxt is a developer-focused CLI tool that extracts AI-ready text from Git repositories. Whether you're preparing datasets for AI models, NLP pipelines, or LLM fine-tuning, Gittxt automates the tedious task of repository scanning and text conversion.

Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:

  • Preparing training data for LLMs (e.g., ChatGPT, Claude, Mistral)
  • Documentation extraction for knowledge bases
  • Code summarization pipelines
  • Repository analysis for machine learning workflows

🚀 Features

  • ✅ Dynamic File-Type Filtering (based on extension + MIME + content heuristics)
  • ✅ Smart Directory Tree Summaries with configurable depth and excludes
  • ✅ Multiple Output Formats: .txt, .json, .md, .zip
  • ✅ Lite Mode (--lite) for fast, minimal reports
  • ✅ ZIP Bundling with --zip including summary.json and assets
  • ✅ Rich Summary Tables with size, tokens, and file breakdowns
  • ✅ .gittxtignore support for per-repo custom exclusion
  • ✅ Async I/O and CLI Progress Bars for performance and UX

🏗️ Installation

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install

๐Ÿ Using pip (stable)

pip install gittxt

โš™๏ธ Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

👉 This will:

  • Scan the repository root
  • Output .txt + .json summary files
  • Bundle them in a ZIP
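
To drive the same scan from a Python script, one option is to assemble the command shown above and pass it to subprocess. This is a minimal sketch: build_scan_command is a hypothetical helper, not part of Gittxt, and it relies only on the flags documented in this README.

```python
import shlex
import subprocess


def build_scan_command(repo, formats=("txt", "json"), zip_bundle=True, lite=True):
    """Assemble a `gittxt scan` argument list (hypothetical helper, not a Gittxt API)."""
    cmd = ["gittxt", "scan", repo, "--output-format", ",".join(formats)]
    if zip_bundle:
        cmd.append("--zip")   # bundle the outputs in a ZIP archive
    if lite:
        cmd.append("--lite")  # minimal output without full content
    return cmd


cmd = build_scan_command("https://github.com/sandy-sp/gittxt.git")
print(shlex.join(cmd))

# To actually run the scan (requires gittxt to be installed and on PATH):
# subprocess.run(cmd, check=True)
```

Building the argument list instead of a single shell string avoids quoting problems when a repository path contains spaces.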

For more real-world usage: Usage Examples →


๐Ÿ–ฅ๏ธ CLI Usage

gittxt scan [REPOS]... [OPTIONS]

Common Flags

| Option | Description |
| --- | --- |
| --include-patterns | Glob to include (e.g., *.py, docs/**/*.md) |
| --exclude-patterns | Glob to exclude (e.g., tests/, *.zip) |
| --size-limit | Skip files larger than N bytes |
| --branch | Use a specific branch for remote repos |
| --zip | Create a bundled ZIP archive |
| --lite | Minimal output without full content |
| --output-dir | Where to write outputs |
| --output-format | txt, json, md, or comma-separated list |

Run gittxt scan --help for the full CLI reference.
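
To make the interplay of --include-patterns, --exclude-patterns, and --size-limit concrete, here is a rough Python sketch of how such filters could combine. It is not Gittxt's implementation. The function is_selected and its semantics are assumptions: excludes win over includes, a pattern ending in "/" matches a directory anywhere in the path, and an empty include list selects everything.

```python
from fnmatch import fnmatchcase


def is_selected(path, size, include=(), exclude=(), size_limit=None):
    """Decide whether a repo-relative POSIX path passes the filters.

    Assumed semantics (not confirmed by Gittxt): excludes win over includes,
    a pattern ending in "/" names a directory anywhere in the path, and an
    empty include list means "include everything".
    """
    if size_limit is not None and size > size_limit:
        return False  # analogous to --size-limit: skip files larger than N bytes
    directories = path.split("/")[:-1]
    for pattern in exclude:
        if pattern.endswith("/"):
            if pattern.rstrip("/") in directories:
                return False
        elif fnmatchcase(path, pattern):
            return False
    if not include:
        return True
    # fnmatch-style "*" also matches "/", so "*.py" selects nested files too.
    return any(fnmatchcase(path, pattern) for pattern in include)


print(is_selected("src/app.py", 120, include=["*.py"], exclude=["tests/"]))
print(is_selected("tests/test_app.py", 120, include=["*.py"], exclude=["tests/"]))
```

The first call selects the file and the second rejects it, because the tests/ exclusion takes priority over the *.py inclusion in this sketch.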


📦 Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
└── zips/              # .zip (optional)

See Formats Guide →
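
Downstream tooling often needs to locate what a scan produced. The sketch below walks the directory layout shown above and groups files by format. The helper collect_outputs is hypothetical, and it makes no assumption about the file names Gittxt writes inside each subdirectory.

```python
from pathlib import Path

# Subdirectory names taken from the output layout in this README.
FORMAT_DIRS = ("text", "json", "md", "zips")


def collect_outputs(output_dir):
    """Map each existing format subdirectory to the sorted file names inside it."""
    root = Path(output_dir)
    found = {}
    for name in FORMAT_DIRS:
        folder = root / name
        if folder.is_dir():  # e.g. zips/ is optional and may be absent
            found[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
    return found
```

Calling collect_outputs on an output directory returns a dictionary keyed by subdirectory name, omitting any format that was not generated.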


🛠 How It Works

  1. 🔗 Clone repo (supports GitHub, local, subdirs)
  2. 🌲 Walk files with exclusion rules and MIME checks
  3. 📑 Classify files as TEXTUAL or NON-TEXTUAL
  4. 📄 Format text files to .txt, .json, .md
  5. 📦 Zip outputs and assets (optional)
  6. 🧹 Remove temp files (stateless design)
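
Steps 2 and 3 above can be sketched in Python as a directory walk followed by a textual/non-textual check. This is an illustration under stated assumptions, not Gittxt's actual classifier: looks_textual and classify_tree are hypothetical names, and the heuristic (MIME guess, then a null-byte and UTF-8 check on a sample) is only one plausible way to combine MIME and content checks.

```python
import codecs
import mimetypes
import os


def looks_textual(path, sample_size=4096):
    """Heuristic check: MIME guess first, then inspect a sample of the content."""
    mime, _ = mimetypes.guess_type(path)
    if mime and mime.startswith(("image/", "audio/", "video/")):
        return False
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    if b"\x00" in sample:
        return False  # null bytes are a strong hint of binary data
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def classify_tree(root):
    """Walk root and return (textual, non_textual) lists of relative POSIX paths."""
    textual, non_textual = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip Git metadata
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            (textual if looks_textual(full) else non_textual).append(rel)
    return sorted(textual), sorted(non_textual)
```

Reading only a small sample keeps the check cheap on large files, at the cost of occasionally misclassifying files whose binary content starts late.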

🧪 Running Tests

make test

  • Generates a test repo with multiple edge cases
  • Runs full suite with Pytest
  • Cleans up outputs

Test docs → tests/README.md


📄 Configuration

  • Override via CLI flags
  • Or set env vars like GITTXT_OUTPUT_DIR
  • .gittxtignore works like .gitignore
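
The configuration sources listed above suggest a precedence order, which the following Python sketch illustrates. It assumes, without confirmation from the Gittxt source, that a CLI flag overrides GITTXT_OUTPUT_DIR, which in turn overrides a built-in default. The function names and the default value "gittxt_output" are hypothetical. The second helper shows the simplest .gitignore-style parsing (blank lines and # comments skipped) and does not cover negation or escaping.

```python
import os


def resolve_output_dir(cli_value=None, env=None, default="gittxt_output"):
    """Pick the output directory: CLI flag, then env var, then default.

    The precedence and the default name are assumptions for illustration.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    return env.get("GITTXT_OUTPUT_DIR") or default


def read_ignore_patterns(text):
    """Parse .gitignore-style text, skipping blank lines and '#' comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


print(resolve_output_dir(cli_value=None, env={"GITTXT_OUTPUT_DIR": "/tmp/out"}))
print(read_ignore_patterns("# build artifacts\n*.zip\n\ntests/\n"))
```

Passing env explicitly keeps the lookup easy to test without touching the real process environment.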

Advanced setup → docs/CONFIGURATION.md


๐Ÿ” Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
View Security Policy


๐Ÿค Contributing

We welcome community contributions!


๐Ÿ›ฃ๏ธ Roadmap

  • โœ… Async file scanning
  • โœ… ZIP archive export with manifest
  • โœ… Lite mode output
  • โณ AI-powered summaries (GPT, Claude)
  • โณ YAML + CSV output support
  • โณ Web UI via FastAPI

๐Ÿ“„ License

MIT License ยฉ Sandeep Paidipati


Gittxt: Get text from Git repositories in AI-ready formats.

Download files

Source Distribution

gittxt-1.5.9.tar.gz (29.6 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

Hashes for gittxt-1.5.9.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 87bea4eede25f9f896245bf25f199f690084a5fbcc0bd7ef0d433dfb76f9aeaa |
| MD5 | cca5d9bbfc25b08af5ccee96f1481752 |
| BLAKE2b-256 | ceb5d9c7597fef50db81d20a94cb0054d4ce4fa76a867fb7e07529d153286372 |

Built Distribution

gittxt-1.5.9-py3-none-any.whl (41.2 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

Hashes for gittxt-1.5.9-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 123edc06b7cac2cde82d99dc5463d1418410988fe3c2279a44989dac50eb3d4e |
| MD5 | 97ad72df3e5f61a368f5b9ebd8a381a0 |
| BLAKE2b-256 | c36bac9929e8c4817a789f3c9177e21313a2882c7224bb1e787f4fcc4b325b0f |
