Gittxt: Get text from Git repositories in AI-ready formats
AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling
What is Gittxt?
Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.
With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:
- Curating LLM training data from source code
- Converting repos into structured .txt, .json, .md, and .zip outputs
- Extracting docs, comments, and markdown files from large monorepos
- Analyzing repositories by token counts, file size, and content types
- Bundling outputs for reproducibility and downstream pipelines
It supports both local folders and GitHub URLs with branch/subdir targeting.
Features
- Dynamic File-Type Filtering (extension + MIME + content heuristics)
- Smart Directory Tree Summaries with depth and exclude support
- Multiple Output Formats: .txt, .json, .md, .zip
- Lite Mode (--lite) for fast, minimal reports
- ZIP Bundling with --zip, including summary.json, manifest.json, and assets
- Rich Summary Tables with size, token, and type breakdowns
- .gittxtignore support for repo-specific exclusions
- Async File I/O for efficient scanning
- Reverse Engineering (gittxt re) to reconstruct repositories from reports
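To make the "directory tree summary with depth and exclude support" feature concrete, here is a minimal Python sketch of the idea. It is not Gittxt's implementation: the function name `render_tree`, its parameters, and the output layout are all assumptions made for illustration.

```python
from pathlib import Path


def render_tree(root, max_depth=2, exclude=(".git", "node_modules")):
    """Return an indented listing of `root`, limited to `max_depth` levels.

    Hypothetical helper: entries whose name is in `exclude` are skipped.
    """
    lines = [f"{Path(root).name}/"]

    def walk(directory, depth):
        if depth > max_depth:
            return
        # Directories first, then files, each group sorted by name.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            if entry.name in exclude:
                continue
            indent = "  " * depth
            if entry.is_dir():
                lines.append(f"{indent}{entry.name}/")
                walk(entry, depth + 1)
            else:
                lines.append(f"{indent}{entry.name}")

    walk(Path(root), 1)
    return "\n".join(lines)
```

Calling `print(render_tree("."))` in a checkout lists two levels below the current directory while skipping `.git` and `node_modules`.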
Installation
Using pip (stable)
pip install gittxt
Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt install
Quickstart Example
# Scan and bundle
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite
# Reverse engineer from report
gittxt re exports/gittxt_summary.txt
This will:
- Scan the repository root
- Output .txt and .json summary files
- Bundle outputs in a ZIP with manifest and summary
- Reconstruct original files and structure from a Gittxt report
More examples → Usage Examples
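If you drive Gittxt from a script rather than a shell, the quickstart scan can be assembled as an argument list. The sketch below uses only the flags shown in this README (`--output-format`, `--zip`, `--lite`). The helper name `build_scan_command` is hypothetical, and Gittxt itself is invoked only in the commented-out line.

```python
import shlex


def build_scan_command(repo, formats=("txt", "json"), zip_bundle=False, lite=False):
    """Assemble a `gittxt scan` argument list from flags shown in this README."""
    argv = ["gittxt", "scan", repo, "--output-format", ",".join(formats)]
    if zip_bundle:
        argv.append("--zip")
    if lite:
        argv.append("--lite")
    return argv


argv = build_scan_command(
    "https://github.com/sandy-sp/gittxt.git", zip_bundle=True, lite=True
)
print(shlex.join(argv))
# To actually run the scan (requires gittxt on PATH):
#   import subprocess; subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with unusual repository paths.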
CLI Usage
gittxt scan [OPTIONS] [REPOS]...
Scan directories or GitHub repos (textual only).
Options
| Option | Description |
|---|---|
| -x, --exclude-dir | Exclude folder paths |
| -o, --output-dir PATH | Custom output directory |
| -f, --output-format TEXT | Comma-separated: txt, json, md |
| -i, --include-patterns TEXT | Glob to include (only textual) |
| -e, --exclude-patterns TEXT | Glob to exclude |
| --zip | Create a ZIP bundle |
| --lite | Generate minimal output instead of full content |
| --sync | Opt-in to .gitignore usage |
| --size-limit INTEGER | Max file size in bytes |
| --branch TEXT | Git branch for remote repos |
| --tree-depth INTEGER | Limit tree output to N levels |
| --log-level [debug\|info\|warning\|error] | Set log verbosity level |
| --help | Show CLI help and exit |
Run gittxt scan --help for the full reference.
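The include/exclude globs and the size limit interact, so a small Python sketch may help when choosing values. It shows one plausible way to combine them and is not Gittxt's actual logic: the function `should_scan` is hypothetical, and the rule that exclusions win over inclusions is an assumption.

```python
from fnmatch import fnmatch


def should_scan(path, size, include=(), exclude=(), size_limit=None):
    """Decide whether a file passes include/exclude globs and a size cap.

    Assumption: exclude patterns take priority over include patterns.
    """
    if size_limit is not None and size > size_limit:
        return False
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False
    # An empty include list means "accept everything not excluded".
    return not include or any(fnmatch(path, pattern) for pattern in include)
```

Note that `fnmatch` lets `*` match across `/`, so `*.py` matches `src/app.py` in this sketch.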
Reverse Engineer Command
gittxt re [OPTIONS] REPORT_FILE
Reconstruct original files and structure from Gittxt .txt, .md, or .json reports. Outputs a ZIP with recovered content.
Options
| Option | Description |
|---|---|
| -o, --output-dir | Custom output directory for reconstructed files |
Example Usage
gittxt re path/to/report.txt
This will:
- Take a Gittxt-generated report (.txt, .md, or .json)
- Reconstruct the original file structure as a ZIP archive
- Save the ZIP to the specified output directory or the current directory by default
Learn more → Reverse Engineering Guide
Output Formats
Each scan produces structured outputs:
<output_dir>/
├── text/    # .txt
├── json/    # .json
├── md/      # .md
├── zips/    # .zip (optional)
│   └── manifest.json, summary.json, outputs/, assets/
See Formats Guide
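As an illustration of what a bundle with a manifest can look like, the following Python sketch writes files under `outputs/` and adds a `manifest.json` to a ZIP. The manifest schema (size and SHA-256 per file) and the function name `bundle_outputs` are assumptions. Gittxt's real manifest layout is described in the Formats Guide and may differ.

```python
import hashlib
import json
import zipfile


def bundle_outputs(zip_path, outputs):
    """Write `outputs` (archive name -> bytes) plus a manifest.json into a ZIP.

    Hypothetical schema: each manifest entry records size and SHA-256.
    """
    manifest = {
        name: {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}
        for name, data in outputs.items()
    }
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in outputs.items():
            archive.writestr(f"outputs/{name}", data)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest
```

Recording a hash per file lets a downstream pipeline verify that a bundle has not changed since it was produced.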
How It Works
- Clone repo (local or GitHub, with branch/subdir support)
- Walk repo with filtering and MIME rules
- Classify TEXTUAL vs NON-TEXTUAL
- Format output to .txt, .json, .md
- Bundle ZIP with summary + manifest (optional)
- Clean temp state after scan
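The classification step above can be pictured with a short Python sketch that layers extension, MIME type, and content checks. This is an illustration under stated assumptions, not Gittxt's classifier: the extension sets, the function name `classify`, and the order of the checks are all hypothetical.

```python
import mimetypes
from pathlib import Path

# Illustrative extension sets; Gittxt's real defaults are configurable.
TEXT_EXTENSIONS = {".py", ".md", ".txt", ".json"}
BINARY_EXTENSIONS = {".png", ".zip"}


def classify(name, sample):
    """Label a file TEXTUAL or NON-TEXTUAL from its name and first bytes."""
    suffix = Path(name).suffix.lower()
    if suffix in BINARY_EXTENSIONS:
        return "NON-TEXTUAL"
    if suffix in TEXT_EXTENSIONS:
        return "TEXTUAL"
    mime, _ = mimetypes.guess_type(name)
    if mime and mime.startswith("text/"):
        return "TEXTUAL"
    # Content heuristic: NUL bytes or invalid UTF-8 suggest binary data.
    if b"\x00" in sample:
        return "NON-TEXTUAL"
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return "NON-TEXTUAL"
    return "TEXTUAL"
```

The content check only runs when the name is inconclusive, which keeps the common case cheap. A sample cut in the middle of a multi-byte character would be misread as binary, so a production version would need to handle that edge.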
Gittxt Installer
Run the interactive installer to configure Gittxt preferences:
gittxt config install
This command lets you:
- Set default output directory and formats (txt/json/md)
- Configure log level (DEBUG, INFO, WARNING, ERROR)
- Enable or disable automatic ZIP bundling
- Define or override:
  - Textual extensions (e.g. .py, .md)
  - Non-textual extensions (e.g. .png, .zip)
  - Excluded directories (e.g. .git, node_modules)
The config is saved to gittxt-config.json and used as default for all scans.
Configuration
- CLI flags (e.g., --output-dir, --size-limit)
- Environment variables (e.g., GITTXT_OUTPUT_DIR)
- .gittxtignore file support for exclusions
Config details → docs/CONFIGURATION.md
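When a setting can come from several of these sources, something has to decide which one wins. The Python sketch below shows one common ordering: CLI flag first, then the `GITTXT_OUTPUT_DIR` environment variable, then a fallback. This README does not state Gittxt's precedence, so the ordering, the function name `resolve_output_dir`, and the default value `gittxt_output` are all assumptions. Check docs/CONFIGURATION.md for the real behavior.

```python
import os


def resolve_output_dir(cli_value=None, env=None, default="gittxt_output"):
    """Pick the output directory: CLI flag, then environment, then default.

    Assumed precedence for illustration only; `default` is a made-up value.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    return env.get("GITTXT_OUTPUT_DIR") or default
```

For example, `resolve_output_dir("out")` returns the explicit value, while `resolve_output_dir()` consults the environment before falling back to the default.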
Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines
Contributing
We welcome contributions from the community!
Roadmap
- ✅ Async file scanning
- ✅ ZIP archive export with manifest
- ✅ Lite mode output
- ⏳ AI-powered summaries (GPT, Claude)
- ⏳ YAML + CSV output support
- ⏳ Web UI via FastAPI
License
MIT License © Sandeep Paidipati
Gittxt: Get text from Git repositories in AI-ready formats.
File details
Details for the file gittxt-1.7.3.tar.gz.
File metadata
- Download URL: gittxt-1.7.3.tar.gz
- Upload date:
- Size: 34.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 71f8f00f28f1f80a4f716afa863bca4e57ecdfa8a406ac645701b82075d8e4e6 |
| MD5 | a16ac75ebed5a4c0ae49f78070c14573 |
| BLAKE2b-256 | 6c08d2e72db4b0d04c791fa6ba64a76f46ae91ffff22e105c7b7ccc1fafbe5dc |
File details
Details for the file gittxt-1.7.3-py3-none-any.whl.
File metadata
- Download URL: gittxt-1.7.3-py3-none-any.whl
- Upload date:
- Size: 46.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dbf6ee9ecd7be9a869b235194dee28641cd012b4754ba904f175007207f38c2b |
| MD5 | 61e7adc90bb1407cbdd892e06437b4a3 |
| BLAKE2b-256 | 00497569d6ce0a409e8eaf35109bc57730736d6041ae1fdc5414821956046527 |