Gittxt: Get text from Git repositories in AI-ready formats
LLM Dataset Extractor from GitHub Repos | AI & NLP-ready text pipelines
What is Gittxt?
Gittxt is a developer-focused CLI tool that extracts AI-ready text from Git repositories. Whether you're preparing datasets for AI models, NLP pipelines, or LLM fine-tuning, Gittxt automates the tedious task of repository scanning and text conversion.
Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:
- Preparing training data for LLMs (e.g., ChatGPT, Claude, Mistral)
- Documentation extraction for knowledge bases
- Code summarization pipelines
- Repository analysis for machine learning workflows
Features
- Dynamic File-Type Filtering (--file-types=code,docs,images,csv,media,all)
- Automatic Tree Generation with clean filtering (excludes .git/, __pycache__, etc.)
- Multiple Output Formats: TXT, JSON, Markdown
- Optional ZIP Packaging for non-text assets
- CLI-friendly Progress Bars
- Built-in Summary Reports (--summary)
- Interactive & CI-ready Modes (--non-interactive)
Installation
Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install
Using pip (stable)
pip install gittxt
Quickstart Example
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --file-types code,docs --summary
This will:
- Scan a GitHub repository
- Extract code & docs files
- Output .txt + .json summaries
- Show a summary report
CLI Usage
gittxt scan [REPOS]... [OPTIONS]
Options:
--include TEXT Include patterns (e.g., *.py)
--exclude TEXT Exclude patterns (e.g., tests/, node_modules)
--size-limit INTEGER Max file size in bytes
--branch TEXT Specify branch (for GitHub URLs)
--file-types TEXT code, docs, images, csv, media, all
--output-format TEXT txt, json, md, or comma-separated list
--output-dir PATH Custom output directory
--summary Show post-scan summary
--non-interactive Skip prompts for CI/CD workflows
--progress Enable scan progress bars
--debug Enable debug logs
--help Show this message and exit
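When Gittxt is driven from a script or CI job, the options above can be assembled into an argument list before the command is run. The following is a minimal sketch: `build_scan_command` is a hypothetical helper, not part of Gittxt, and it only uses flags listed in the options table.

```python
import shlex


def build_scan_command(repos, output_formats=("txt",), file_types=("code", "docs"),
                       summary=False, non_interactive=False, output_dir=None):
    """Assemble an argv list for `gittxt scan` (hypothetical helper)."""
    cmd = ["gittxt", "scan", *repos]
    # Both options accept comma-separated values, per the options table.
    cmd += ["--output-format", ",".join(output_formats)]
    cmd += ["--file-types", ",".join(file_types)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if summary:
        cmd.append("--summary")
    if non_interactive:
        cmd.append("--non-interactive")
    return cmd


cmd = build_scan_command(
    ["https://github.com/sandy-sp/gittxt.git"],
    output_formats=("txt", "json"),
    summary=True,
    non_interactive=True,
)
print(shlex.join(cmd))

# To actually run the scan (requires gittxt to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with repository URLs and glob patterns.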
Output Structure
<output_dir>/
├── text/
│   └── repo-name.txt
├── json/
│   └── repo-name.json
├── md/
│   └── repo-name.md
└── zips/
    └── repo-name_bundle.zip  # Optional ZIP for assets (images, csv, etc.)
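A downstream script can predict where the outputs will land from the layout above. This is a small sketch, assuming the file names follow the `repo-name` pattern shown in the tree; `expected_outputs` and `FORMAT_DIRS` are illustrative names, not part of Gittxt.

```python
from pathlib import Path

# Output format -> subdirectory, as shown in the tree above.
FORMAT_DIRS = {"txt": "text", "json": "json", "md": "md"}


def expected_outputs(output_dir, repo_name, formats, with_zip=False):
    """Return the paths a scan is expected to write (hypothetical helper)."""
    base = Path(output_dir)
    paths = [base / FORMAT_DIRS[fmt] / f"{repo_name}.{fmt}" for fmt in formats]
    if with_zip:
        paths.append(base / "zips" / f"{repo_name}_bundle.zip")
    return paths


for path in expected_outputs("out", "repo-name", ["txt", "json"], with_zip=True):
    print(path.as_posix())
```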
How It Works
- Clone GitHub/local repo (supports branch/subdir URLs)
- Dynamically generate directory tree (excluding .git, __pycache__, etc.)
- Filter files based on type (code, docs, csv, media)
- Generate formatted outputs (TXT, JSON, MD)
- Package assets (optional ZIP for non-text)
- Clean up temporary files (cache-free design)
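The tree-generation step can be illustrated with a short directory walk that prunes excluded folders. This is a sketch of the general technique using the standard library, not Gittxt's actual implementation; `list_files` and `EXCLUDED_DIRS` are hypothetical names, and the excluded set contains only the two directories named above.

```python
import os
import tempfile
from pathlib import Path

EXCLUDED_DIRS = {".git", "__pycache__"}  # directories named in the steps above


def list_files(root):
    """Return sorted relative file paths under root, skipping excluded directories."""
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            found.append((Path(dirpath) / name).relative_to(root).as_posix())
    return sorted(found)


# Build a throwaway repository-like layout to demonstrate the filtering.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["src/main.py", "README.md", ".git/config",
                "src/__pycache__/main.cpython-312.pyc"]:
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")
    files = list_files(tmp)

print(files)
```

Pruning `dirnames` in place is what keeps the walk from entering `.git` at all, which matters for large repositories.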
Example Summary Output
Summary Report:
- Total files processed: 45
- Output formats: txt, json
- File type breakdown: {'code': 31, 'docs': 14}
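A breakdown like the one above can be produced by counting files per category. The sketch below assumes a simple extension-to-category mapping; `CATEGORY_BY_EXT` and `summarize` are hypothetical, and Gittxt's real classification rules may differ.

```python
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-category mapping, for illustration only.
CATEGORY_BY_EXT = {
    ".py": "code", ".js": "code",
    ".md": "docs", ".rst": "docs",
    ".csv": "csv", ".png": "images",
}


def summarize(paths, output_formats):
    """Count files per category and collect the fields shown in the report."""
    breakdown = Counter(
        CATEGORY_BY_EXT.get(Path(p).suffix.lower(), "other") for p in paths
    )
    return {
        "total_files": sum(breakdown.values()),
        "output_formats": list(output_formats),
        "file_type_breakdown": dict(breakdown),
    }


report = summarize(["a.py", "b.py", "README.md"], ["txt", "json"])
print(report)
```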
Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
View Security Policy
Contributing
We welcome community contributions!
Roadmap
- FastAPI-powered web UI
- AI-powered summaries (GPT/OpenAI integration)
- Support YAML/CSV as additional output formats
- Async file scanning (speed boost)
License
MIT License © Sandeep Paidipati
Gittxt: "Get text from Git repositories in AI-ready formats."
File details
Details for the file gittxt-1.5.0.tar.gz.
File metadata
- Download URL: gittxt-1.5.0.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4dd6f7b97c73416a00fd4aa274f9c6714d87af043c69228b1680a2e040dbe3a6 |
| MD5 | 79597cf0d6c92fa5b181663e93bede44 |
| BLAKE2b-256 | f7a2051e830d42d68d98c3db8cb6523ce8c23280e6607ce8c44499278da4d68f |
File details
Details for the file gittxt-1.5.0-py3-none-any.whl.
File metadata
- Download URL: gittxt-1.5.0-py3-none-any.whl
- Upload date:
- Size: 20.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b6be8bc54555a4acf130616cf9674edf6ad1928e5c099633d6209ffffffe883 |
| MD5 | 476c8a8de8c5901ce4fee92a3302d55c |
| BLAKE2b-256 | 95005e355ab4c8c26a6f3f61b523bbf891566a9ad5ff343d51184555559fc879 |