Gittxt: Get text from Git repositories in AI-ready formats
LLM Dataset Extractor from GitHub Repos | AI & NLP-ready text pipelines
What is Gittxt?
Gittxt is a developer-focused CLI tool that extracts AI-ready text from Git repositories. Whether you're preparing datasets for AI models, NLP pipelines, or LLM fine-tuning, Gittxt automates the tedious task of repository scanning and text conversion.
Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:
- Preparing training data for LLMs (e.g., ChatGPT, Claude, Mistral)
- Documentation extraction for knowledge bases
- Code summarization pipelines
- Repository analysis for machine learning workflows
Features
- ✅ Dynamic File-Type Filtering (based on extension + MIME + content heuristics)
- ✅ Smart Directory Tree Summaries with configurable depth and excludes
- ✅ Multiple Output Formats: `.txt`, `.json`, `.md`, `.zip`
- ✅ Lite Mode (`--lite`) for fast, minimal reports
- ✅ ZIP Bundling with `--zip`, including `summary.json` and assets
- ✅ Rich Summary Tables with size, tokens, and file breakdowns
- ✅ `.gittxtignore` support for per-repo custom exclusion
- ✅ Async I/O and CLI Progress Bars for performance and UX
Installation
Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install
Using pip (stable)
pip install gittxt
Quickstart Example
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite
This will:
- Scan the repository root
- Output `.txt` + `.json` summary files
- Bundle them in a ZIP
For more real-world usage, see Usage Examples.
CLI Usage
gittxt scan [REPOS]... [OPTIONS]
Common Flags
| Option | Description |
|---|---|
| `--include-patterns` | Glob to include (e.g., `*.py`, `docs/**/*.md`) |
| `--exclude-patterns` | Glob to exclude (e.g., `tests/`, `*.zip`) |
| `--size-limit` | Skip files larger than N bytes |
| `--branch` | Use a specific branch for remote repos |
| `--zip` | Create a bundled ZIP archive |
| `--lite` | Minimal output without full content |
| `--output-dir` | Where to write outputs |
| `--output-format` | `txt`, `json`, `md`, or comma-separated list |
Run gittxt scan --help for the full CLI reference.
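When scans are driven from a script rather than typed by hand, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of Gittxt: `build_scan_command` is a hypothetical helper that uses only flags listed in the table above. It leaves out `--include-patterns` and `--exclude-patterns` because this README does not say how multiple patterns are passed.

```python
import shlex
from typing import Optional, Sequence


def build_scan_command(
    repos: Sequence[str],
    output_formats: Sequence[str] = ("txt",),
    output_dir: Optional[str] = None,
    branch: Optional[str] = None,
    size_limit: Optional[int] = None,
    make_zip: bool = False,
    lite: bool = False,
) -> list[str]:
    """Assemble an argv list for `gittxt scan` (hypothetical helper)."""
    if not repos:
        raise ValueError("at least one repository path or URL is required")
    cmd = ["gittxt", "scan", *repos]
    # Formats are joined with commas, as in the quickstart example.
    cmd += ["--output-format", ",".join(output_formats)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if branch is not None:
        cmd += ["--branch", branch]
    if size_limit is not None:
        cmd += ["--size-limit", str(size_limit)]
    if make_zip:
        cmd.append("--zip")
    if lite:
        cmd.append("--lite")
    return cmd


cmd = build_scan_command(
    ["https://github.com/sandy-sp/gittxt.git"],
    output_formats=("txt", "json"),
    make_zip=True,
    lite=True,
)
print(shlex.join(cmd))
```

With Gittxt installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list instead of a single string avoids shell quoting problems with unusual repository paths.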
Output Formats
Each scan produces structured outputs:
<output_dir>/
├── text/   # .txt
├── json/   # .json
├── md/     # .md
└── zips/   # .zip (optional)
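A downstream pipeline usually needs to locate the generated files after a scan. Here is a small sketch that walks the layout shown above and groups files by subdirectory. It assumes only the directory names and extensions from that tree; the file names inside each folder are not specified in this README, so the helper (`collect_outputs`, a hypothetical name) does not rely on them.

```python
from pathlib import Path

# Subdirectory -> expected extension, taken from the layout shown above.
OUTPUT_LAYOUT = {"text": ".txt", "json": ".json", "md": ".md", "zips": ".zip"}


def collect_outputs(output_dir: str) -> dict[str, list[Path]]:
    """Return the files found in each output subdirectory, sorted by path."""
    root = Path(output_dir)
    found: dict[str, list[Path]] = {}
    for subdir, suffix in OUTPUT_LAYOUT.items():
        folder = root / subdir
        if not folder.is_dir():
            # A folder may be missing, e.g. zips/ when no ZIP was requested.
            found[subdir] = []
            continue
        found[subdir] = sorted(p for p in folder.iterdir() if p.suffix == suffix)
    return found
```

Calling `collect_outputs("<output_dir>")` returns a dictionary keyed by `text`, `json`, `md`, and `zips`, with an empty list for any folder that was not produced.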
How It Works
- Clone repo (supports GitHub, local, subdirs)
- Walk files with exclusion rules and MIME checks
- Classify files as TEXTUAL or NON-TEXTUAL
- Format text files to `.txt`, `.json`, `.md`
- Zip outputs and assets (optional)
- Remove temp files (stateless design)
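The walk and classify steps above can be illustrated with a short sketch. This is not Gittxt's implementation: the function names are hypothetical, the MIME check uses Python's standard `mimetypes` module, and the content heuristic (no NUL bytes, valid UTF-8 in the first few kilobytes) is an assumption about what such a check might look like. Exclusion uses `fnmatch`, whose `*` also matches `/`, so it is looser than real glob or `.gitignore` matching.

```python
import codecs
import fnmatch
import mimetypes
import os
from typing import Iterator, Optional, Sequence


def is_textual(path: str, sample_size: int = 4096) -> bool:
    """Guess by MIME type first, then fall back to inspecting the first bytes."""
    mime, _ = mimetypes.guess_type(path)
    if mime is not None and mime.startswith("text/"):
        return True
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    if b"\x00" in sample:  # NUL bytes almost always indicate binary data
        return False
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def walk_textual(
    root: str,
    exclude_patterns: Sequence[str] = (),
    size_limit: Optional[int] = None,
) -> Iterator[tuple[str, bool]]:
    """Yield (relative_path, is_textual) for files that pass the filters."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Never descend into Git metadata; sort for a stable order.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if any(fnmatch.fnmatch(rel, pat) for pat in exclude_patterns):
                continue
            if size_limit is not None and os.path.getsize(full) > size_limit:
                continue
            yield rel, is_textual(full)
```

For example, `dict(walk_textual("my_repo", exclude_patterns=["tests/*"], size_limit=1_000_000))` maps each remaining relative path to `True` (textual) or `False` (non-textual).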
Running Tests
make test
- Generates a test repo with multiple edge cases
- Runs full suite with Pytest
- Cleans up outputs
Test docs: tests/README.md
Configuration
- Override via CLI flags
- Or set env vars like `GITTXT_OUTPUT_DIR`
- `.gittxtignore` works like `.gitignore`
Advanced setup: docs/CONFIGURATION.md
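To make the interaction between CLI flags and environment variables concrete, here is a hedged sketch of one common precedence scheme: an explicit CLI value first, then `GITTXT_OUTPUT_DIR`, then a default. The precedence order, the helper name `resolve_output_dir`, and the default directory name are all assumptions for illustration; docs/CONFIGURATION.md is the authority on Gittxt's actual behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Placeholder default for this sketch; not taken from Gittxt.
DEFAULT_OUTPUT_DIR = "gittxt_output"


def resolve_output_dir(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the output directory: CLI value, then env var, then default."""
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = env.get("GITTXT_OUTPUT_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return Path(DEFAULT_OUTPUT_DIR)
```

Passing the environment as a parameter keeps the function easy to test without modifying `os.environ`.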
Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
View Security Policy
Contributing
We welcome community contributions!
Roadmap
- ✅ Async file scanning
- ✅ ZIP archive export with manifest
- ✅ Lite mode output
- ⏳ AI-powered summaries (GPT, Claude)
- ⏳ YAML + CSV output support
- ⏳ Web UI via FastAPI
License
MIT License © Sandeep Paidipati
Gittxt: Get text from Git repositories in AI-ready formats.
File details
Details for the file gittxt-1.5.9.tar.gz.
File metadata
- Download URL: gittxt-1.5.9.tar.gz
- Upload date:
- Size: 29.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 87bea4eede25f9f896245bf25f199f690084a5fbcc0bd7ef0d433dfb76f9aeaa |
| MD5 | cca5d9bbfc25b08af5ccee96f1481752 |
| BLAKE2b-256 | ceb5d9c7597fef50db81d20a94cb0054d4ce4fa76a867fb7e07529d153286372 |
File details
Details for the file gittxt-1.5.9-py3-none-any.whl.
File metadata
- Download URL: gittxt-1.5.9-py3-none-any.whl
- Upload date:
- Size: 41.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 123edc06b7cac2cde82d99dc5463d1418410988fe3c2279a44989dac50eb3d4e |
| MD5 | 97ad72df3e5f61a368f5b9ebd8a381a0 |
| BLAKE2b-256 | c36bac9929e8c4817a789f3c9177e21313a2882c7224bb1e787f4fcc4b325b0f |