
Build a GitLab instance project index and search repositories for sensitive keywords (API-only, no cloning).


GitlabHarvester — Global GitLab Code & Secret Search Tool (Python)


GitlabHarvester is a fast, scalable tool for searching keywords across an entire GitLab instance using the API — without cloning repositories. Built for security audits, secret discovery, compliance checks, and large-scale code intelligence across thousands of projects.

Global term search across a full GitLab instance — especially valuable for GitLab CE environments.


⚡ Quick Start

Search a keyword:

gitlab-harvester -u https://gitlab.example.com -t $TOKEN --search password

Search from file:

gitlab-harvester -u https://gitlab.example.com -t $TOKEN --terms-file words.txt

Build project index only:

gitlab-harvester -u https://gitlab.example.com -t $TOKEN -m dump-index

Deduplicate results:

gitlab-harvester -m dedup --input-file session.jsonl --output-file clean.jsonl

Convert JSONL → JSON:

gitlab-harvester -m convert --input-file session.jsonl --output-file result.json

🚀 Overview

GitLab Community Edition (CE) lacks the full instance-wide code search available in Enterprise Edition (EE). GitlabHarvester fills this gap by:

  • building a lightweight instance project index
  • scanning repositories via API
  • streaming results in JSONL
  • supporting resumable sessions
  • keeping memory usage constant

Designed to operate efficiently on environments with 10k–100k repositories.


🔍 Key Advantages

Problem                      Solution
No global search             Instance-wide scan
Cloning thousands of repos   API-only scanning
Large instances              Streaming architecture
Repeated audits              Cached project index

✨ Features

  • Instance-wide keyword search
  • No repository cloning
  • JSONL project index
  • Branch scanning strategies
  • Smart fork analysis
  • Resume interrupted scans
  • Streaming output
  • Low memory footprint
  • Automation-friendly
  • Built-in post-processing tools

📦 Installation

Recommended — install from PyPI

pipx install gitlab-harvester

Run:

gitlab-harvester --help

Alternative — pip

pip install gitlab-harvester

Development install

git clone https://github.com/Cur1iosity/GitlabHarvester.git
cd GitlabHarvester
pip install .

Editable mode:

pip install -e .

Install latest dev version

pipx install git+https://github.com/Cur1iosity/GitlabHarvester.git

Requirements

  • Python 3.10+
  • GitLab access token with the read_api scope

🌿 Branch Control

Two independent controls:

  • --index-branches: branches stored in the project index
  • --scan-branches: branches scanned for matches

Example:

gitlab-harvester -u ... -t ... --scan-branches 10

Store all + scan all:

gitlab-harvester -u ... -t ... --index-branches all --scan-branches all

Shortcut:

--branches N

🍴 Fork Strategies

--forks skip|include|branch-diff|all-branches

Recommended → branch-diff

Mode          Behavior
skip          ignore forks
include       treat as normal repos
branch-diff   scan default + unique branches
all-branches  full exhaustive scan
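To make the branch-diff idea concrete, here is a minimal sketch. It assumes that "unique branches" means branch names present in the fork but absent from its parent; the tool may compare branches differently (for example by head commit). The function name `branches_to_scan` is hypothetical.

```python
def branches_to_scan(fork_branches: list[str], fork_default: str,
                     parent_branches: list[str]) -> list[str]:
    """Return the fork's default branch plus branches the parent lacks.

    Assumption: uniqueness is decided by branch name only.
    """
    unique = sorted(set(fork_branches) - set(parent_branches))
    # Default branch first, then fork-only branches without duplicates.
    return [fork_default] + [b for b in unique if b != fork_default]


if __name__ == "__main__":
    plan = branches_to_scan(["main", "dev", "feature-x"], "main", ["main", "dev"])
    print(plan)
```

Under this interpretation a fork that only mirrors its parent costs a single branch scan instead of one scan per branch.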

💾 Sessions & Resume

Create session:

gitlab-harvester -u ... -t ... --terms-file words.txt --session audit

Resume:

gitlab-harvester -u ... -t ... --session-file audit.jsonl --resume
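Resuming from a JSONL session file can be pictured roughly as follows. The record layout used here (`"type": "checkpoint"` with a `"project_id"` field) is purely hypothetical, since the actual session schema is not described in this README; the sketch only shows the general technique of replaying checkpoints to skip finished work.

```python
import json
from typing import Iterable


def completed_projects(lines: Iterable[str]) -> set[int]:
    """Collect project ids that a previous run marked as finished.

    Assumption: checkpoints look like {"type": "checkpoint", "project_id": N}.
    """
    done: set[int] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # An interrupted run may leave a truncated final line; skip it.
            continue
        if isinstance(record, dict) and record.get("type") == "checkpoint":
            project_id = record.get("project_id")
            if isinstance(project_id, int):
                done.add(project_id)
    return done
```

A resumed scan would then skip any project whose id is in the returned set and append new records to the same file.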

📊 Output

Two file types:

File           Purpose
Project index  cached project metadata
Session file   hits + checkpoints

Format → JSONL (streaming-friendly)


🧰 Post-Processing Modes

GitlabHarvester includes built-in post-processing utilities.

Deduplicate results

gitlab-harvester -m dedup \
  --input-file session.jsonl \
  --output-file clean.jsonl

Options:

  • --sqlite-path file.sqlite
  • --hash-algo blake2b|sha1|sha256
  • --no-normalize-hits

Convert JSONL → JSON

gitlab-harvester -m convert \
  --input-file session.jsonl \
  --output-file result.json

Pretty print:

jq . result.json > formatted.json
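For readers who want to see what a JSONL to JSON conversion involves, the following sketch writes a JSON array one record at a time, so the whole input never has to sit in memory. It is an illustrative stand-in, not the code behind `-m convert`; the function name `jsonl_to_json_array` is hypothetical.

```python
import io
import json
from typing import TextIO


def jsonl_to_json_array(src: TextIO, dst: TextIO) -> None:
    """Stream JSONL records from src into a single JSON array in dst."""
    dst.write("[")
    first = True
    for line in src:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not first:
            dst.write(",")
        dst.write("\n  " + json.dumps(record))
        first = False
    dst.write("\n]\n")


if __name__ == "__main__":
    out = io.StringIO()
    jsonl_to_json_array(io.StringIO('{"a": 1}\n{"b": 2}\n'), out)
    print(out.getvalue())
```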

🏗 Architecture

GitLab API
   ↓
Indexer
   ↓
Branch planner
   ↓
Matcher
   ↓
JSONL stream

Constant memory usage regardless of instance size.
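One way to picture this pipeline is as a chain of Python generators, where each stage pulls one item at a time from the previous stage. The sketch below is a simplified model under stated assumptions: all function names, the record fields, and the injected `search` callable are hypothetical, and no real API calls are made.

```python
import json
from typing import Callable, Iterable, Iterator


def index_projects(pages: Iterable[list[dict]]) -> Iterator[dict]:
    """Indexer: flatten paginated API responses into a project stream."""
    for page in pages:
        yield from page


def plan_branches(projects: Iterable[dict], limit: int) -> Iterator[tuple[int, str]]:
    """Branch planner: emit at most `limit` branches per project."""
    for project in projects:
        for branch in project["branches"][:limit]:
            yield project["id"], branch


def match(targets: Iterable[tuple[int, str]], terms: list[str],
          search: Callable[[int, str, str], list[str]]) -> Iterator[dict]:
    """Matcher: run every term against every (project, branch) target."""
    for project_id, branch in targets:
        for term in terms:
            for hit in search(project_id, branch, term):
                yield {"project_id": project_id, "branch": branch,
                       "term": term, "hit": hit}


def to_jsonl(records: Iterable[dict]) -> Iterator[str]:
    """JSONL stream: serialize one record per line."""
    for record in records:
        yield json.dumps(record)
```

Since no stage collects its input into a list, peak memory depends on the size of a single page or record rather than on the number of projects.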


🎯 Typical Use Cases

  • secret discovery
  • credential leak detection
  • internal audits
  • red team / pentest reconnaissance
  • DevSecOps validation
  • large-scale code search

🔐 Security Notice

Use only on GitLab instances where you are authorized to perform scanning.


🤝 Contributing

Pull requests and ideas are welcome.


📜 License

MIT
