Build a GitLab instance project index (JSONL) and search repositories for sensitive keywords.

Maintainers

Cur1iosity

GitlabHarvester

Global term search across an entire GitLab instance — especially useful for GitLab CE.

GitLab Community Edition does not provide instance‑wide code search the way GitLab EE can.
GitlabHarvester fills this gap: it builds a lightweight Instance Project Index (JSONL/NDJSON) and performs term search across repositories without cloning them.

The tool is conceptually similar to utilities like gitlab-finder (Node.js), but implemented in modern Python with streaming output, branch planning and resumable sessions.

Why this tool matters

GitLab CE → no global code search
Web UI search → limited and unreliable
Cloning thousands of repos → slow & disk heavy

GitlabHarvester lets you search the whole instance using only the API.

Features

✅ Instance‑wide keyword search for GitLab CE
✅ No cloning required — API based
✅ Project Index (JSONL/NDJSON) for repeatable runs
✅ Branch strategies:
- default — scan only default branch (fast)
- all — scan all indexed branches
- N — scan up to N branches
✅ Fork strategies (explained below)
✅ Session output + resume
✅ Low memory footprint

Requirements

Python 3.11+
GitLab token with the read_api scope

Installation

Using pipx (recommended)

From a local clone:

git clone https://github.com/Cur1iosity/GitlabHarvester.git
cd GitlabHarvester
pipx install .

Or directly from GitHub, without cloning:

pipx install git+https://github.com/Cur1iosity/GitlabHarvester.git

After that you can run the tool directly:

gitlab-harvester --help

Classic pip install

From a local clone:

git clone https://github.com/Cur1iosity/GitlabHarvester.git
cd GitlabHarvester
pip install .

Or directly from GitHub, without cloning:

pip install git+https://github.com/Cur1iosity/GitlabHarvester.git

Quick Start (the index builds automatically)

You do not need to build the project index manually.
When you run a search, the index is created on the fly if it does not exist.
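The "build on first use" behaviour can be pictured roughly as follows. This is a minimal sketch, not the tool's actual code: the function name, the `build` callback and the entry fields are assumptions, and the `rebuild` flag only mirrors the idea behind `--dump-projects`.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_or_build_index(path: Path, build: Callable[[], Iterable[dict]], rebuild: bool = False) -> list[dict]:
    """Return index entries, writing the JSONL file first if it is missing.

    Hypothetical helper; `build` stands in for whatever fetches projects from the API.
    """
    if rebuild or not path.exists():
        with path.open("w", encoding="utf-8") as fh:
            for entry in build():
                fh.write(json.dumps(entry) + "\n")  # one JSON object per line
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

For brevity the sketch returns a list; a low-memory implementation would more likely iterate over the file line by line instead.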

Search a single keyword

gitlab-harvester -H https://gitlab.example.com -t $TOKEN --search "password"

Search using a file with keywords

gitlab-harvester -H https://gitlab.example.com -t $TOKEN --terms-file keywords.txt

Build only the index (optional)

This step is useful only if you want to prepare the index in advance:

gitlab-harvester -H https://gitlab.example.com -t $TOKEN --dump-only

Branch control

There are two independent controls:

--index-branches — what branches are stored in the index
--scan-branches — what branches are actually scanned

Examples

# Index only default branches, but scan up to 10
gitlab-harvester -H ... -t ... --scan-branches 10

# Store all branches and scan all
gitlab-harvester -H ... -t ... --index-branches all --scan-branches all

Shorthand:

gitlab-harvester -H ... -t ... --branches 10

Fork strategies (important)

--forks skip|include|branch-diff|all-branches

What they mean

skip
Forked projects are ignored entirely.
Good when forks are mostly duplicates and noise.

include
Forks are treated like normal projects.
Simple and predictable, but may rescan identical branches.

branch-diff (recommended)
Smart mode:
- always scans the fork's default branch
- scans base branches (main, master, develop, dev)
- scans only branches unique to the fork compared to upstream
This gives the best signal-to-noise ratio.

all-branches
Scans every branch of every fork. The most exhaustive and the slowest option.

Example

gitlab-harvester -H ... -t ... \
  --terms-file keywords.txt \
  --forks branch-diff \
  --fork-diff-bases main,master,develop,dev
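The branch-diff rules can be sketched as a simple selection over branch names. This is an illustration under assumptions, not the tool's implementation: `branch_diff_plan` is a hypothetical name, and "unique to the fork" is interpreted here as "a branch name that does not exist upstream", which ignores branches that share a name but have diverged.

```python
DEFAULT_BASES = ("main", "master", "develop", "dev")


def branch_diff_plan(
    fork_default: str,
    fork_branches: list[str],
    upstream_branches: list[str],
    bases: tuple[str, ...] = DEFAULT_BASES,
) -> list[str]:
    """Choose fork branches to scan: default, base branches, and fork-only branches.

    Hypothetical helper; compares branches by name only.
    """
    upstream = set(upstream_branches)
    plan = [fork_default]  # the fork's default branch is always scanned
    for name in fork_branches:
        if name in plan:
            continue
        if name in bases or name not in upstream:
            plan.append(name)
    return plan


print(branch_diff_plan("main", ["main", "dev", "feature/a", "hotfix"], ["main", "dev", "feature/a"]))
```

In this example `feature/a` is skipped because upstream already has a branch with that name, while `hotfix` is kept because it exists only in the fork.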

Session & resume

Results are written to JSONL session files.

gitlab-harvester -H ... -t ... --terms-file keywords.txt --session audit_run

Resume:

gitlab-harvester -H ... -t ... --terms-file keywords.txt --session-file audit_run.jsonl --resume

Output

Project Index (JSONL) — metadata + project entries
Session file (JSONL) — hits + checkpoints
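Because the session file holds both hits and checkpoints, resuming can be thought of as "skip whatever already has a checkpoint". The sketch below shows that idea only; the record layout (`type`, `project_id`, `branch`) is an assumption made for the example and may not match the real session format.

```python
import json
from pathlib import Path


def completed_units(session_path: Path) -> set[tuple[int, str]]:
    """Collect (project_id, branch) pairs that already have a checkpoint record.

    The record fields used here are assumed for illustration.
    """
    done: set[tuple[int, str]] = set()
    if not session_path.exists():
        return done
    with session_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated last line from an interrupted run
            if record.get("type") == "checkpoint":
                done.add((record["project_id"], record["branch"]))
    return done


def remaining(plan: list[tuple[int, str]], done: set[tuple[int, str]]) -> list[tuple[int, str]]:
    """Keep only the planned units that have not been checkpointed yet."""
    return [unit for unit in plan if unit not in done]
```

Appending one JSON object per line is what makes this cheap: an interrupted run loses at most the last partial line.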

Usage

gitlab-harvester --help

usage: gitlab-harvester [-h] -H HOST -t TOKEN [-bs BATCH_SIZE] [--index-file INDEX_FILE] [--dump-projects] [--dump-only] [-b BRANCHES] [--index-branches INDEX_BRANCHES] [--scan-branches SCAN_BRANCHES]
                    [--branches-per-page BRANCHES_PER_PAGE] [--forks {skip,include,branch-diff,all-branches}] [--fork-diff-bases FORK_DIFF_BASES] [-s SEARCH | -f TERMS_FILE] [--session SESSION |
                    --session-file SESSION_FILE] [-o OUTPUT] [--resume]

Collect and use an Instance Project Index from a GitLab instance.

options:
  -h, --help            show this help message and exit
  -H, --host HOST       GitLab host (e.g., gitlab.example.com).
  -t, --token TOKEN     GitLab token with read_api permissions.
  -bs, --batch-size BATCH_SIZE
                        Projects per page for GitLab API requests (default: 100).
  --index-file INDEX_FILE
                        Path to Instance Project Index file (JSONL/NDJSON). Defaults to instance-specific name.
  --dump-projects       Rebuild the Instance Project Index even if it already exists.
  --dump-only           Only build the Instance Project Index and exit.
  -b, --branches BRANCHES
                        Shorthand for setting both --index-branches and --scan-branches.
  --index-branches INDEX_BRANCHES
                        Branch depth for building the Project Index: 'default' (store only default branch), 'all' (store all), or N limit.
  --scan-branches SCAN_BRANCHES
                        Branch scope for scanning: omit -> scan default only; 'all' -> scan all branches from index; N -> scan up to N branches (default + N-1).
  --branches-per-page BRANCHES_PER_PAGE
                        Branches per page for GitLab API requests (default: 100).
  --forks {skip,include,branch-diff,all-branches}
                        How to handle forked projects during search: skip (ignore forks), include (treat as regular projects), branch-diff (scan only base + unique branches vs upstream), all-branches (scan
                        every branch of forks).
  --fork-diff-bases FORK_DIFF_BASES
                        Comma-separated list of branch names always scanned in forks when --forks=branch-diff (default: main,master,develop,dev).
  -s, --search SEARCH   Single search term.
  -f, --terms-file TERMS_FILE
                        File with search terms (one per line).
  --session SESSION     Session name for results output (writes <name>.jsonl).
  --session-file SESSION_FILE
                        Explicit path for session results file (JSONL).
  -o, --output OUTPUT   Output file for results (optional).
  --resume              Resume search using an existing session file (if supported).

Useful notes

Deduplicate results (context unique)

Search across forks and mirrors often produces context duplicates — identical file fragments that appear in multiple repositories or branches. Removing them is useful when:

you only need to confirm that a secret or keyword is present,
the same leaked token appears in dozens of forks,
you want to reduce a 1–5 GB session file to a human-reviewable size.

The dedup script keeps only one record per unique content, while preserving the original JSONL structure.

What it does:

hashes normalized search content,
keeps the first occurrence,
drops identical matches from other projects/branches.
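The three steps above can be sketched in a few lines. This is not the contents of scripts/dedup.py: the function name, the `content` field name and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import json
from typing import Iterable, Iterator


def dedup_by_content(lines: Iterable[str], normalize: bool = True, field: str = "content") -> Iterator[str]:
    """Yield JSONL lines, keeping only the first record for each distinct content.

    Hypothetical sketch; assumes each hit stores its matched text as a string in `field`.
    """
    seen: set[str] = set()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        content = record.get(field)
        if content is None:
            yield line.rstrip("\n")  # pass records without content (e.g. checkpoints) through
            continue
        if normalize:
            content = " ".join(content.split())  # collapse runs of whitespace
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield line.rstrip("\n")
```

Storing digests instead of full content keeps the in-memory set small; for very large files the same set could live in an external store, which is presumably what the `--sqlite` option is for.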

Run:

python scripts/dedup.py \
  --input session_20250312.jsonl \
  --output session_20250312_dedup.jsonl

Options:

--no-normalize — treat content strictly (no whitespace normalization)

--sqlite /path/db.sqlite — external store for very large files.

Note that this is deduplication by content, not by location: identical matches are collapsed into a single record regardless of which repository or branch they came from.

Convert JSONL to JSON

Session files are stored as JSONL for streaming and resume support. For manual analysis you may want a single JSON document.

Run:

python scripts/convert_jsonl_to_json.py \
  --input session_20250312_dedup.jsonl \
  --output session_20250312.json

The converter produces compact, minified JSON. For readable formatting, use jq:

jq . session_20250312.json > session_20250312_pretty.json
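For reference, the conversion itself is small. The sketch below is an assumption about the general approach, not the actual scripts/convert_jsonl_to_json.py: it wraps all records in a single JSON array and writes them without extra whitespace.

```python
import json
from pathlib import Path


def jsonl_to_json(src: Path, dst: Path) -> int:
    """Convert a JSONL file into one compact JSON array; return the record count.

    Hypothetical helper; the output shape (a top-level array) is assumed.
    """
    with src.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    with dst.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, separators=(",", ":"))  # compact output, no extra whitespace
    return len(records)
```

This loads every record into memory, which is fine after deduplication but may be a poor fit for multi-gigabyte session files.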

Why convert:

easier browsing in editors,
compatibility with SIEM/ETL tools,
convenient diff between sessions.

Security note

Use only on GitLab instances where you have authorization.

License

MIT

Release history

0.2.14 (Feb 24, 2026)
0.2.13 (Feb 20, 2026)
0.2.12 (Feb 20, 2026)
0.2.11 (Feb 20, 2026)
0.2.10 (Feb 19, 2026)
0.2.9 (Feb 19, 2026)
0.2.8 (Feb 19, 2026)
0.2.7 (Feb 19, 2026)
0.2.6 (Feb 18, 2026)
0.2.5 (Feb 17, 2026)
0.1.6 (Feb 17, 2026)
0.1.5 (Feb 16, 2026)
0.1.3 (Feb 16, 2026)
0.1.1 (Feb 16, 2026), this version
0.1.0 (Feb 16, 2026)

Download files

Source Distribution

gitlab_harvester-0.1.1.tar.gz (24.1 kB)

Uploaded Feb 16, 2026 Source

Built Distribution

gitlab_harvester-0.1.1-py3-none-any.whl (26.4 kB)

Uploaded Feb 16, 2026 Python 3

File details

Details for the file gitlab_harvester-0.1.1.tar.gz.

File metadata

Download URL: gitlab_harvester-0.1.1.tar.gz
Upload date: Feb 16, 2026
Size: 24.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for gitlab_harvester-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d412863f247070f19fe7762f0dfb48ce1fd2a17f6f6bc2b42f00357bb8a0fbdd`
MD5	`0b265defab89a78dbf9564a58d211d77`
BLAKE2b-256	`8f4218273ce395f391fd5a01f132c160fa9579541ad7bc86a8014da23d01dae0`

Provenance

The following attestation bundles were made for gitlab_harvester-0.1.1.tar.gz:

Publisher: python-publish.yml on Cur1iosity/GitlabHarvester

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gitlab_harvester-0.1.1.tar.gz
- Subject digest: d412863f247070f19fe7762f0dfb48ce1fd2a17f6f6bc2b42f00357bb8a0fbdd
- Sigstore transparency entry: 955861135
- Sigstore integration time: Feb 16, 2026
Source repository:
- Permalink: Cur1iosity/GitlabHarvester@af7c1a3e0e6ded610444f9a3e39ccd25bb6f9d70
- Branch / Tag: refs/tags/0.1.1
- Owner: https://github.com/Cur1iosity
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@af7c1a3e0e6ded610444f9a3e39ccd25bb6f9d70
- Trigger Event: release

File details

Details for the file gitlab_harvester-0.1.1-py3-none-any.whl.

File metadata

Download URL: gitlab_harvester-0.1.1-py3-none-any.whl
Upload date: Feb 16, 2026
Size: 26.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for gitlab_harvester-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0920ffa07665ca7990b7725d5bdb7556080560b4dee5cf2bb4567102038829a3`
MD5	`6437835c28ded63c30cd514a23186814`
BLAKE2b-256	`19e8e70f4ddd7dc47941c1f7b6340cbaba45931e89f16657a99f314711f33069`

Provenance

The following attestation bundles were made for gitlab_harvester-0.1.1-py3-none-any.whl:

Publisher: python-publish.yml on Cur1iosity/GitlabHarvester

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gitlab_harvester-0.1.1-py3-none-any.whl
- Subject digest: 0920ffa07665ca7990b7725d5bdb7556080560b4dee5cf2bb4567102038829a3
- Sigstore transparency entry: 955861139
- Sigstore integration time: Feb 16, 2026
Source repository:
- Permalink: Cur1iosity/GitlabHarvester@af7c1a3e0e6ded610444f9a3e39ccd25bb6f9d70
- Branch / Tag: refs/tags/0.1.1
- Owner: https://github.com/Cur1iosity
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@af7c1a3e0e6ded610444f9a3e39ccd25bb6f9d70
- Trigger Event: release
