Build GitLab instance project index (JSONL) and search repositories for sensitive keywords.
GitlabHarvester
Global term search across an entire GitLab instance — especially useful for GitLab CE.
GitLab Community Edition does not provide instance‑wide code search the way GitLab EE can.
GitlabHarvester fills this gap: it builds a lightweight Instance Project Index (JSONL/NDJSON) and performs term search across repositories without cloning them.
The tool is conceptually similar to utilities like gitlab-finder (Node.js), but implemented in modern Python with streaming output, branch planning and resumable sessions.
Why this tool matters
- GitLab CE → no global code search
- Web UI search → limited and unreliable
- Cloning thousands of repos → slow & disk heavy
GitlabHarvester lets you search the whole instance using only the API.
Features
- ✅ Instance‑wide keyword search for GitLab CE
- ✅ No cloning required — API based
- ✅ Project Index (JSONL/NDJSON) for repeatable runs
- ✅ Branch strategies:
  - default: scan only default branch (fast)
  - all: scan all indexed branches
  - N: scan up to N branches
- ✅ Fork strategies (explained below)
- ✅ Session output + resume
- ✅ Low memory footprint
Requirements
- Python 3.11+
- GitLab token with read_api permissions
Installation
Using pipx (recommended)
git clone https://github.com/Cur1iosity/GitlabHarvester.git
cd GitlabHarvester
pipx install .
or
pipx install git+https://github.com/Cur1iosity/GitlabHarvester.git
After that you can run the tool directly:
gitlab-harvester --help
Classic pip install
git clone https://github.com/Cur1iosity/GitlabHarvester.git
cd GitlabHarvester
pip install .
or
pip install git+https://github.com/Cur1iosity/GitlabHarvester.git
Quick Start (the index builds automatically)
You do not need to build the project index manually.
When you run a search, the index is created on the fly if it does not exist.
Search a single keyword
gitlab-harvester -H https://gitlab.example.com -t $TOKEN --search "password"
Search using a file with keywords
gitlab-harvester -H https://gitlab.example.com -t $TOKEN --terms-file keywords.txt
Build only the index (optional)
This step is useful only if you want to prepare the index in advance:
gitlab-harvester -H https://gitlab.example.com -t $TOKEN --dump-only
Branch control
There are two independent controls:
- --index-branches: what branches are stored in the index
- --scan-branches: what branches are actually scanned
Examples
# Index only default branches, but scan up to 10
gitlab-harvester -H ... -t ... --scan-branches 10
# Store all branches and scan all
gitlab-harvester -H ... -t ... --index-branches all --scan-branches all
Shorthand:
gitlab-harvester -H ... -t ... --branches 10
Fork strategies (important)
--forks skip|include|branch-diff|all-branches
What they mean
- skip
  Forked projects are completely ignored.
  Good when forks are mostly duplicates and noise.
- include
  Forks are treated like normal projects.
  Simple and predictable but may rescan identical branches.
- branch-diff (recommended)
  Smart mode:
  - always scans the fork's default branch
  - scans base branches (main, master, develop, dev)
  - scans only branches unique to the fork compared to upstream
  → best signal/noise ratio.
- all-branches
  Scan every branch of every fork; most exhaustive and slowest.
Example
gitlab-harvester -H ... -t ... --terms-file keywords.txt --forks branch-diff --fork-diff-bases main,master,develop,dev
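The branch-diff selection can be pictured as a filter over the fork's branch list. The sketch below is a minimal illustration, assuming branches are compared by name only; `fork_branches_to_scan` is a hypothetical helper, and the tool's actual comparison may differ (for example, it could compare commits).

```python
DEFAULT_BASES = ("main", "master", "develop", "dev")


def fork_branches_to_scan(
    fork_default: str,
    fork_branches: list[str],
    upstream_branches: list[str],
    bases: tuple[str, ...] = DEFAULT_BASES,
) -> list[str]:
    """Pick fork branches: the default, the base branches, and fork-only branches."""
    upstream = set(upstream_branches)
    selected = []
    for name in fork_branches:
        is_default = name == fork_default
        is_base = name in bases
        is_unique = name not in upstream  # name-based comparison (assumption)
        if is_default or is_base or is_unique:
            selected.append(name)
    return selected


print(fork_branches_to_scan(
    "main",
    ["main", "dev", "feature-x", "release"],
    ["main", "dev", "release"],
))
```

In this example `release` is skipped because it also exists upstream and is not a base branch, while `feature-x` is kept because only the fork has it.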
Session & resume
Results are written to JSONL session files.
gitlab-harvester -H ... -t ... --terms-file keywords.txt --session audit_run
Resume:
gitlab-harvester -H ... -t ... --terms-file keywords.txt --session-file audit_run.jsonl --resume
Output
- Project Index (JSONL) — metadata + project entries
- Session file (JSONL) — hits + checkpoints
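Since the session file mixes hits and checkpoints, a resume step can rebuild its progress by reading only the checkpoint records. The sketch below assumes a record layout with `type` and `project_id` fields; both names are hypothetical, as the real session schema is not described here.

```python
import json
from typing import Iterable


def completed_projects(lines: Iterable[str]) -> set[int]:
    """Collect project IDs that a previous run marked as finished."""
    done: set[int] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # An interrupted run may leave a truncated last line; skip it.
            continue
        if isinstance(record, dict) and record.get("type") == "checkpoint":
            done.add(record["project_id"])
    return done


sample = [
    '{"type": "hit", "project_id": 1, "term": "password"}',
    '{"type": "checkpoint", "project_id": 1}',
    '{"type": "hit", "project_id": 2, "te',
]
print(completed_projects(sample))
```

A resumed run would then skip every project whose ID is in the returned set and append new records to the same file.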
Usage
gitlab-harvester --help
usage: gitlab-harvester [-h] -H HOST -t TOKEN [-bs BATCH_SIZE] [--index-file INDEX_FILE] [--dump-projects] [--dump-only] [-b BRANCHES] [--index-branches INDEX_BRANCHES] [--scan-branches SCAN_BRANCHES]
                        [--branches-per-page BRANCHES_PER_PAGE] [--forks {skip,include,branch-diff,all-branches}] [--fork-diff-bases FORK_DIFF_BASES] [-s SEARCH | -f TERMS_FILE] [--session SESSION |
                        --session-file SESSION_FILE] [-o OUTPUT] [--resume]

Collect and use an Instance Project Index from a GitLab instance.

options:
  -h, --help            show this help message and exit
  -H, --host HOST       GitLab host (e.g., gitlab.example.com).
  -t, --token TOKEN     GitLab token with read_api permissions.
  -bs, --batch-size BATCH_SIZE
                        Projects per page for GitLab API requests (default: 100).
  --index-file INDEX_FILE
                        Path to Instance Project Index file (JSONL/NDJSON). Defaults to instance-specific name.
  --dump-projects       Rebuild the Instance Project Index even if it already exists.
  --dump-only           Only build the Instance Project Index and exit.
  -b, --branches BRANCHES
                        Shorthand for setting both --index-branches and --scan-branches.
  --index-branches INDEX_BRANCHES
                        Branch depth for building the Project Index: 'default' (store only default branch), 'all' (store all), or N limit.
  --scan-branches SCAN_BRANCHES
                        Branch scope for scanning: omit -> scan default only; 'all' -> scan all branches from index; N -> scan up to N branches (default + N-1).
  --branches-per-page BRANCHES_PER_PAGE
                        Branches per page for GitLab API requests (default: 100).
  --forks {skip,include,branch-diff,all-branches}
                        How to handle forked projects during search: skip (ignore forks), include (treat as regular projects), branch-diff (scan only base + unique branches vs upstream), all-branches (scan every branch of forks).
  --fork-diff-bases FORK_DIFF_BASES
                        Comma-separated list of branch names always scanned in forks when --forks=branch-diff (default: main,master,develop,dev).
  -s, --search SEARCH   Single search term.
  -f, --terms-file TERMS_FILE
                        File with search terms (one per line).
  --session SESSION     Session name for results output (writes <name>.jsonl).
  --session-file SESSION_FILE
                        Explicit path for session results file (JSONL).
  -o, --output OUTPUT   Output file for results (optional).
  --resume              Resume search using an existing session file (if supported).
Useful notes
Deduplicate results (context unique)
Searching across forks and mirrors often produces context duplicates: identical file fragments that appear in multiple repositories or branches. Removing them is useful when:
- you only need to confirm that a secret or keyword is present,
- the same leaked token appears in dozens of forks,
- you want to reduce a 1–5 GB session file to a human-reviewable size.
The dedup script keeps only one record per unique content, while preserving the original JSONL structure.
What it does:
- hashes normalized search content,
- keeps the first occurrence,
- drops identical matches from other projects/branches.
Run:
python scripts/dedup.py \
--input session_20250312.jsonl \
--output session_20250312_dedup.jsonl
Options:
- --no-normalize: treat content strictly (no whitespace normalization)
- --sqlite /path/db.sqlite: external store for very large files
This deduplicates by content rather than by location: records from different repositories are kept when their content differs, while identical content matches are collapsed into one.
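The content-based approach (hash the normalized content, keep the first occurrence) can be sketched in a few lines. This is not the code of scripts/dedup.py: the `content` field name, the whitespace normalization rule, and the `dedup_by_content` helper are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Iterable, Iterator


def dedup_by_content(lines: Iterable[str], normalize: bool = True, field: str = "content") -> Iterator[str]:
    """Yield JSONL lines, keeping only the first record per unique content."""
    seen: set[str] = set()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        content = record.get(field) if isinstance(record, dict) else None
        if not isinstance(content, str):
            # Records without matched content (e.g. metadata) pass through.
            yield line.rstrip("\n")
            continue
        if normalize:
            # Collapse all whitespace runs so formatting differences do not matter.
            content = " ".join(content.split())
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line.rstrip("\n")


rows = [
    '{"project": "a", "content": "token = 1"}',
    '{"project": "b", "content": "token  =   1"}',
    '{"project": "c", "content": "other"}',
]
print(list(dedup_by_content(rows)))
```

The in-memory `seen` set grows with the number of unique fragments, which is presumably why an external store such as SQLite becomes useful for very large files.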
Convert JSONL to JSON
Session files are stored as JSONL for streaming and resume support. For manual analysis you may want a single JSON document.
Run:
python scripts/convert_jsonl_to_json.py \
--input session_20250312_dedup.jsonl \
--output session_20250312.json
The converter produces compact, minified JSON. For readable formatting, use jq:
jq . session_20250312.json > session_20250312_pretty.json
Why convert:
- easier browsing in editors,
- compatibility with SIEM/ETL tools,
- convenient diff between sessions.
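Conceptually, the conversion parses each JSONL line and emits all records as one JSON document. The sketch below assumes the output is a single top-level array; the real scripts/convert_jsonl_to_json.py may structure its output differently, and `jsonl_to_json` is a hypothetical name.

```python
import json
from typing import Iterable


def jsonl_to_json(lines: Iterable[str]) -> str:
    """Parse JSONL lines and return one minified JSON array as a string."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Compact separators give minified output; keep non-ASCII text readable.
    return json.dumps(records, separators=(",", ":"), ensure_ascii=False)


print(jsonl_to_json(['{"a": 1}\n', "\n", '{"b": [1, 2]}\n']))
```

This loads every record into memory at once, so it suits a deduplicated session file better than a multi-gigabyte raw one.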
Security note
Use only on GitLab instances where you have authorization.
License
MIT