triplet-miner

Mine (anchor, positive, negative) triplets from git history for contrastive learning.

Given a git repo, triplet-miner uses each commit message as an anchor and that commit's diff as the positive, then selects negatives from other commits' diffs or from current file contents using a configurable strategy.

Quick Start

from triplet_miner import TripletMiner, MiningStrategy

miner = TripletMiner(
    default_strategy=MiningStrategy.HARD_NEGATIVE,
    max_commits=500,
    negatives_per_anchor=1,
)

triplets = miner.mine_from_repo("/path/to/repo")
print(f"Mined {len(triplets)} triplets")

# Export
miner.export(triplets, "output.json")

Command Line

# Install
pip install -e .

# Run from Python
python -c "
from triplet_miner import TripletMiner
miner = TripletMiner()
triplets = miner.mine_from_repo('.')
for t in triplets[:3]:
    print(f'anchor: {t.anchor[:80]}...')
    print(f'positive length: {len(t.positive)} chars')
    print(f'negative length: {len(t.negative)} chars')
    print(f'similarity: {t.similarity:.3f}')
    print()
"

What Gets Mined

For each non-merge commit in the repo:

Field     Source
anchor    Commit message
positive  Full diff of the commit (capped at 5000 chars)
negative  Selected from other commits' diffs or current file contents

Commits whose message is shorter than 10 characters, or whose diff is empty, are skipped.
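The skip rule and the 5000-character cap can be sketched as follows. This is a minimal illustration rather than the package's own code: `MIN_MESSAGE_CHARS`, `MAX_DIFF_CHARS`, and `make_pair` are hypothetical names, and stripping whitespace before measuring is an assumption.

```python
from typing import Optional, Tuple

MIN_MESSAGE_CHARS = 10   # README: messages under 10 chars are skipped
MAX_DIFF_CHARS = 5000    # README: positives are capped at 5000 chars


def make_pair(message: str, diff: str) -> Optional[Tuple[str, str]]:
    """Return (anchor, positive) for one commit, or None if it is skipped.

    Hypothetical helper; not part of the triplet_miner API.
    """
    message = message.strip()  # assumption: surrounding whitespace is ignored
    if len(message) < MIN_MESSAGE_CHARS or not diff.strip():
        return None
    return message, diff[:MAX_DIFF_CHARS]


# A short message is rejected; a long diff is truncated to the cap.
print(make_pair("fix typo", "diff --git a/x b/x"))
pair = make_pair("fix: handle empty diff", "x" * 6000)
print(len(pair[1]))
```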

Mining Strategies

from triplet_miner import MiningStrategy
Strategy       How it picks negatives
RANDOM         Random selection from candidate pool
HARD_NEGATIVE  Picks candidates most similar to anchor (hardest to distinguish)
SEMI_HARD      Picks candidates with similarity between 0 and anchor-positive similarity
DOMAIN_AWARE   Picks negatives from different repos (cross-repo mining only)
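To make the first three rows concrete, here is a rough sketch of each selection rule over a pool of candidates that already carry a similarity score. The function names and the `Scored` type are hypothetical, not the library's API. Two details are assumptions: the semi-hard band excludes its endpoints, and the hardest candidates within the band are preferred.

```python
import random
from typing import List, Sequence, Tuple

Scored = Tuple[str, float]  # (candidate text, similarity to the anchor)


def pick_random(pool: Sequence[Scored], k: int, seed: int = 0) -> List[str]:
    """RANDOM: sample k candidates without looking at similarity."""
    rng = random.Random(seed)
    return [text for text, _ in rng.sample(list(pool), min(k, len(pool)))]


def pick_hard(pool: Sequence[Scored], k: int) -> List[str]:
    """HARD_NEGATIVE: the k candidates most similar to the anchor."""
    ranked = sorted(pool, key=lambda c: c[1], reverse=True)
    return [text for text, _ in ranked[:k]]


def pick_semi_hard(pool: Sequence[Scored], k: int,
                   anchor_positive_sim: float) -> List[str]:
    """SEMI_HARD: similarity strictly between 0 and the anchor-positive value."""
    band = [c for c in pool if 0.0 < c[1] < anchor_positive_sim]
    ranked = sorted(band, key=lambda c: c[1], reverse=True)  # assumed ordering
    return [text for text, _ in ranked[:k]]


pool = [("a", 0.05), ("b", 0.40), ("c", 0.18), ("d", 0.0)]
print(pick_hard(pool, 1))
print(pick_semi_hard(pool, 1, anchor_positive_sim=0.23))
```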

Similarity is computed as Jaccard similarity over 3-shingles (MinHash-style), with word-level Jaccard as a fallback.
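The similarity measure can be illustrated with a small sketch. It makes several assumptions not confirmed by the README: shingles are built from whitespace-separated words, the fallback triggers when either text is too short to form a 3-shingle, and exact Jaccard is computed rather than a MinHash estimate. All function names are hypothetical.

```python
from typing import Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of n-word shingles (assumption: word-based, lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(left: str, right: str) -> float:
    sa, sb = shingles(left), shingles(right)
    if sa and sb:
        return jaccard(sa, sb)
    # Assumed fallback trigger: one side has fewer than 3 words.
    return jaccard(set(left.lower().split()), set(right.lower().split()))


print(similarity("fix empty diff handling", "fix empty diff parsing"))
print(similarity("fix bug", "fix typo"))  # falls back to word-level Jaccard
```

Both calls share one element out of three in the union, so both score one third, the first over shingles and the second over single words.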

Strategy Examples

# Hard negatives: best for training discriminative models
miner = TripletMiner(default_strategy=MiningStrategy.HARD_NEGATIVE)

# Semi-hard: balanced between easy and hard
miner = TripletMiner(default_strategy=MiningStrategy.SEMI_HARD)

# Multi-repo with cross-domain negatives
triplets = miner.mine_from_repos(
    ["/path/to/repo-a", "/path/to/repo-b", "/path/to/repo-c"],
    strategy=MiningStrategy.DOMAIN_AWARE,
)

With DOMAIN_AWARE, each negative comes from a different repo than its anchor, and a deterministic hash keeps the selection reproducible across runs.
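One way a deterministic hash could drive cross-repo selection is sketched below. This is an assumption about the mechanism, not the library's implementation: `pick_cross_repo_negative` is a hypothetical helper, and the choice of SHA-256 over the anchor text is illustrative only.

```python
import hashlib
from typing import Dict, List


def pick_cross_repo_negative(anchor: str, anchor_repo: str,
                             diffs_by_repo: Dict[str, List[str]]) -> str:
    """Pick a negative from any repo other than the anchor's, reproducibly."""
    # Sort repo names so the pool order does not depend on dict insertion order.
    pool = [diff
            for repo in sorted(diffs_by_repo) if repo != anchor_repo
            for diff in diffs_by_repo[repo]]
    if not pool:
        raise ValueError("need at least one other repo with diffs")
    # Hash the anchor to an index: same anchor and pool give the same pick.
    digest = hashlib.sha256(anchor.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


diffs = {
    "repo-a": ["diff a1", "diff a2"],
    "repo-b": ["diff b1"],
    "repo-c": ["diff c1", "diff c2"],
}
print(pick_cross_repo_negative("fix: handle empty diff", "repo-a", diffs))
```

Unlike Python's built-in `hash()`, which is salted per process for strings, a `hashlib` digest is stable across runs.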

Output Format

JSON

[
  {
    "anchor": "fix: handle empty diff in triplet mining",
    "positive": "diff --git a/triplet_miner/git_miner.py ...\n-index abc1234..def5678 100644\n...",
    "negative": "diff --git a/README.md ...\n Completely unrelated change...",
    "similarity": 0.23,
    "source": "triplet-miner",
    "metadata": {
      "sha": "abc1234def567",
      "author": "developer",
      "timestamp": 1700000000.0,
      "files_changed": 3,
      "files": ["triplet_miner/git_miner.py", "tests/test_git_miner.py"],
      "negative_similarity": 0.15
    }
  }
]

CSV

anchor,positive,negative,similarity,source,metadata
"fix: handle...","diff --git...","diff --git...",0.23,"triplet-miner","{...}"
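Reading either format back needs only the standard library. The sketch below round-trips one record in memory, using the field names shown above. It assumes the CSV `metadata` cell holds a JSON-encoded string, which the `"{...}"` placeholder suggests but does not confirm.

```python
import csv
import io
import json

FIELDS = ["anchor", "positive", "negative", "similarity", "source", "metadata"]

record = {
    "anchor": "fix: handle empty diff in triplet mining",
    "positive": "diff --git a/triplet_miner/git_miner.py ...",
    "negative": "diff --git a/README.md ...",
    "similarity": 0.23,
    "source": "triplet-miner",
    "metadata": {"sha": "abc1234def567", "files_changed": 3},
}

# JSON: a list of objects; nested metadata and float types survive as-is.
from_json = json.loads(json.dumps([record]))

# CSV: every cell is text, so metadata is serialized (assumed: as JSON).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({**record, "metadata": json.dumps(record["metadata"])})

# Reading CSV back requires converting the typed columns by hand.
buffer.seek(0)
from_csv = []
for row in csv.DictReader(buffer):
    row["similarity"] = float(row["similarity"])
    row["metadata"] = json.loads(row["metadata"])
    from_csv.append(row)

print(from_csv[0]["metadata"]["sha"])
```

JSON is the simpler choice when the nested metadata matters; CSV is convenient for spreadsheets but loses types.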

Quality Filtering

from triplet_miner import QualityFilter, Triplet

qf = QualityFilter(
    min_length=10,        # minimum chars for anchor/positive/negative
    max_length=50000,     # maximum chars
    deduplicate=True,     # remove near-duplicate triplets
    languages={"python", "rust"},  # only keep these languages
    min_quality=0.3,      # minimum quality score
)

filtered = qf.filter(triplets)

Quality score factors (0.0–1.0):

  • Base: 0.5
  • Length sweet spot (50–2000 chars): +0.1 per field
  • Moderate length (20–5000 chars): +0.05 per field
  • Anchor-positive similarity: +0 to +0.2
  • Metadata present: +0.05
  • SHA in metadata: +0.05
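The factors above can be combined roughly as in the sketch below. Several points are assumptions: the moderate-length bonus applies only when a field misses the sweet spot, the similarity bonus scales linearly, and the total is clamped to 1.0 (the listed bonuses can sum to 1.1). `quality_score` and `length_bonus` are hypothetical names, not the package's API.

```python
from typing import Any, Dict, Optional


def length_bonus(text: str) -> float:
    """Per-field bonus: sweet spot first, otherwise the moderate range."""
    n = len(text)
    if 50 <= n <= 2000:
        return 0.10
    if 20 <= n <= 5000:
        return 0.05
    return 0.0


def quality_score(anchor: str, positive: str, negative: str,
                  similarity: float,
                  metadata: Optional[Dict[str, Any]] = None) -> float:
    score = 0.5                                        # base
    score += sum(length_bonus(t) for t in (anchor, positive, negative))
    score += 0.2 * max(0.0, min(1.0, similarity))      # assumed linear scaling
    if metadata:
        score += 0.05                                  # metadata present
        if "sha" in metadata:
            score += 0.05                              # SHA in metadata
    return round(min(1.0, score), 4)                   # assumed clamp to 1.0


score = quality_score(
    anchor="fix: handle empty diff in triplet mining",  # 40 chars: moderate
    positive="+" * 500,                                  # sweet spot
    negative="-" * 500,                                  # sweet spot
    similarity=0.23,
    metadata={"sha": "abc1234def567"},
)
print(score)
```

With `min_quality=0.3`, a triplet only falls below the threshold if the base score is reduced somewhere, so under these assumptions that setting would filter nothing on score alone.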

Language detection uses file extensions mapped to 20+ languages.
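Extension-based detection amounts to a lookup table. The sketch below is illustrative only: the table is a small hypothetical subset, not the package's 20+ language map, and it assumes a triplet passes the `languages` filter if any file listed in its metadata maps to an allowed language.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set

# Hypothetical subset of an extension-to-language map.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".rs": "rust",
    ".ts": "typescript",
    ".go": "go",
    ".java": "java",
}


def detect_languages(files: Iterable[str]) -> Set[str]:
    """Map each path's extension to a language; unknown extensions are ignored."""
    found = set()
    for name in files:
        language = EXTENSION_TO_LANGUAGE.get(PurePosixPath(name).suffix.lower())
        if language:
            found.add(language)
    return found


def passes_language_filter(files: Iterable[str], allowed: Set[str]) -> bool:
    """Assumed rule: keep the triplet if any changed file is in an allowed language."""
    return bool(detect_languages(files) & allowed)


files = ["triplet_miner/git_miner.py", "README.md"]
print(detect_languages(files))
print(passes_language_filter(files, {"python", "rust"}))
```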

Multi-Repo Mining

from triplet_miner import TripletMiner, MiningStrategy, QualityFilter

miner = TripletMiner(
    default_strategy=MiningStrategy.DOMAIN_AWARE,
    max_commits=200,
    negatives_per_anchor=2,
)

triplets = miner.mine_from_repos(
    ["/repos/plato-core", "/repos/fleet-router", "/repos/constraint-substrate"],
    min_quality=0.3,
)

miner.export(triplets, "training-data.json")

mine_from_repos mines each repo independently, then applies quality filtering. With DOMAIN_AWARE, cross-repo negatives replace same-repo ones.

PyTorch / HuggingFace Integration

from triplet_miner import TripletMiner

miner = TripletMiner()
triplets = miner.mine_from_repo(".")

# PyTorch Dataset (requires pip install triplet-miner[torch])
dataset = miner.to_dataset(triplets)
# TripletDataset(len=150)

# HuggingFace Dataset (requires pip install datasets)
hf_dataset = miner.to_hf_dataset(triplets)

Install

pip install -e .

# With PyTorch support
pip install -e ".[torch]"

# With HuggingFace
pip install -e ".[torch]" datasets

TypeScript / npm

A TypeScript/npm package is available in the npm/ directory. To build it:

cd npm
npm install
npm run build

Tests

pytest tests/

License

MIT
