Git diff for meaning: detect semantic shifts, claim changes, tone drift, and risk changes in text.

SemShift

Git diff for meaning. Detect semantic shifts, claim changes, tone drift, and risk changes in text — local-first, no paid API required.


Git tells you what words changed. SemShift tells you what meaning changed.

When a privacy policy quietly switches from "We do not share data" to "We may share data with selected partners," git diff shows one line changed. SemShift flags it as CRITICAL and explains exactly why.

$ semshift compare old_policy.md new_policy.md --mode policy

╭──────────────────────────── SemShift Report ──────────────────────────────╮
│  old_policy.md → new_policy.md                                            │
│  Mode: policy | Backend: tfidf                                            │
╰───────────────────────────────────────────────────────────────────────────╯

Overall semantic drift: 0.71  CRITICAL

Review Summary
- 4 semantically changed chunks
- 5 changed claims
- Risk increased: third-party sharing (critical).

Meaning Changes To Review
 1. Data Sharing — SEMANTICALLY CHANGED (drift: 0.89)
    Old: "We do not share personal data with third parties."
    New: "We may share personal data with selected partners."
    Why: Data-sharing policy changed.

 2. Liability — SEMANTICALLY CHANGED (drift: 0.80)
    Old: "We make reasonable efforts to protect user data."
    New: "We disclaim liability for indirect damages."
    Why: Liability shifted to users.

Risk Flags
- CRITICAL  third-party sharing — changed from no sharing to conditional sharing
- HIGH      longer retention    — 30 days → 180 days
- HIGH      reduced consent     — opt-out language appears removed

Recommended Next Steps
- Hold approval until highlighted meaning changes are reviewed.
- Route policy/privacy risk flags to the responsible legal or trust owner.
- Verify numeric changes (30 → 180 days) against the source of truth.

Why SemShift?

Most review tools are literal. They show a sentence changed — not whether the promise, obligation, or risk changed. That gap matters in:

| Document | What git diff misses |
| --- | --- |
| Privacy policy | We do not share → We may share with partners |
| Research paper | Accuracy metric quietly inflated from 78% → 95% |
| System prompt | Safety rule removed, hidden instruction added |
| Resume | 18% latency reduction → 45% latency reduction |
| README | experimental dropped, guaranteed added |
| Terms of service | Arbitration clause silently inserted |

SemShift gives reviewers a fast local signal for the parts worth reading carefully — before approving a PR or signing off on a document.


Features

  • Semantic matching — aligns chunks by meaning, not line number
  • Claim extraction — numbers, dates, metrics, modal verbs, strong phrases, policy terms, role/title terms
  • Tone analysis — cautious → confident, neutral → restrictive, technical → promotional
  • Risk heuristics — mode-specific flags with severity levels (low / medium / high / critical)
  • 6 domain modes — policy, research, resume, prompt, readme, default
  • Two embedding backends — TF-IDF (fast, offline, default) or SentenceTransformers (optional, deeper)
  • Multiple output formats — Rich terminal, JSON, markdown reports
  • GitHub Action — drop-in CI check with PR comments and artifacts
  • Local-first — no external API calls, no data leaves your machine
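
As a rough illustration of the claim-extraction idea in the list above, the sketch below pulls numbers, modal verbs, and negations out of two texts with regular expressions and reports what was added or removed. It is not SemShift's extractor: the patterns, the word lists, and the function names `extract_claims` and `diff_claims` are assumptions made for this example.

```python
import re

# Assumed patterns and word lists, chosen only for illustration.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?%?")
MODALS = {"may", "might", "must", "shall", "should", "will", "can", "could", "would"}
NEGATIONS = {"no", "not", "never"}


def extract_claims(text):
    """Collect numbers, modal verbs, and negations found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "numbers": NUMBER_RE.findall(text),
        "modals": [w for w in words if w in MODALS],
        "negations": [w for w in words if w in NEGATIONS],
    }


def diff_claims(old, new):
    """Report, per claim kind, which items disappeared and which appeared."""
    old_claims, new_claims = extract_claims(old), extract_claims(new)
    return {
        kind: {
            "removed": sorted(set(old_claims[kind]) - set(new_claims[kind])),
            "added": sorted(set(new_claims[kind]) - set(old_claims[kind])),
        }
        for kind in old_claims
    }


if __name__ == "__main__":
    print(diff_claims(
        "We do not share personal data. Logs are kept for 30 days.",
        "We may share personal data. Logs are kept for 180 days.",
    ))
```

A set difference ignores how often a term occurs, so a real extractor would likely also track positions and counts.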

Installation

Basic — TF-IDF backend (fast, works fully offline)

pip install semshift

With SentenceTransformers — deeper semantic embeddings (optional)

pip install "semshift[models]"

Then pass a model name:

semshift compare old.md new.md --model sentence-transformers/all-MiniLM-L6-v2

Development

git clone https://github.com/VeerajSai/SemShift.git
cd SemShift
pip install -e ".[dev]"
pytest

Quick Start

Compare two files:

semshift compare old_policy.md new_policy.md --mode policy

Compare raw text:

semshift compare-text \
  "We do not share personal data with third parties." \
  "We may share personal data with selected partners." \
  --mode policy

JSON output (for scripting or CI):

semshift compare old.md new.md --json

Generate a markdown report:

semshift compare old.md new.md --report report.md --top 10

Fail CI when drift is critical:

semshift compare old.md new.md --fail-on critical

List all available modes:

semshift modes

CLI Reference

semshift compare <old> <new> [OPTIONS]
semshift compare-text <old_text> <new_text> [OPTIONS]
semshift modes

| Option | Default | Description |
| --- | --- | --- |
| --mode | default | Review mode: default, policy, readme, research, resume, prompt |
| --model | tfidf | Embedding backend: tfidf (fast, offline) or a SentenceTransformers model name |
| --json | off | Machine-readable JSON output |
| --report <path> | | Write a markdown report to disk |
| --top <n> | 5 | Number of top meaning changes to show (1–25) |
| --fail-on <label> | | Exit code 1 when drift ≥ label: low, medium, high, critical |
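
The `--fail-on` rule ("exit code 1 when drift ≥ label") amounts to comparing two labels on an ordered scale. The sketch below shows one way to express that comparison. The helper names `should_fail` and `exit_code` are hypothetical and the code is not taken from SemShift's source; only the four labels and their order come from the table above.

```python
# Labels in increasing order of severity, as listed for --fail-on.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_fail(drift_label, fail_on):
    """Return True when drift_label is at least as severe as fail_on.

    fail_on=None models the case where --fail-on was not passed.
    An unknown label raises ValueError via list.index().
    """
    if fail_on is None:
        return False
    return SEVERITY_ORDER.index(drift_label) >= SEVERITY_ORDER.index(fail_on)


def exit_code(drift_label, fail_on):
    """Map the comparison to a process exit code (1 = fail, 0 = pass)."""
    return 1 if should_fail(drift_label, fail_on) else 0


if __name__ == "__main__":
    for label in SEVERITY_ORDER:
        print(label, exit_code(label, "high"))
```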

Modes

| Mode | Best for | What it watches |
| --- | --- | --- |
| default | General text | Generic meaning drift |
| policy | Privacy policies, ToS | Data sharing, consent, retention, tracking, liability, arbitration |
| readme | README, install docs | Features, limitations, platforms, requirements, pricing, guarantees |
| research | Papers, reports | Metrics, datasets, baselines, limitations, conclusions, uncertainty |
| resume | Resumes, CVs | Role titles, impact metrics, company names, inflated claims |
| prompt | System prompts, instructions | Safety rules, hidden instructions, scope constraints, output format |
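
One way to picture a mode is as a watchlist of topics, each with a few trigger keywords. The sketch below reports which watched topics are mentioned differently in the old and new text. The keyword lists and the names `WATCHLISTS` and `touched_topics` are invented for this example; SemShift's actual heuristics are not documented here and are probably richer than substring matching.

```python
# Hypothetical watchlists: topic -> substrings that signal the topic.
WATCHLISTS = {
    "policy": {
        "data sharing": ("share", "third part", "partner"),
        "retention": ("retain", "retention"),
        "consent": ("consent", "opt-out", "opt out"),
        "arbitration": ("arbitration",),
    },
    "default": {},
}


def touched_topics(old, new, mode="default"):
    """List topics whose set of matched keywords differs between old and new."""
    old_lower, new_lower = old.lower(), new.lower()
    topics = []
    for topic, keywords in WATCHLISTS.get(mode, {}).items():
        old_hits = {k for k in keywords if k in old_lower}
        new_hits = {k for k in keywords if k in new_lower}
        if old_hits != new_hits:
            topics.append(topic)
    return topics


if __name__ == "__main__":
    print(touched_topics(
        "We do not share personal data with third parties.",
        "We may share personal data with selected partners.",
        mode="policy",
    ))
```

In this picture the same pair of texts yields no topics in `default` mode, which is why choosing the right mode matters for the risk flags.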

Python API

compare_files()

from semshift import compare_files

result = compare_files(
    "old_policy.md",
    "new_policy.md",
    mode="policy",   # optional, default "default"
    model="tfidf",   # optional, default "tfidf"
)

print(result.drift_label)        # "critical"
print(result.overall_score)      # 0.71
print(result.summary)            # list of plain-English bullets

for flag in result.risk_flags:
    print(f"[{flag.severity.upper()}] {flag.category}: {flag.why}")

compare_text()

from semshift import compare_text

result = compare_text(
    old="We do not share personal data.",
    new="We may share personal data with partners.",
    mode="policy",
)

for item in result.claim_changes.modified_numbers:
    print(f"Number changed: {item['old']} → {item['new']}")

Result object reference

result.overall_score       # float 0.0–1.0 — magnitude of semantic drift
result.drift_label         # str  — "low", "medium", "high", or "critical"
result.summary             # list[str] — plain-English review bullets
result.chunk_matches       # list[ChunkMatch] — matched, added, removed chunks
result.claim_changes       # ClaimDiff — numbers, modals, phrases, policy terms
result.tone_shift          # ToneShift — from/to label, score, explanation
result.risk_flags          # list[RiskFlag] — severity, category, why
result.recommendations     # list[str] — actionable next steps
result.embedding_backend   # str — "tfidf", "tfidf-fallback", or model name
result.warnings            # list[str] — any warnings (e.g., fallback used)
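
The attributes above are enough to build small review helpers. The sketch below formats a score line plus the risk flags at or above a chosen severity. It is duck-typed, so it accepts any object exposing `overall_score`, `drift_label`, and `risk_flags`. The `review_lines` helper is hypothetical, and the demo uses `SimpleNamespace` stand-ins filled with values from the sample report rather than a real SemShift result.

```python
from types import SimpleNamespace

# Lower rank = more severe.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def review_lines(result, min_severity="high"):
    """Return printable lines: a score header, then flags sorted by severity."""
    cutoff = SEVERITY_RANK[min_severity]
    flags = [f for f in result.risk_flags if SEVERITY_RANK[f.severity] <= cutoff]
    flags.sort(key=lambda f: SEVERITY_RANK[f.severity])
    lines = [f"drift={result.overall_score:.2f} ({result.drift_label})"]
    lines += [f"[{f.severity.upper()}] {f.category}: {f.why}" for f in flags]
    return lines


# Stand-in object mirroring the documented attributes (not a real result).
demo = SimpleNamespace(
    overall_score=0.71,
    drift_label="critical",
    risk_flags=[
        SimpleNamespace(severity="high", category="longer retention",
                        why="30 days → 180 days"),
        SimpleNamespace(severity="critical", category="third-party sharing",
                        why="changed from no sharing to conditional sharing"),
        SimpleNamespace(severity="high", category="reduced consent",
                        why="opt-out language appears removed"),
    ],
)

if __name__ == "__main__":
    print("\n".join(review_lines(demo)))
```

With a real installation, the object returned by `compare_files()` could presumably be passed in place of `demo`.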

JSON Output

Use --json for machine-readable output suitable for CI pipelines or downstream tooling:

semshift compare old.md new.md --json
{
  "files": { "old": "old.md", "new": "new.md" },
  "mode": "policy",
  "overall_score": 0.71,
  "drift_label": "critical",
  "summary": ["4 semantically changed chunks", "5 changed claims"],
  "risk_flags": [
    { "severity": "critical", "category": "third-party sharing", "why": "..." }
  ],
  "recommendations": ["Hold approval until changes are reviewed."],
  "embedding_backend": "tfidf",
  "warnings": []
}
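
For downstream tooling, the JSON can be reduced to the few fields a pipeline usually needs. The sketch below parses the sample document shown above, counts flags per severity, and notes whether the `tfidf-fallback` backend was used. The `summarize` function is a hypothetical helper, and the sketch assumes the real output has the same keys as the sample.

```python
import json
from collections import Counter

# The sample report from this section, embedded so the sketch is self-contained.
SAMPLE = """
{
  "files": { "old": "old.md", "new": "new.md" },
  "mode": "policy",
  "overall_score": 0.71,
  "drift_label": "critical",
  "summary": ["4 semantically changed chunks", "5 changed claims"],
  "risk_flags": [
    { "severity": "critical", "category": "third-party sharing", "why": "..." }
  ],
  "recommendations": ["Hold approval until changes are reviewed."],
  "embedding_backend": "tfidf",
  "warnings": []
}
"""


def summarize(report_json):
    """Reduce a JSON report string to label, score, and flag counts."""
    report = json.loads(report_json)
    counts = Counter(flag["severity"] for flag in report.get("risk_flags", []))
    return {
        "label": report["drift_label"],
        "score": report["overall_score"],
        "flags_by_severity": dict(counts),
        "used_fallback": report.get("embedding_backend") == "tfidf-fallback",
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```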

GitHub Action

Drop SemShift into any pull request workflow to catch semantic drift automatically.

Basic setup

name: SemShift Check

on:
  pull_request:
    paths:
      - "**/*.md"
      - "**/*.txt"
      - "**/*.yml"

permissions:
  contents: read
  pull-requests: write

jobs:
  semshift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: VeerajSai/SemShift@v1
        with:
          mode: "policy"
          fail_on: "critical"
          pr_comment: "true"

Advanced — specific files

- uses: VeerajSai/SemShift@v1
  with:
    files: "docs/PRIVACY.md,README.md,system_prompts/*.txt"
    mode: "policy"
    fail_on: "high"
    pr_comment: "true"
    report: "semshift-analysis.md"

Action inputs

| Input | Default | Description |
| --- | --- | --- |
| files | auto-detect | Comma-separated files or globs. Empty = auto-detect changed files in the PR. |
| mode | default | Review mode |
| fail_on | high | Fail when drift reaches: low, medium, high, critical |
| model | tfidf | Embedding backend |
| report | semshift-report.md | Path for the markdown report artifact |
| pr_comment | false | Post or update a PR comment with the drift summary |
| github_token | github.token | Token for PR comments |

Action outputs

| Output | Description |
| --- | --- |
| report_path | Path to the generated markdown report |
| worst_label | Worst drift label found: low, medium, high, or critical |

The action uploads a markdown report as a workflow artifact and can post a summary comment directly on the pull request.


How It Works

SemShift runs a fully local, explainable pipeline — no LLM calls, no black boxes:

Input files / text
      │
      ▼
 ┌──────────┐    ┌───────────┐    ┌───────────────────────────┐
 │  Loader  │───▶│  Chunker  │───▶│    Embedding Backend      │
 └──────────┘    └───────────┘    │  TF-IDF (default, fast)   │
                                  │  SentenceTransformers      │
                                  └──────────┬────────────────┘
                                             │  cosine similarity
                                             ▼
                                  ┌──────────────────────────┐
                                  │    Semantic Matcher      │
                                  │    (heading-aware)       │
                                  └──────────┬───────────────┘
                                             │
                          ┌──────────────────┼──────────────────┐
                          ▼                  ▼                   ▼
                   ┌────────────┐   ┌──────────────┐   ┌──────────────┐
                   │   Claim    │   │     Risk     │   │    Tone      │
                   │ Extractor  │   │   Analyzer   │   │   Analyzer   │
                   └────────────┘   └──────────────┘   └──────────────┘
                          │                  │                   │
                          └──────────────────┴───────────────────┘
                                             │
                                             ▼
                                  ┌──────────────────────────┐
                                  │     Report Generator     │
                                  │  Rich / JSON / Markdown  │
                                  │     GitHub Action        │
                                  └──────────────────────────┘
  1. Load — read supported text files with encoding fallback (UTF-8, UTF-8-sig, CP1252)
  2. Chunk — split into reviewable units, preserving headings and line ranges
  3. Embed — vectorize with TF-IDF (no download needed) or SentenceTransformers
  4. Align — match old chunks to new chunks via cosine similarity; heading-aware pre-alignment for structured documents
  5. Classify — label each chunk: unchanged, lightly changed, semantically changed, removed, or added
  6. Extract — pull out high-signal claims: numbers, dates, modals, strong phrases, policy terms, metrics
  7. Analyze — apply mode-specific risk heuristics and tone shift detection
  8. Report — produce Rich terminal output, JSON, markdown report, or GitHub Action summary
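
Steps 3 to 5 can be sketched in plain Python: build TF-IDF vectors, match each old chunk to its most similar unused new chunk by cosine similarity, and label the pair from the score. This is a simplified stand-in, not SemShift's implementation. The greedy matching, the smoothed IDF formula, the thresholds (`unchanged`, `light`, `min_match`), and every function name are assumptions, and heading-aware pre-alignment is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(chunks):
    """Return one sparse TF-IDF vector (term -> weight) per chunk."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(old_chunks, new_chunks, unchanged=0.95, light=0.75, min_match=0.2):
    """Greedily pair chunks; return (old, new, label, drift) tuples."""
    vectors = tfidf_vectors(old_chunks + new_chunks)
    old_vecs = vectors[:len(old_chunks)]
    new_vecs = vectors[len(old_chunks):]
    used, rows = set(), []
    for i, old_vec in enumerate(old_vecs):
        best_score, best_j = 0.0, None
        for j, new_vec in enumerate(new_vecs):
            score = cosine(old_vec, new_vec)
            if j not in used and score > best_score:
                best_score, best_j = score, j
        if best_j is None or best_score < min_match:
            rows.append((old_chunks[i], None, "removed", 1.0))
            continue
        used.add(best_j)
        if best_score >= unchanged:
            label = "unchanged"
        elif best_score >= light:
            label = "lightly changed"
        else:
            label = "semantically changed"
        drift = round(max(0.0, 1.0 - best_score), 2)
        rows.append((old_chunks[i], new_chunks[best_j], label, drift))
    for j, chunk in enumerate(new_chunks):
        if j not in used:
            rows.append((None, chunk, "added", 1.0))
    return rows


if __name__ == "__main__":
    old = ["We do not share personal data with third parties.",
           "Data is retained for 30 days."]
    new = ["Data is retained for 30 days.",
           "We may share personal data with selected partners.",
           "Disputes are resolved by binding arbitration."]
    for row in align(old, new):
        print(row[2], row[3])
```

Because matching is by similarity rather than position, the retention sentence is still paired with its identical counterpart even though it moved from second to first place.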

Supported File Types

| Extension | Format |
| --- | --- |
| .md, .rst | Markdown / reStructuredText |
| .txt | Plain text |
| .yml, .yaml | YAML |
| .json | JSON |
| .py, .js, .ts | Source code |
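
Reading these files relies on the encoding fallback mentioned in the Load step (UTF-8, UTF-8-sig, CP1252). A minimal sketch of such a fallback follows. The function names are hypothetical, and the order is an assumption: `utf-8-sig` is tried first because it decodes both BOM-prefixed and plain UTF-8, which may differ from what SemShift does internally. The final lossy decode is also an addition of this sketch.

```python
from pathlib import Path


def decode_with_fallback(data, encodings=("utf-8-sig", "cp1252")):
    """Try each encoding in turn; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def read_text_with_fallback(path):
    """Read a file as bytes and decode it with the fallback chain."""
    return decode_with_fallback(Path(path).read_bytes())


if __name__ == "__main__":
    print(decode_with_fallback(b"\xef\xbb\xbfhello"))
    print(decode_with_fallback("café".encode("cp1252")))
```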

Examples

The examples/ directory has realistic paired documents for every mode:

# Policy drift (data sharing, retention, consent)
semshift compare examples/old_policy.md examples/new_policy.md --mode policy

# Terms of service
semshift compare examples/old_terms.md examples/new_terms.md --mode policy

# Research paper (metrics, baselines, limitations)
semshift compare examples/old_research.md examples/new_research.md --mode research

# Resume rewrite (inflated claims, changed titles)
semshift compare examples/old_resume.md examples/new_resume.md --mode resume

# System prompt (safety rules, hidden instructions)
semshift compare examples/old_prompt.txt examples/new_prompt.txt --mode prompt

# README changes (feature claims, requirements, pricing)
semshift compare examples/old_readme.md examples/new_readme.md --mode readme

See examples/sample_policy_report.md for a full markdown report example.


What SemShift Is Not

  • Not a legal opinion or compliance tool
  • Not a fact-checker or plagiarism detector
  • Not a replacement for human review
  • Not dependent on any paid LLM API

SemShift is a review assistant. It identifies likely semantic drift and explains why a human should look closely.


Contributing

Contributions are welcome. Most useful:

  • Real-world examples where word diff missed a meaningful semantic change
  • Improved chunking or matching that stays explainable
  • Mode-specific risk heuristics backed by tests
  • CLI, markdown, or GitHub Action UX improvements
  • Bug reports and edge case fixes

See CONTRIBUTING.md for the full guide — including how to add a new mode and the pull request checklist.

Development setup

git clone https://github.com/VeerajSai/SemShift.git
cd SemShift
pip install -e ".[dev]"

Run tests

pytest          # all tests
pytest -v       # verbose
pytest tests/test_cli.py  # specific file

Lint and format

ruff check .    # lint
ruff format .   # format

Changelog

See CHANGELOG.md for what changed in each release.


Security

Report vulnerabilities privately via GitHub Security Advisories.

See SECURITY.md for the full security policy.


License

MIT — free to use, modify, and distribute.



Built for reviewers, maintainers, and teams that care about meaning — not just words.
