Local-first semantic review assistant that flags likely risky meaning changes in edited text.

SemShift

Catch risky meaning changes Git diff misses.

SemShift is a local-first review assistant for AI-rewritten and human-edited docs, prompts, policies, resumes, and research drafts. It flags likely semantic drift before you merge, publish, or submit text.

Current release line: v0.2.x alpha. The default backend is lexical + heuristic (tfidf). The optional SentenceTransformers backend computes semantic embeddings locally; its output is a similarity signal, not a claim of legal, factual, or scientific authority.

5-Second Demo

Before:

We do not share personal data with third parties.

After:

We may share personal data with trusted partners.

SemShift:

CRITICAL: privacy commitment weakened.
Risk flag: third-party sharing.
Recommendation: hold approval until a human reviews the change.
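
To make the demo concrete, here is a minimal sketch of the kind of heuristic rule that could raise such a flag. It is an illustration only, not SemShift's actual rule set: the patterns and the function name commitment_weakened are hypothetical.

```python
import re

# Hypothetical patterns for illustration; SemShift's real rules are not shown in this README.
PROHIBITION = re.compile(r"\b(do not|does not|will not|never)\b", re.IGNORECASE)
PERMISSION = re.compile(r"\b(may|can|might)\b", re.IGNORECASE)


def commitment_weakened(old: str, new: str) -> bool:
    """Return True when a prohibition in `old` gives way to permissive wording in `new`."""
    had_prohibition = bool(PROHIBITION.search(old))
    still_prohibits = bool(PROHIBITION.search(new))
    now_permissive = bool(PERMISSION.search(new))
    return had_prohibition and not still_prohibits and now_permissive


print(commitment_weakened(
    "We do not share personal data with third parties.",
    "We may share personal data with trusted partners.",
))  # True
```

A rule this simple will both miss rewordings and fire on harmless ones, which is consistent with the caveats listed under Limitations.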

Install

pip install semshift

Optional local embedding backend:

pip install "semshift[models]"

Development:

pip install -e ".[dev]"

Quick Start

semshift compare examples/old_policy.md examples/new_policy.md --mode policy
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --json
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --report semshift-report.md

Use limits for large or generated files:

semshift compare old.md new.md --max-file-size 5242880 --max-chunks 2000
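
The value 5242880 above equals 5 × 1024 × 1024. Assuming --max-file-size is expressed in bytes (this README does not state the unit), a small helper such as the hypothetical mib_to_bytes below can make these limits easier to read in scripts:

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024


# Assumption: the CLI limit is a byte count.
limit = mib_to_bytes(5)
print(limit)  # 5242880
```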

GitHub Action

name: SemShift Check

on:
  pull_request:
    paths:
      - "**/*.md"
      - "**/*.txt"
      - "**/*.yml"

permissions:
  contents: read
  pull-requests: write

jobs:
  semshift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: VeerajSai/SemShift@v0.2.0
        with:
          mode: policy
          fail_on: high
          pr_comment: "true"
          model: tfidf
          report: semshift-report.md

Inputs include files, mode, fail_on, model, report, base_ref, pr_comment, github_token, max_file_size, and max_chunks.

Note: fail_on defaults to high. With that default, the action exits with code 1 when any file reaches high or critical drift.
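
The threshold behavior in the note can be sketched as follows. This is an illustration, not the action's implementation: the full label ordering (low, medium, high, critical) and the function name exit_code are assumptions; only high and critical are named in this README.

```python
# Assumed severity order; only "high" and "critical" are confirmed by the README.
SEVERITY = ["low", "medium", "high", "critical"]


def exit_code(drift_labels, fail_on="high"):
    """Return 1 if any label is at or above the fail_on threshold, else 0."""
    threshold = SEVERITY.index(fail_on)
    # default=-1 means "no files compared", which never fails the check.
    worst = max((SEVERITY.index(label) for label in drift_labels), default=-1)
    return 1 if worst >= threshold else 0


print(exit_code(["low", "medium"]))             # 0
print(exit_code(["low", "critical"]))           # 1
print(exit_code(["high"], fail_on="critical"))  # 0
```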

Python API

from semshift import compare_files, compare_text

# Compare two in-memory strings using the policy rules.
result = compare_text(
    old="We do not share personal data.",
    new="We may share personal data with partners.",
    mode="policy",
)

# Inspect the headline fields, then render the full report.
print(result.drift_label)
print(result.summary)
print(result.risk_flags)
print(result.to_markdown())

# Compare two files on disk and keep the Markdown report as a string.
file_result = compare_files("old_policy.md", "new_policy.md", mode="policy")
report = file_result.to_markdown()

Canonical fields include drift_label, overall_score, drift_score, summary, matched_chunks, chunk_matches, claim_changes, tone_shift, risk_flags, warnings, metadata, to_dict(), to_json(), and to_markdown().

Modes

| Mode | Maturity | Best for | Main signals |
|---|---|---|---|
| policy | stable | privacy policies, terms, consent language | sharing, retention, rights, obligations |
| prompt | stable | system prompts and instruction files | safety rules, hidden instructions, scope |
| research | experimental | research drafts and reports | metrics, datasets, baselines, limitations |
| resume | experimental | resumes and bios | titles, metrics, company/project names |
| readme | experimental | README and support docs | install requirements, guarantees, scope |
| default | stable | general text review | drift score, claims, tone, generic risk |

How It Works

SemShift combines transparent signals:

  1. Chunk alignment by headings and text structure.
  2. Lexical TF-IDF similarity by default, or optional local SentenceTransformers embeddings.
  3. Claim extraction, tone signals, and mode-specific risk rules.
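
Step 1 can be pictured with a short sketch. It assumes Markdown-style # headings and pairs chunks whose heading text is identical; SemShift's own alignment may work differently, and the names split_by_heading and align_chunks are hypothetical.

```python
import re


def split_by_heading(text):
    """Split Markdown text into {heading: body}; later duplicate headings overwrite earlier ones."""
    chunks = {}
    heading, body = "", []

    def flush():
        content = "\n".join(body).strip()
        if heading or content:  # skip an empty preamble before the first heading
            chunks[heading] = content

    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


def align_chunks(old_text, new_text):
    """Pair chunks that share a heading; a chunk missing on one side pairs with None."""
    old_chunks = split_by_heading(old_text)
    new_chunks = split_by_heading(new_text)
    headings = list(old_chunks) + [h for h in new_chunks if h not in old_chunks]
    return [(h, old_chunks.get(h), new_chunks.get(h)) for h in headings]


old = "# Sharing\nWe do not share data.\n# Retention\nWe keep data 30 days."
new = "# Sharing\nWe may share data.\n# Contact\nEmail us."
for heading, before, after in align_chunks(old, new):
    print(heading, "|", before, "->", after)
```

Aligning first means each similarity score compares like with like, and removed or added sections show up as pairs with None on one side.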

TF-IDF is a lexical backend, not a true semantic model. Optional embedding models may download weights on first use; document text is processed locally unless you explicitly integrate external services.
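
To show what "lexical" means in practice, here is a self-contained TF-IDF cosine similarity sketch using only the standard library. It is not SemShift's code: the tokenizer, the smoothed IDF formula, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF keeps weights positive even for terms present in every document.
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


old = "We do not share personal data with third parties."
new = "We may share personal data with trusted partners."
vec_old, vec_new = tfidf_vectors([old, new])
score = cosine(vec_old, vec_new)
print(round(score, 2))
```

The score depends only on which words the two texts share. Replacing "do not" with "may" is weighted no differently from replacing "third" with "trusted", so a lexical score alone cannot tell a reversed commitment from a harmless rewording.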

Benchmarks

SemShift includes a starter self-evaluation benchmark for regression tracking. See docs/benchmarks.md.

Do not treat starter benchmark numbers as external validation. Human-labeled external evaluation is still needed.

Compared To

| Tool | What it catches | What it misses |
|---|---|---|
| Git diff | exact text edits | risk, claims, weakened obligations |
| diff-match-patch | text similarity | domain-specific meaning changes |
| LLM judge | broad qualitative review | local determinism, reproducibility, privacy by default |
| Grammar checker | style and grammar | policy, prompt, research, and factual drift |
| SemShift | likely risky semantic drift | subtle context, truth verification, legal authority |

Limitations

SemShift is:

  • not legal advice
  • not a fact-checker
  • not scientific authority
  • not a replacement for human review
  • likely to miss subtle context-dependent changes
  • likely to false-positive on harmless paraphrases
  • lexical + heuristic by default

Troubleshooting

semshift: command not found: Confirm the active environment is the one where you installed semshift.

Model import error: Install optional dependencies with pip install "semshift[models]", or use --model tfidf.

Slow first model run: SentenceTransformers may download weights and initialize on first use.

Windows path issues: Quote paths with spaces and prefer PowerShell-compatible quoting.

GitHub Action fork PRs: PR comments can be unavailable for forks with restricted permissions; the report artifact is still written.

No files matched: Set the files input explicitly, use actions/checkout with fetch-depth: 0, or check that the changed files have supported extensions.

Report too long: GitHub comments are truncated and the full report is uploaded as an artifact.

Roadmap

  • stronger external benchmark
  • NLI-based deep mode for contradiction/entailment checks
  • VS Code extension
  • web demo
  • docs site
  • more file formats

Author

Built by Veeraj Sai.

Citation

Please cite SemShift using CITATION.cff.

License

MIT. See LICENSE.

Security

Report vulnerabilities through GitHub Security Advisories. SemShift is local-first by default, but optional model downloads and external CI integrations should be reviewed in your environment.
