Local-first semantic review assistant that flags likely risky meaning changes in edited text.
SemShift
Catch risky meaning changes Git diff misses.
SemShift is a local-first review assistant for AI-rewritten and human-edited docs, prompts, policies, resumes, and research drafts. It flags likely semantic drift before you merge, publish, or submit text.
Current release line: v0.2.x alpha. The default backend (tfidf) is lexical and heuristic. The optional SentenceTransformers backend computes embeddings locally; its scores are similarity signals, not claims of legal, factual, or scientific authority.
5-Second Demo
Before:
We do not share personal data with third parties.
After:
We may share personal data with trusted partners.
SemShift:
CRITICAL: privacy commitment weakened.
Risk flag: third-party sharing.
Recommendation: hold approval until a human reviews the change.
Install
pip install semshift
Optional local embedding backend:
pip install "semshift[models]"
Development:
pip install -e ".[dev]"
Quick Start
semshift compare examples/old_policy.md examples/new_policy.md --mode policy
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --json
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --report semshift-report.md
Use limits for large or generated files:
semshift compare old.md new.md --max-file-size 5242880 --max-chunks 2000
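The value 5242880 equals 5 × 1024 × 1024, which is 5 MiB if the flag is measured in bytes. That unit is an assumption; it is not stated here. As a minimal sketch under that assumption, a script could skip oversized files before invoking the CLI. The helper name within_size_limit is hypothetical and not part of SemShift:

```python
import os

# 5 * 1024 * 1024 == 5242880; assumed to be a byte count (unit not confirmed).
MAX_FILE_SIZE = 5 * 1024 * 1024


def within_size_limit(path, max_bytes=MAX_FILE_SIZE):
    """Return True when the file at path is no larger than max_bytes."""
    return os.path.getsize(path) <= max_bytes
```

A caller would filter its file list with within_size_limit and pass only the remaining paths to semshift compare.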
GitHub Action
name: SemShift Check
on:
  pull_request:
    paths:
      - "**/*.md"
      - "**/*.txt"
      - "**/*.yml"
permissions:
  contents: read
  pull-requests: write
jobs:
  semshift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: VeerajSai/SemShift@v0.2.0
        with:
          mode: policy
          fail_on: high
          pr_comment: "true"
          model: tfidf
          report: semshift-report.md
Inputs include files, mode, fail_on, model, report, base_ref, pr_comment, github_token, max_file_size, and max_chunks.
Note: fail_on defaults to high. The action exits with code 1 when any file reaches high or critical drift.
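The exit behaviour in the note can be pictured as a severity threshold. The sketch below is not SemShift's implementation. The label ordering (low, medium, high, critical) and the helper name exit_code are assumptions made for illustration; only high and critical are named in this README.

```python
# Illustrative only: SemShift's real label set and ordering may differ.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]  # assumed ordering


def exit_code(labels, fail_on="high"):
    """Return 1 if any label is at or above the fail_on threshold, else 0."""
    threshold = SEVERITY_ORDER.index(fail_on)
    for label in labels:
        if SEVERITY_ORDER.index(label) >= threshold:
            return 1
    return 0
```

Under these assumptions, exit_code(["low", "critical"]) fails the check with the default threshold, while the same labels with a hypothetical stricter or looser fail_on value would be compared against a different cut-off.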
Python API
from semshift import compare_files, compare_text

result = compare_text(
    old="We do not share personal data.",
    new="We may share personal data with partners.",
    mode="policy",
)
print(result.drift_label)
print(result.summary)
print(result.risk_flags)
print(result.to_markdown())

file_result = compare_files("old_policy.md", "new_policy.md", mode="policy")
report = file_result.to_markdown()
Canonical result fields include drift_label, overall_score, drift_score, summary, matched_chunks, chunk_matches, claim_changes, tone_shift, risk_flags, warnings, and metadata. The result also provides the methods to_dict(), to_json(), and to_markdown().
Modes
| Mode | Maturity | Best for | Main signals |
|---|---|---|---|
| policy | stable | privacy policies, terms, consent language | sharing, retention, rights, obligations |
| prompt | stable | system prompts and instruction files | safety rules, hidden instructions, scope |
| research | experimental | research drafts and reports | metrics, datasets, baselines, limitations |
| resume | experimental | resumes and bios | titles, metrics, company/project names |
| readme | experimental | README and support docs | install requirements, guarantees, scope |
| default | stable | general text review | drift score, claims, tone, generic risk |
How It Works
SemShift combines transparent signals:
- Chunk alignment by headings and text structure.
- Lexical TF-IDF similarity by default, or optional local SentenceTransformers embeddings.
- Claim extraction, tone signals, and mode-specific risk rules.
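To make the first signal concrete, here is a minimal sketch of heading-based chunk alignment. It is an assumption about the general technique, not SemShift's code: the function names split_by_headings and align_chunks are hypothetical, only Markdown-style # headings are recognised, and repeated heading titles simply overwrite one another.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(text):
    """Map each heading title to the body text that follows it."""
    chunks = {}
    current = "(preamble)"  # text that appears before the first heading
    lines = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            chunks[current] = "\n".join(lines).strip()
            current = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    chunks[current] = "\n".join(lines).strip()
    return chunks


def align_chunks(old_text, new_text):
    """Pair chunks that share a heading; list headings found on one side only."""
    old, new = split_by_headings(old_text), split_by_headings(new_text)
    paired = {h: (old[h], new[h]) for h in old if h in new}
    removed = [h for h in old if h not in new]
    added = [h for h in new if h not in old]
    return paired, removed, added
```

Each paired chunk can then be scored on its own, so a weakened sentence under one heading is not diluted by unchanged text elsewhere in the file.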
TF-IDF is a lexical backend, not a true semantic model. Optional embedding models may download weights on first use; document text is processed locally unless you explicitly integrate external services.
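A small standard-library sketch shows why a lexical score is not a semantic one. It is not SemShift's backend: the tokenizer, the smoothed IDF formula, and the function names are assumptions chosen for illustration. The score depends only on shared words, so the reversal from "do not share" to "may share" is treated like any other word substitution.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term weights) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so terms shared by every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


old = "We do not share personal data with third parties."
new = "We may share personal data with trusted partners."
v_old, v_new = tfidf_vectors([old, new])
similarity = cosine(v_old, v_new)
print(f"lexical similarity: {similarity:.2f}")
```

Because the number alone cannot tell a harmless paraphrase from a reversed commitment, a lexical score is best read together with the claim, tone, and mode-specific risk signals listed above.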
Benchmarks
SemShift includes a starter self-evaluation benchmark for regression tracking. See docs/benchmarks.md.
Do not treat starter benchmark numbers as external validation. Human-labeled external evaluation is still needed.
Compared To
| Tool | What it catches | What it misses |
|---|---|---|
| Git diff | exact text edits | risk, claims, weakened obligations |
| diff-match-patch | text similarity | domain-specific meaning changes |
| LLM judge | broad qualitative review | local determinism, reproducibility, privacy by default |
| Grammar checker | style and grammar | policy, prompt, research, and factual drift |
| SemShift | likely risky semantic drift | subtle context, truth verification, legal authority |
Limitations
SemShift is:
- not legal advice
- not a fact-checker
- not scientific authority
- not a replacement for human review
- likely to miss subtle context-dependent changes
- likely to false-positive on harmless paraphrases
- lexical + heuristic by default
Troubleshooting
- semshift: command not found: Confirm the active environment is the one where you installed semshift.
- Model import error: Install optional dependencies with pip install "semshift[models]", or use --model tfidf.
- Slow first model run: SentenceTransformers may download weights and initialize on first use.
- Windows path issues: Quote paths with spaces and prefer PowerShell-compatible quoting.
- GitHub Action fork PRs: PR comments can be unavailable for forks with restricted permissions; the report artifact is still written.
- No files matched: Pass files, use actions/checkout with fetch-depth: 0, or check supported extensions.
- Report too long: GitHub comments are truncated and the full report is uploaded as an artifact.
Roadmap
- stronger external benchmark
- NLI-based deep mode for contradiction/entailment checks
- VS Code extension
- web demo
- docs site
- more file formats
Author
Built by Veeraj Sai.
Citation
Please cite SemShift using CITATION.cff.
License
MIT. See LICENSE.
Security
Report vulnerabilities through GitHub Security Advisories. SemShift is local-first by default, but optional model downloads and external CI integrations should be reviewed in your environment.