
🔍 lmscan

Detect AI-generated text. Fingerprint which LLM wrote it. Open-source GPTZero alternative.


GPTZero charges $15/month. Originality.ai charges per scan. Turnitin locks you into institutional contracts.

lmscan is free, open-source, works offline, and tells you which model wrote the text.

$ lmscan "In today's rapidly evolving digital landscape, it's important
to note that artificial intelligence has become a pivotal force in
transforming how we navigate the complexities of modern life..."

๐Ÿ” lmscan v0.1.0 โ€” AI Text Forensics
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

  Verdict:     ๐Ÿค– Likely AI (77% confidence)
  Words:       184
  Sentences:   10
  Scanned in 0.01s

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Feature                    โ”‚ Value    โ”‚ Signal             โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Burstiness                 โ”‚ 0.07     โ”‚ ๐Ÿ”ด Very low (AI)    โ”‚
โ”‚ Sentence length variance   โ”‚ 0.27     โ”‚ ๐ŸŸก Below average    โ”‚
โ”‚ Slop word density          โ”‚ 20.7%    โ”‚ ๐Ÿ”ด High (AI)        โ”‚
โ”‚ Transition word ratio      โ”‚ 2.2%     โ”‚ ๐ŸŸก Elevated         โ”‚
โ”‚ Readability consistency    โ”‚ 0.00     โ”‚ ๐Ÿ”ด Very low (AI)    โ”‚
โ”‚ ...                        โ”‚          โ”‚                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ”Ž Model Attribution
  1. GPT-4 / ChatGPT    62% โ€” "delve", "tapestry", "beacon", "landscape" (ร—2), +19 more
  2. Claude (Anthropic)  13% โ€” "robust", "nuanced", "comprehensive"
  3. Gemini (Google)      9% โ€” "furthermore", "additionally"

โš ๏ธ  Flags
  โ€ข Very low burstiness (0.07) โ€” AI text is more uniform in complexity
  โ€ข High slop word density (20.7%) โ€” contains known AI vocabulary markers

Install

pip install lmscan

Zero dependencies. Works with Python 3.9+. No API keys. No internet. No GPU.

Usage

# Scan text directly
lmscan "Your text here..."

# Scan a file
lmscan document.txt

# Pipe from stdin
cat essay.txt | lmscan -

# JSON output (for scripts and CI)
lmscan document.txt --format json

# Per-sentence breakdown
lmscan document.txt --sentences

# CI gate: fail if AI probability > 50%
lmscan submission.txt --threshold 0.5

Python API

from lmscan import scan

result = scan("Text to analyze...")

print(f"AI probability: {result.ai_probability:.0%}")
print(f"Verdict: {result.verdict}")
print(f"Confidence: {result.confidence}")

# Which model wrote it?
for model in result.model_attribution:
    print(f"  {model.model}: {model.confidence:.0%}")
    for evidence in model.evidence[:3]:
        print(f"    → {evidence}")

# Per-sentence analysis
for sentence in result.sentence_scores:
    if sentence.ai_probability > 0.7:
        print(f"  🤖 {sentence.text[:60]}... ({sentence.ai_probability:.0%})")

Scan entire directories

from lmscan import scan_file
import glob

for path in glob.glob("submissions/*.txt"):
    result = scan_file(path)
    print(f"{path}: {result.verdict} ({result.ai_probability:.0%})")

How It Works

lmscan uses 12 statistical features derived from computational linguistics research to distinguish AI-generated text from human writing:

| Feature | What it measures | AI signal |
|---|---|---|
| Burstiness | Variance in sentence complexity | AI text is unusually uniform |
| Sentence length variance | How much sentence lengths vary | AI produces uniform lengths |
| Vocabulary richness | Type-token ratio (Yule's K corrected) | AI reuses words more |
| Hapax legomena ratio | Fraction of words appearing once | AI has fewer unique words |
| Zipf deviation | How word frequencies follow Zipf's law | AI deviates from the natural distribution |
| Readability consistency | Flesch-Kincaid variance across paragraphs | AI maintains constant readability |
| Bigram/trigram repetition | Repeated word pairs and triples | AI repeats phrase structures |
| Transition word ratio | "however", "moreover", "furthermore"... | AI overuses transitions |
| Slop word density | Known AI vocabulary markers | "delve", "tapestry", "beacon"... |
| Punctuation entropy | Diversity of punctuation usage | AI is more predictable |
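As one illustration, the Zipf-deviation feature measures how far the observed rank-frequency curve strays from the ideal Zipf line, where the r-th most common word has frequency f(1)/r. The sketch below is a simplified stand-in (the function name and the RMS formulation are illustrative, not lmscan's exact implementation):

```python
import math
from collections import Counter

def zipf_deviation(text):
    """RMS deviation of observed log-frequencies from the ideal Zipf line,
    where the r-th most common word has frequency f(1)/r."""
    counts = sorted(Counter(text.lower().split()).values(), reverse=True)
    if len(counts) < 2:
        return 0.0
    top = math.log(counts[0])
    devs = [math.log(f) - (top - math.log(r))
            for r, f in enumerate(counts, start=1)]
    return math.sqrt(sum(d * d for d in devs) / len(devs))
```

Human text typically tracks the Zipf line fairly closely on long samples; a consistently large deviation is one weak AI signal among many.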

Each raw feature value is mapped to a 0-1 signal through a sigmoid transformation; a weighted combination of these signals yields the final AI probability.
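A minimal sketch of that pipeline, using burstiness as the lone feature (the sigmoid midpoint, steepness, and weights here are illustrative placeholders, not lmscan's calibrated values):

```python
import math
import re

def burstiness(text):
    """Coefficient of variation of sentence lengths: low = uniform = AI-like."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    var = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return math.sqrt(var) / mean

def signal(value, midpoint, steepness):
    """Squash a raw feature into a 0-1 AI signal with a logistic curve
    (values below the midpoint push toward 1, i.e. AI-like)."""
    return 1.0 / (1.0 + math.exp(steepness * (value - midpoint)))

def ai_probability(text):
    # Illustrative: one feature with weight 1.0; lmscan combines 12 this way.
    weights = {"burstiness": 1.0}
    signals = {"burstiness": signal(burstiness(text), midpoint=0.4, steepness=10.0)}
    return sum(weights[k] * signals[k] for k in signals) / sum(weights.values())
```

Uniform sentence lengths drive burstiness toward zero, which the sigmoid turns into a high AI signal; varied lengths do the opposite.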

Model Fingerprinting

lmscan includes vocabulary fingerprints for 5 major LLM families:

| Model | Distinctive markers |
|---|---|
| GPT-4 / ChatGPT | "delve", "tapestry", "landscape", "leverage", "multifaceted", "it's important to note" |
| Claude (Anthropic) | "certainly", "I'd be happy to", "straightforward", "I should note" |
| Gemini (Google) | "crucial", "here's a breakdown", "keep in mind" |
| Llama / Meta | "awesome", "fantastic", "hope this helps" |
| Mistral / Mixtral | "indeed", "moreover", "hence", "noteworthy" |

Attribution uses weighted vocabulary matching, phrase detection, and hedging pattern analysis.
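In sketch form, the weighted vocabulary matching step looks like this (the marker lists and weights below are illustrative excerpts, not lmscan's shipped fingerprints; phrase detection and hedging analysis are omitted):

```python
# Hypothetical fingerprint excerpts keyed by model family.
FINGERPRINTS = {
    "GPT-4 / ChatGPT": {"delve": 3.0, "tapestry": 3.0,
                        "it's important to note": 2.5, "landscape": 1.5},
    "Claude (Anthropic)": {"i'd be happy to": 3.0, "i should note": 2.5,
                           "straightforward": 1.5},
    "Gemini (Google)": {"here's a breakdown": 3.0, "crucial": 1.5},
}

def attribute(text):
    """Rank model families by weighted marker hits, normalized to sum to 1."""
    lowered = text.lower()
    raw = {model: sum(w for marker, w in markers.items() if marker in lowered)
           for model, markers in FINGERPRINTS.items()}
    total = sum(raw.values()) or 1.0
    return sorted(((model, score / total) for model, score in raw.items()),
                  key=lambda pair: pair[1], reverse=True)
```

The normalized scores behave like the percentages in the sample output above: strong signals, not guaranteed attribution.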

Accuracy & Limitations

What lmscan is good at:

  • Detecting text with strong AI stylistic patterns
  • Identifying which model family generated text
  • Scanning at scale (thousands of documents) with zero cost
  • Providing explainable evidence (not a black box)

What lmscan cannot do:

  • Detect AI text that has been manually edited or paraphrased
  • Work reliably on very short text (<50 words)
  • Detect AI text in non-English languages (English-only for now)
  • Replace human judgment — use as a signal, not a verdict

This is statistical analysis, not a neural classifier. It detects stylistic patterns, not watermarks. It works best on unedited LLM output and degrades gracefully on edited text.

CI Integration

GitHub Actions

- name: AI Content Check
  run: |
    pip install lmscan
    lmscan submission.txt --threshold 0.7 --format json

Pre-commit

repos:
  - repo: https://github.com/stef41/lmscan
    rev: v0.1.0
    hooks:
      - id: lmscan
        args: ["--threshold", "0.7"]

Research Background

lmscan's approach is informed by published research on AI text detection:

  • DetectGPT (Mitchell et al., 2023) — perturbation-based detection using log probability curvature
  • GLTR (Gehrmann et al., 2019) — statistical visualization of token predictions
  • Binoculars (Hans et al., 2024) — cross-model perplexity comparison
  • Zipf's Law in NLP — word frequency distributions differ between human and AI text
  • Stylometry — decades of authorship attribution research applied to AI forensics

lmscan takes the statistical intuitions from these papers and implements them as lightweight, dependency-free heuristics that work without requiring a reference language model.

FAQ

Q: Is this as accurate as GPTZero? A: GPTZero uses neural classifiers trained on labeled data. lmscan uses statistical heuristics. GPTZero is more accurate on edge cases; lmscan is free, offline, and explainable. Use both if accuracy matters.

Q: Can students use this to evade AI detection? A: lmscan shows which features trigger detection, which could help someone understand why text reads as AI-generated. This is by design โ€” understanding AI writing patterns makes everyone a better writer. The same information is available in published research papers.

Q: Does it work on non-English text? A: Currently English-only. The slop word lists and transition word lists are English-specific. Statistical features (entropy, burstiness) work across languages but haven't been calibrated.

Q: Does it phone home? A: No. Zero network requests. No telemetry. No API keys. Everything runs locally.

Q: How is model attribution possible without running the model? A: Each LLM family has characteristic vocabulary biases. GPT-4 loves "delve" and "tapestry". Claude says "I'd be happy to". These are statistical fingerprints โ€” not guaranteed attribution, but strong signals.

License

Apache-2.0
