
🔍 lmscan

Detect AI-generated text. Fingerprint which LLM wrote it. Open-source GPTZero alternative.


GPTZero charges $15/month. Originality.ai charges per scan. Turnitin locks you into institutional contracts.

lmscan is free, open-source, works offline, and tells you which model wrote the text.

$ lmscan "In today's rapidly evolving digital landscape, it's important
to note that artificial intelligence has become a pivotal force in
transforming how we navigate the complexities of modern life..."

๐Ÿ” lmscan v0.1.0 โ€” AI Text Forensics
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

  Verdict:     ๐Ÿค– Likely AI (77% confidence)
  Words:       184
  Sentences:   10
  Scanned in 0.01s

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Feature                    โ”‚ Value    โ”‚ Signal             โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Burstiness                 โ”‚ 0.07     โ”‚ ๐Ÿ”ด Very low (AI)    โ”‚
โ”‚ Sentence length variance   โ”‚ 0.27     โ”‚ ๐ŸŸก Below average    โ”‚
โ”‚ Slop word density          โ”‚ 20.7%    โ”‚ ๐Ÿ”ด High (AI)        โ”‚
โ”‚ Transition word ratio      โ”‚ 2.2%     โ”‚ ๐ŸŸก Elevated         โ”‚
โ”‚ Readability consistency    โ”‚ 0.00     โ”‚ ๐Ÿ”ด Very low (AI)    โ”‚
โ”‚ ...                        โ”‚          โ”‚                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ”Ž Model Attribution
  1. GPT-4 / ChatGPT    62% โ€” "delve", "tapestry", "beacon", "landscape" (ร—2), +19 more
  2. Claude (Anthropic)  13% โ€” "robust", "nuanced", "comprehensive"
  3. Gemini (Google)      9% โ€” "furthermore", "additionally"

โš ๏ธ  Flags
  โ€ข Very low burstiness (0.07) โ€” AI text is more uniform in complexity
  โ€ข High slop word density (20.7%) โ€” contains known AI vocabulary markers

Install

pip install lmscan

Zero dependencies. Works with Python 3.9+. No API keys. No internet. No GPU.

Usage

# Scan text directly
lmscan "Your text here..."

# Scan a file
lmscan document.txt

# Pipe from stdin
cat essay.txt | lmscan -

# JSON output (for scripts and CI)
lmscan document.txt --format json

# Per-sentence breakdown
lmscan document.txt --sentences

# CI gate: fail if AI probability > 50%
lmscan submission.txt --threshold 0.5

Python API

from lmscan import scan

result = scan("Text to analyze...")

print(f"AI probability: {result.ai_probability:.0%}")
print(f"Verdict: {result.verdict}")
print(f"Confidence: {result.confidence}")

# Which model wrote it?
for model in result.model_attribution:
    print(f"  {model.model}: {model.confidence:.0%}")
    for evidence in model.evidence[:3]:
        print(f"    → {evidence}")

# Per-sentence analysis
for sentence in result.sentence_scores:
    if sentence.ai_probability > 0.7:
        print(f"  🤖 {sentence.text[:60]}... ({sentence.ai_probability:.0%})")

Scan entire directories

from lmscan import scan_file
import glob

for path in glob.glob("submissions/*.txt"):
    result = scan_file(path)
    print(f"{path}: {result.verdict} ({result.ai_probability:.0%})")

How It Works

lmscan uses 12 statistical features derived from computational linguistics research to distinguish AI-generated text from human writing:

| Feature | What it measures | AI signal |
|---|---|---|
| Burstiness | Variance in sentence complexity | AI text is unusually uniform |
| Sentence length variance | How much sentence lengths vary | AI produces uniform lengths |
| Vocabulary richness | Type-token ratio (Yule's K corrected) | AI reuses words more |
| Hapax legomena ratio | Fraction of words appearing once | AI has fewer unique words |
| Zipf deviation | How word frequencies follow Zipf's law | AI deviates from the natural distribution |
| Readability consistency | Flesch-Kincaid variance across paragraphs | AI maintains constant readability |
| Bigram/trigram repetition | Repeated word pairs and triples | AI repeats phrase structures |
| Transition word ratio | "however", "moreover", "furthermore"... | AI overuses transitions |
| Slop word density | Known AI vocabulary markers ("delve", "tapestry", "beacon"...) | High density signals AI |
| Punctuation entropy | Diversity of punctuation usage | AI is more predictable |
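As an illustration, burstiness can be approximated as the coefficient of variation of sentence lengths. This is a simplified sketch under that definition, not lmscan's actual implementation; the function name and the naive sentence splitter are hypothetical:

```python
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (illustrative only).

    Human prose mixes short and long sentences, so this value tends to be
    higher for human text; uniform AI output pushes it toward zero.
    """
    # Naive sentence split on terminal punctuation.
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in normalized.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)
```

Perfectly uniform text ("One two three. One two three.") scores 0.0; alternating short and long sentences push the score well above zero.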

Each feature is mapped to a signal via a sigmoid transformation, and a weighted combination of those signals produces the final AI probability.
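That combination step can be sketched as follows. The calibration constants, weights, and feature names below are hypothetical, not lmscan's real values:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical calibration: (center, scale, weight) per feature.
# A negative scale inverts the feature: low burstiness means more AI-like.
CALIBRATION = {
    "burstiness":   (0.5, -0.2, 2.0),
    "slop_density": (0.05, 0.03, 3.0),
}

def ai_probability(features: dict) -> float:
    """Combine per-feature sigmoid signals into a weighted average in [0, 1]."""
    total = weight_sum = 0.0
    for name, value in features.items():
        center, scale, weight = CALIBRATION[name]
        signal = sigmoid((value - center) / scale)  # 0..1 evidence of AI
        total += weight * signal
        weight_sum += weight
    return total / weight_sum
```

With these toy constants, the demo's feature values (burstiness 0.07, slop density 20.7%) land well above 0.5, while high-burstiness, low-slop text lands below it.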

Model Fingerprinting

lmscan includes vocabulary fingerprints for 5 major LLM families:

| Model | Distinctive markers |
|---|---|
| GPT-4 / ChatGPT | "delve", "tapestry", "landscape", "leverage", "multifaceted", "it's important to note" |
| Claude (Anthropic) | "certainly", "I'd be happy to", "straightforward", "I should note" |
| Gemini (Google) | "crucial", "here's a breakdown", "keep in mind" |
| Llama / Meta | "awesome", "fantastic", "hope this helps" |
| Mistral / Mixtral | "indeed", "moreover", "hence", "noteworthy" |

Attribution uses weighted vocabulary matching, phrase detection, and hedging pattern analysis.
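A minimal sketch of the vocabulary-matching part is below. The fingerprint sets, model names, and scoring are hypothetical; lmscan's real matcher is weighted and also handles multi-word phrases such as "it's important to note":

```python
# Hypothetical fingerprints; lmscan's real marker lists are larger and weighted.
FINGERPRINTS = {
    "GPT-4 / ChatGPT": {"delve", "tapestry", "landscape", "multifaceted"},
    "Claude": {"certainly", "straightforward", "nuanced"},
    "Gemini": {"crucial", "breakdown"},
}

def attribute(text: str) -> list:
    """Rank model families by the share of their markers found in the text."""
    words = set(text.lower().split())
    scores = []
    for model, markers in FINGERPRINTS.items():
        hits = markers & words
        if hits:
            scores.append((model, len(hits) / len(markers)))
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

For example, a text containing "delve", "tapestry", and "landscape" matches 3 of the 4 GPT-4 markers and ranks that family first.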

Accuracy & Limitations

What lmscan is good at:

  • Detecting text with strong AI stylistic patterns
  • Identifying which model family generated text
  • Scanning at scale (thousands of documents) with zero cost
  • Providing explainable evidence (not a black box)

What lmscan cannot do:

  • Detect AI text that has been manually edited or paraphrased
  • Work reliably on very short text (<50 words)
  • Detect AI text in non-English languages (English-only for now)
  • Replace human judgment — use as a signal, not a verdict

This is statistical analysis, not a neural classifier. It detects stylistic patterns, not watermarks. It works best on unedited LLM output and degrades gracefully on edited text.

CI Integration

GitHub Actions

- name: AI Content Check
  run: |
    pip install lmscan
    lmscan submission.txt --threshold 0.7 --format json

Pre-commit

repos:
  - repo: https://github.com/stef41/lmscan
    rev: v0.1.0
    hooks:
      - id: lmscan
        args: ["--threshold", "0.7"]

Research Background

lmscan's approach is informed by published research on AI text detection:

  • DetectGPT (Mitchell et al., 2023) — perturbation-based detection using log probability curvature
  • GLTR (Gehrmann et al., 2019) — statistical visualization of token predictions
  • Binoculars (Hans et al., 2024) — cross-model perplexity comparison
  • Zipf's Law in NLP — word frequency distributions differ between human and AI text
  • Stylometry — decades of authorship attribution research applied to AI forensics

lmscan takes the statistical intuitions from these papers and implements them as lightweight, dependency-free heuristics that work without requiring a reference language model.

FAQ

Q: Is this as accurate as GPTZero?
A: GPTZero uses neural classifiers trained on labeled data. lmscan uses statistical heuristics. GPTZero is more accurate on edge cases; lmscan is free, offline, and explainable. Use both if accuracy matters.

Q: Can students use this to evade AI detection?
A: lmscan shows which features trigger detection, which could help someone understand why text reads as AI-generated. This is by design — understanding AI writing patterns makes everyone a better writer. The same information is available in published research papers.

Q: Does it work on non-English text?
A: Currently English-only. The slop word and transition word lists are English-specific. Statistical features (entropy, burstiness) work across languages but haven't been calibrated for them.

Q: Does it phone home?
A: No. Zero network requests. No telemetry. No API keys. Everything runs locally.

Q: How is model attribution possible without running the model?
A: Each LLM family has characteristic vocabulary biases. GPT-4 loves "delve" and "tapestry". Claude says "I'd be happy to". These are statistical fingerprints — not guaranteed attribution, but strong signals.

License

Apache-2.0

Download files


Source Distribution

lmscan-0.3.0.tar.gz (45.4 kB)


Built Distribution


lmscan-0.3.0-py3-none-any.whl (32.7 kB)


File details

Details for the file lmscan-0.3.0.tar.gz.

File metadata

  • Download URL: lmscan-0.3.0.tar.gz
  • Size: 45.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for lmscan-0.3.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | a89443ba03a3a581e3b1608ce5a16f245c5b4e8ea642fc361c9827bd159e5a59 |
| MD5 | c5f8135043d0ca3221559a4434006576 |
| BLAKE2b-256 | 570609f9ef4f9f078697c01ba0545fe58fa0b074a640d13ddbb5050305d92b31 |

File details

Details for the file lmscan-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: lmscan-0.3.0-py3-none-any.whl
  • Size: 32.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for lmscan-0.3.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9f58b58fe9ab914a499313d929143f05aedfbd1c5989ae84ab371d199eb66b43 |
| MD5 | 20c732751a5a8d8a2b2d9572c7ab14b9 |
| BLAKE2b-256 | dea4c11a32077abc0c44be4cda47f8b6fdc73d8584c1b3eef733e800f99c8bb7 |
