stylometry-python
Authorship attribution and stylometric analysis in Python.
A lightweight, dependency-minimal library for measuring writing style, attributing authorship, and detecting stylistic shifts introduced by LLMs.
pip install stylometry-python
What is stylometry?
Stylometry is the statistical analysis of writing style. Every author has unconscious stylistic habits — frequency of function words, sentence length patterns, punctuation choices — that form a measurable fingerprint.
Mosteller & Wallace used it in 1964 to resolve the Federalist Papers authorship debate. In 2013, Patrick Juola used it to identify J.K. Rowling as the author behind the pseudonym Robert Galbraith.
This library makes those techniques accessible in 5 lines of Python.
Quickstart
from stylometry import StyleAnalyzer
sa = StyleAnalyzer()
# Fit on known texts
sa.fit(zola_texts, label="Zola")
sa.fit(maupassant_texts, label="Maupassant")
# Attribute an unknown text
predicted, distances = sa.predict(unknown_text)
print(f"Predicted author: {predicted}")
# → Predicted author: Zola
# Measure stylistic shift (original vs LLM rewrite)
shift = sa.shift(original_text, gpt_rewrite)
print(f"Stylistic shift: {shift:.4f}")
# → Stylistic shift: 0.2409
Installation
pip install stylometry-python
Dependencies: numpy, matplotlib, scikit-learn — nothing else. Works 100% offline. No API keys. No GPU.
Development setup
On macOS (Homebrew Python), use a virtual environment to avoid
externally-managed-environment errors:
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements-dev.txt
Equivalent direct command:
python3 -m pip install -e ".[dev]"
Run tests:
python3 -m pytest
For a full local workflow (venv, tests, coverage, lint, format), see
docs/DEVELOPMENT.md.
Continuous Integration
GitHub Actions runs lint + tests on each push and pull request:
ruff check .
black --check .
pytest --cov=stylometry --cov-report=term-missing
Workflow file: .github/workflows/ci.yml
Publishing
Package publication is automated after a successful Release workflow run.
Tags are generated from Conventional Commits by semantic-release.
Semantic release runs only after CI passes on the target branch.
Release workflow: .github/workflows/release.yml
Publish workflow: .github/workflows/publish.yml
Core API
StyleAnalyzer(function_words=None, language='fr', min_words=50)
The main class. Handles vectorization, attribution, and visualization.
from stylometry import StyleAnalyzer
# French (default) — 41 function words
sa = StyleAnalyzer()
# Custom vocabulary
sa = StyleAnalyzer(function_words=['the', 'of', 'and', 'to', 'a', 'in'])
# English preset
sa = StyleAnalyzer(language='en')
vectorize(text) → np.ndarray
Convert a text to a style vector (L2-normalized function word frequencies).
v = sa.vectorize("Il pleuvait a verse. La nuit etait noire...")
print(v.shape) # (41,)
print((v ** 2).sum()) # ≈ 1.0: an L2-normalized vector has unit Euclidean norm
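To make the idea of a style vector concrete, here is a minimal standalone sketch of function-word vectorization. It is not the library's implementation: the six-word vocabulary, the tokenizer regex, and the standalone `vectorize` function are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical six-word English vocabulary, for illustration only.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in"]


def vectorize(text, vocabulary=FUNCTION_WORDS):
    """Return L2-normalized relative frequencies of the vocabulary words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    freqs = [counts[w] / total for w in vocabulary]
    norm = math.sqrt(sum(f * f for f in freqs))
    # An all-zero vector cannot be normalized; return it unchanged.
    return [f / norm for f in freqs] if norm else freqs


v = vectorize("The cat sat in the hat, and the dog went to a park in the rain.")
print(len(v))                             # one entry per vocabulary word
print(math.sqrt(sum(x * x for x in v)))   # Euclidean norm of the vector
```

With L2 normalization the Euclidean norm of the vector is 1, while its entries generally do not sum to 1.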
fit(texts, label) → self
Compute a centroid from a list of texts. Chainable.
sa.fit(zola_corpus, "Zola").fit(maupassant_corpus, "Maupassant")
predict(text) → (label, distances)
Attribute a text to the nearest centroid.
predicted, distances = sa.predict(unknown)
print(predicted) # "Zola"
print(distances) # {"Zola": 0.12, "Maupassant": 0.43}
print(sa.confidence(distances)) # "HIGH" / "MEDIUM" / "LOW"
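The fit and predict steps above amount to nearest-centroid classification. The sketch below shows that technique on plain Python lists. It assumes cosine distance as the metric (the same metric documented for shift); the helper names, the labels "A" and "B", and the toy vectors are all invented for illustration and are not taken from the library.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; defined here as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def predict(vector, centroids):
    """Return (nearest label, {label: distance}) for a style vector."""
    distances = {
        label: cosine_distance(vector, c) for label, c in centroids.items()
    }
    return min(distances, key=distances.get), distances


# Toy 3-dimensional style vectors, invented for illustration.
centroids = {
    "A": centroid([[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]),
    "B": centroid([[0.1, 0.2, 0.7], [0.1, 0.0, 0.9]]),
}
label, distances = predict([0.8, 0.2, 0.1], centroids)
print(label, distances)
```

Because function-word frequencies are non-negative, cosine distance between two such vectors stays between 0 (same direction) and 1 (no shared words).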
shift(original, rewrite) → float
Measure the cosine distance between two texts in style space. Use this to quantify how much an LLM changed the style of a text.
shift = sa.shift(original, gpt4_rewrite)
# 0.00 = style unchanged
# 0.24 = significant shift (typical GPT-4)
# 1.00 = maximally different
cosine_distance(text_a, text_b) → float
Direct cosine distance between two texts.
d = sa.cosine_distance(text_a, text_b)
Visualization
plot_fingerprint(texts_dict, top_n=15)
Bar chart comparing function word frequencies across groups.
fig = sa.plot_fingerprint(
texts_dict={
"Zola": zola_corpus,
"Maupassant": maupassant_corpus,
"GPT-4": gpt4_corpus,
},
top_n=12,
title="Writing fingerprints",
)
fig.savefig("fingerprints.png", dpi=150)
plot_clusters(texts_groups, labels)
PCA scatter plot — visualize stylistic distances between groups.
fig = sa.plot_clusters(
texts_groups=[zola_corpus, maupassant_corpus, gpt4_corpus],
labels=["Zola", "Maupassant", "GPT-4"],
title="Do LLMs form a distinct stylistic cluster?",
)
plot_shift_distribution(originals, rewrites_dict)
Box plot of cosine shifts per model.
fig = sa.plot_shift_distribution(
originals=original_texts,
rewrites_dict={
"GPT-4": gpt4_rewrites,
"Claude 3": claude_rewrites,
},
)
Code Stylometry
Apply stylometry to source code. Measure developer fingerprints.
from stylometry.code import CodeAnalyzer
ca = CodeAnalyzer()
# Fit on known code samples
ca.fit(alice_code_files, label="Alice")
ca.fit(bob_code_files, label="Bob")
# Attribute an unknown file
predicted, distances = ca.predict(unknown_file)
print(f"Predicted author: {predicted}")
# Detect Copilot patterns
copilot_score = ca.copilot_score(code_file)
print(f"Copilot likelihood: {copilot_score:.2f}")
Code features measured:
| Feature | Description |
|---|---|
| camelCase_ratio | Fraction of identifiers in camelCase |
| snake_case_ratio | Fraction of identifiers in snake_case |
| comment_density | Comment lines / total non-empty lines |
| docstring_density | Docstring occurrences / non-empty lines |
| type_hint_usage | Type annotations per line |
| list_comp_usage | List comprehensions per line |
| avg_line_length | Average line length (normalized) |
| blank_line_ratio | Blank lines / total lines |
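As a rough illustration of how a few of these features could be computed, here is a crude regex-based sketch for Python source. It is not the CodeAnalyzer implementation: the `code_features` function, the regular expressions, and the choice to count distinct non-keyword identifiers are assumptions, and identifiers inside strings or trailing comments are counted too.

```python
import keyword
import re

CAMEL = re.compile(r"[a-z]+(?:[A-Z][a-z0-9]*)+")   # e.g. secondValue
SNAKE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)+")    # e.g. first_value


def code_features(source):
    """Compute four of the listed ratios for a Python source string."""
    lines = source.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    comment_lines = [ln for ln in non_empty if ln.lstrip().startswith("#")]
    # Ignore full-line comments when collecting identifiers.
    code_text = "\n".join(
        ln for ln in non_empty if not ln.lstrip().startswith("#")
    )
    identifiers = {
        tok
        for tok in re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", code_text)
        if not keyword.iskeyword(tok)
    }
    n_ids = len(identifiers) or 1  # avoid division by zero
    return {
        "camelCase_ratio": sum(bool(CAMEL.fullmatch(i)) for i in identifiers) / n_ids,
        "snake_case_ratio": sum(bool(SNAKE.fullmatch(i)) for i in identifiers) / n_ids,
        "comment_density": len(comment_lines) / (len(non_empty) or 1),
        "blank_line_ratio": (len(lines) - len(non_empty)) / (len(lines) or 1),
    }


sample = (
    "# add two numbers\n"
    "def add_two(first_value, secondValue):\n"
    "    return first_value + secondValue\n"
    "\n"
    "x = 1\n"
)
print(code_features(sample))
```

A production version would more likely parse the file with the standard `ast` or `tokenize` modules so that strings and comments are handled correctly.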
Examples
See the examples/ directory:
- examples/rowling.py: Reproduce the Rowling identification experiment
- examples/llm_shift.py: Measure GPT-4 stylistic shift on your own texts
- examples/code_attribution.py: Attribute code files to developers
cd examples
python rowling.py
# → Most likely author: Rowling (distance: 0.18)
# → Second closest: Rendell (distance: 0.31)
Limitations
Stylometry provides probabilistic signals, not forensic proof.
- Minimum ~100 words per text for reliable results
- Function word analysis is language-dependent
- Cross-domain generalization degrades significantly
- LLM detection is prompt-dependent and model-dependent
See LIMITATIONS.md for a full discussion.
References
- Mosteller & Wallace (1964). Inference and Disputed Authorship: The Federalist.
- Juola (2015). The Rowling Case. DSH, Oxford.
- Stamatatos (2009). A Survey of Modern Authorship Attribution Methods. JASIST.
- Kestemont et al. (2020). PAN @ CLEF 2020 Authorship Verification.
- Caliskan et al. (2015). De-anonymizing Programmers via Code Stylometry. USENIX.
License
MIT