stylometry-python

Authorship attribution and stylometric analysis in Python.

A lightweight, dependency-minimal library for measuring writing style, attributing authorship, and detecting stylistic shifts introduced by LLMs.

pip install stylometry-python

Requires Python 3.9+. License: MIT.


What is stylometry?

Stylometry is the statistical analysis of writing style. Every author has unconscious stylistic habits — frequency of function words, sentence length patterns, punctuation choices — that form a measurable fingerprint.

Mosteller & Wallace used it in 1964 to resolve the Federalist Papers authorship debate. In 2013, Patrick Juola used it to identify J.K. Rowling as the author behind the pseudonym Robert Galbraith.

This library makes those techniques accessible in 5 lines of Python.


Quickstart

from stylometry import StyleAnalyzer

sa = StyleAnalyzer()

# Fit on known texts
sa.fit(zola_texts, label="Zola")
sa.fit(maupassant_texts, label="Maupassant")

# Attribute an unknown text
predicted, distances = sa.predict(unknown_text)
print(f"Predicted author: {predicted}")
# → Predicted author: Zola

# Measure stylistic shift (original vs LLM rewrite)
shift = sa.shift(original_text, gpt_rewrite)
print(f"Stylistic shift: {shift:.4f}")
# → Stylistic shift: 0.2409

Installation

pip install stylometry-python

Dependencies: numpy, matplotlib, scikit-learn — nothing else. Works 100% offline. No API keys. No GPU.

Development setup

On macOS (Homebrew Python), use a virtual environment to avoid externally-managed-environment errors:

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements-dev.txt

Equivalent direct command:

python3 -m pip install -e ".[dev]"

Run tests:

python3 -m pytest

For a full local workflow (venv, tests, coverage, lint, format), see docs/DEVELOPMENT.md.

Continuous Integration

GitHub Actions runs lint + tests on each push and pull request:

  • ruff check .
  • black --check .
  • pytest --cov=stylometry --cov-report=term-missing

Workflow file: .github/workflows/ci.yml

Publishing

Package publication is automated after a successful Release workflow run. Tags are generated from Conventional Commits by semantic-release. Semantic release runs only after CI passes on the target branch.

Release workflow: .github/workflows/release.yml
Publish workflow: .github/workflows/publish.yml


Core API

StyleAnalyzer(function_words=None, language='fr', min_words=50)

The main class. Handles vectorization, attribution, and visualization.

from stylometry import StyleAnalyzer

# French (default) — 41 function words
sa = StyleAnalyzer()

# Custom vocabulary
sa = StyleAnalyzer(function_words=['the', 'of', 'and', 'to', 'a', 'in'])

# English preset
sa = StyleAnalyzer(language='en')

vectorize(text) → np.ndarray

Convert a text to a style vector (L2-normalized function word frequencies).

import numpy as np

v = sa.vectorize("Il pleuvait a verse. La nuit etait noire...")
print(v.shape)  # (41,)
print(np.linalg.norm(v))  # ≈ 1.0: L2 normalization fixes the Euclidean norm, not the sum
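As a rough illustration of what a style vector contains, the sketch below counts a handful of function words, converts the counts to relative frequencies, and L2-normalizes the result. It uses only the standard library and is not the library's implementation: the five-word vocabulary, the regex tokenizer, and the standalone `vectorize` function are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical mini vocabulary; the library's 41-word French list is not reproduced here.
FUNCTION_WORDS = ["le", "la", "de", "et", "un"]


def vectorize(text, function_words=FUNCTION_WORDS):
    """Return L2-normalized relative frequencies of the given function words."""
    tokens = re.findall(r"\w+", text.lower())  # naive tokenizer (assumption)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    freqs = [counts[w] / total for w in function_words]
    norm = math.sqrt(sum(f * f for f in freqs))
    # An all-zero vector cannot be normalized; return it unchanged.
    return [f / norm for f in freqs] if norm else freqs


v = vectorize("Le chat et le chien. La nuit de la ville.")
print(len(v))                                       # 5
print(round(math.sqrt(sum(x * x for x in v)), 6))   # 1.0
```

Because of the normalization, two texts with the same function-word proportions map to the same vector regardless of their length.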

fit(texts, label) → self

Compute a centroid from a list of texts. Chainable.

sa.fit(zola_corpus, "Zola").fit(maupassant_corpus, "Maupassant")

predict(text) → (label, distances)

Attribute a text to the nearest centroid.

predicted, distances = sa.predict(unknown)

print(predicted)           # "Zola"
print(distances)           # {"Zola": 0.12, "Maupassant": 0.43}
print(sa.confidence(distances))  # "HIGH" / "MEDIUM" / "LOW"
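The fit and predict steps can be pictured as nearest-centroid classification. The sketch below is a minimal stdlib version under stated assumptions: the centroid is taken to be the component-wise mean of the style vectors, the distance is assumed to be cosine distance, and the function names and toy numbers are made up for illustration. The library's internals may differ.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length style vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen for this sketch
    return 1.0 - dot / (norm_a * norm_b)


def predict(vector, centroids):
    """Return the label of the nearest centroid and all distances."""
    distances = {label: cosine_distance(vector, c) for label, c in centroids.items()}
    return min(distances, key=distances.get), distances


# Toy 3-dimensional style vectors (made-up numbers).
centroids = {
    "A": centroid([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]),
    "B": centroid([[0.1, 0.2, 0.7], [0.1, 0.4, 0.5]]),
}
label, distances = predict([0.7, 0.2, 0.1], centroids)
print(label)  # A
```

A small gap between the two closest distances is a sign that the attribution should be treated with caution.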

shift(original, rewrite) → float

Measure the cosine distance between two texts in style space. Use this to quantify how much an LLM changed the style of a text.

shift = sa.shift(original, gpt4_rewrite)
# 0.00 = style unchanged
# 0.24 = significant shift (typical GPT-4)
# 1.00 = maximally different
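To make the 0-to-1 scale concrete, here is a small worked example of cosine distance on raw count vectors. It is a sketch, not the library's code, and the three vectors are invented. Since frequency vectors have no negative entries, the distance stays between 0 and 1.

```python
import math


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


original = [3.0, 1.0, 0.0]
same_proportions = [6.0, 2.0, 0.0]  # same word proportions, text twice as long
disjoint = [0.0, 0.0, 5.0]          # no function words in common

d_same = cosine_distance(original, same_proportions)   # 0 up to rounding error
d_disjoint = cosine_distance(original, disjoint)       # exactly 1.0
print(abs(d_same) < 1e-9, d_disjoint)
```

The first pair shows that the measure ignores text length and depends only on the proportions of function words.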

cosine_distance(text_a, text_b) → float

Direct cosine distance between two texts.

d = sa.cosine_distance(text_a, text_b)

Visualization

plot_fingerprint(texts_dict, top_n=15)

Bar chart comparing function word frequencies across groups.

fig = sa.plot_fingerprint(
    texts_dict={
        "Zola": zola_corpus,
        "Maupassant": maupassant_corpus,
        "GPT-4": gpt4_corpus,
    },
    top_n=12,
    title="Writing fingerprints",
)
fig.savefig("fingerprints.png", dpi=150)

plot_clusters(texts_groups, labels)

PCA scatter plot — visualize stylistic distances between groups.

fig = sa.plot_clusters(
    texts_groups=[zola_corpus, maupassant_corpus, gpt4_corpus],
    labels=["Zola", "Maupassant", "GPT-4"],
    title="Do LLMs form a distinct stylistic cluster?",
)

plot_shift_distribution(originals, rewrites_dict)

Box plot of cosine shifts per model.

fig = sa.plot_shift_distribution(
    originals=original_texts,
    rewrites_dict={
        "GPT-4": gpt4_rewrites,
        "Claude 3": claude_rewrites,
    },
)
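If you only need the numbers rather than the figure, the quantities a box plot summarizes can be computed directly. The sketch below is an illustration with made-up shift values and hypothetical model names. It uses `statistics.quantiles` from the standard library, whose quartile convention may differ slightly from the one matplotlib uses when drawing boxes.

```python
import statistics


def shift_summary(shifts):
    """Five-number summary of a list of shift values."""
    q1, median, q3 = statistics.quantiles(shifts, n=4, method="inclusive")
    return {"min": min(shifts), "q1": q1, "median": median, "q3": q3, "max": max(shifts)}


# Made-up shift values per model, for illustration only.
shifts_by_model = {
    "model_a": [0.1, 0.2, 0.3, 0.4, 0.5],
    "model_b": [0.05, 0.1, 0.15, 0.2, 0.25],
}
summaries = {name: shift_summary(values) for name, values in shifts_by_model.items()}
for name, summary in summaries.items():
    print(name, summary)
```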

Code Stylometry

Apply stylometry to source code. Measure developer fingerprints.

from stylometry.code import CodeAnalyzer

ca = CodeAnalyzer()

# Fit on known code samples
ca.fit(alice_code_files, label="Alice")
ca.fit(bob_code_files, label="Bob")

# Attribute an unknown file
predicted, distances = ca.predict(unknown_file)
print(f"Predicted author: {predicted}")

# Detect Copilot patterns
copilot_score = ca.copilot_score(code_file)
print(f"Copilot likelihood: {copilot_score:.2f}")

Code features measured:

Feature             Description
camelCase_ratio     Fraction of identifiers in camelCase
snake_case_ratio    Fraction of identifiers in snake_case
comment_density     Comment lines / total non-empty lines
docstring_density   Docstring occurrences / non-empty lines
type_hint_usage     Type annotations per line
list_comp_usage     List comprehensions per line
avg_line_length     Average line length (normalized)
blank_line_ratio    Blank lines / total lines
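A few of these features can be approximated with the standard library alone. The sketch below is not CodeAnalyzer's implementation. It assumes Python source as input, counts only full-line `#` comments, and collects identifiers from the `ast` module. The regular expressions for camelCase and snake_case are one possible definition among several.

```python
import ast
import re

SNAKE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")
CAMEL = re.compile(r"[a-z][a-z0-9]*([A-Z][a-z0-9]*)+")


def identifiers(source):
    """Collect variable, function, class, and argument names from Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    return names


def code_features(source):
    lines = source.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    comment_lines = [ln for ln in non_empty if ln.lstrip().startswith("#")]
    names = identifiers(source)
    n_names = max(len(names), 1)  # guard against division by zero
    return {
        "camelCase_ratio": sum(bool(CAMEL.fullmatch(n)) for n in names) / n_names,
        "snake_case_ratio": sum(bool(SNAKE.fullmatch(n)) for n in names) / n_names,
        "comment_density": len(comment_lines) / max(len(non_empty), 1),
        "blank_line_ratio": (len(lines) - len(non_empty)) / max(len(lines), 1),
    }


sample = """# add two numbers
def add_values(first_value, secondValue):
    total = first_value + secondValue

    return total
"""
features = code_features(sample)
print(features)
```

In the sample, two of the four distinct identifiers are snake_case and one is camelCase. Single-word names such as `total` match neither pattern.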

Examples

See the examples/ directory:

  • examples/rowling.py — Reproduce the Rowling identification experiment
  • examples/llm_shift.py — Measure GPT-4 stylistic shift on your own texts
  • examples/code_attribution.py — Attribute code files to developers

Run an example:

cd examples
python rowling.py
# → Most likely author: Rowling (distance: 0.18)
# → Second closest:     Rendell (distance: 0.31)

Limitations

Stylometry provides probabilistic signals, not forensic proof.

  • Minimum ~100 words per text for reliable results
  • Function word analysis is language-dependent
  • Cross-domain generalization degrades significantly
  • LLM detection is prompt-dependent and model-dependent

See LIMITATIONS.md for a full discussion.


References

  • Mosteller & Wallace (1964). Inference and Disputed Authorship: The Federalist.
  • Juola (2015). The Rowling Case. Digital Scholarship in the Humanities, Oxford University Press.
  • Stamatatos (2009). A Survey of Modern Authorship Attribution Methods. JASIST.
  • Kestemont et al. (2020). PAN @ CLEF 2020 Authorship Verification.
  • Caliskan et al. (2015). De-anonymizing Programmers via Code Stylometry. USENIX Security Symposium.

License

MIT
