
Quantitative author fingerprinting & stylometric analysis - offline CLI tool

Project description

Stylometry CLI (local/offline) — v1.0

This is a small, offline Python tool that extracts stylometric features and patterns from text and, optionally, computes simple similarity signals between corpora using character n-grams.

It’s designed to slot into your Stylometry Orchestrator workflow by emitting SAO-style ResultBundle_*.json files plus CSV artifacts.

What it does

For each document (and each chunk of a document), it computes:

  • Lexical

    • word count, unique word count
    • average word length
    • MATTR lexical diversity (more length-robust than raw TTR)
  • Syntactic (proxy)

    • average sentence length
    • sentence length variation (population SD)
  • Habitual

    • function word frequencies (configurable list)
    • punctuation rates (commas/semicolons/etc per 1000 words and per sentence)
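To make the MATTR metric concrete: it slides a fixed-size window over the token stream, takes the type-token ratio inside each window, and averages the results, which keeps the score comparable across documents of different lengths. The sketch below is a hypothetical implementation (the `mattr` helper is illustrative, not the tool's actual code):

```python
from collections import deque

def mattr(tokens, window=500):
    """Moving-Average Type-Token Ratio: average the TTR over every
    sliding window of `window` tokens. Raw TTR drops as texts get
    longer; averaging fixed windows removes most of that length bias."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Too short for even one full window: fall back to plain TTR.
        return len(set(tokens)) / len(tokens)
    counts = {}          # token -> count inside the current window
    win = deque()
    ttrs = []
    for tok in tokens:
        win.append(tok)
        counts[tok] = counts.get(tok, 0) + 1
        if len(win) > window:
            old = win.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        if len(win) == window:
            ttrs.append(len(counts) / window)
    return sum(ttrs) / len(ttrs)
```

The incremental count map keeps each window update O(1), so the whole pass is linear in document length rather than O(n × window).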

If 2+ corpora are provided and there are enough chunks, it also computes:

  • Char n-gram TF-IDF centroid cosine similarity across corpora (corpus_similarity_char_ngrams.csv)
  • Nearest-centroid chunk assignment (chunk_assignments_char_ngrams.csv)
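The comparison step can be sketched as: fit a single char-n-gram TF-IDF space over every chunk from every corpus, average each corpus's chunk vectors into a centroid, and take pairwise cosine similarities between centroids. This is a minimal illustration under those assumptions (the `corpus_centroid_similarity` helper is hypothetical, not the tool's code):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def corpus_centroid_similarity(chunks_by_label, ngram_range=(3, 5)):
    """Fit one char-n-gram TF-IDF space over all chunks, average each
    corpus's chunk vectors into a centroid, and return the labels plus
    the pairwise cosine-similarity matrix between centroids."""
    labels = list(chunks_by_label)
    all_chunks = [c for lab in labels for c in chunks_by_label[lab]]
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=ngram_range)
    X = vec.fit_transform(all_chunks)          # sparse (n_chunks, n_features)
    centroids, start = [], 0
    for lab in labels:
        n = len(chunks_by_label[lab])
        # Mean of a sparse slice returns a matrix; densify for stacking.
        centroids.append(np.asarray(X[start:start + n].mean(axis=0)))
        start += n
    return labels, cosine_similarity(np.vstack(centroids))
```

Nearest-centroid chunk assignment would follow the same pattern: score each chunk vector against every centroid with `cosine_similarity` and assign it to the argmax. Fitting one shared vocabulary across all corpora is what makes the centroids directly comparable.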

Note: these are signals, not definitive authorship proof. Topic/genre/boilerplate can dominate.

Requirements

  • Windows, macOS, or Linux
  • Python 3.12+
  • dependencies from requirements.txt (installed via pip)

Install (Windows PowerShell)

py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

Quick check:

python -c "import numpy, pandas, sklearn; print('ok')"

Input formats

You provide one or more --corpus LABEL=PATH arguments.

PATH can be:

  • a single .txt / .md file
  • a folder containing .txt / .md files (recursively)
  • a .zip archive containing .txt / .md files (recursively)

Examples of folder layouts that work:

Single corpus

my_corpus/
  speech1.txt
  speech2.txt
  speech3.txt

Multiple corpora

corpora/
  A/
    doc1.txt
    doc2.txt
  B/
    doc3.txt
    doc4.txt

You can point each corpus to its subfolder:

  • --corpus A=corpora/A --corpus B=corpora/B
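Resolving a PATH argument into documents can be sketched roughly as follows (the `collect_texts` helper and its behavior are assumptions for illustration, not the tool's actual loader):

```python
import zipfile
from pathlib import Path

TEXT_EXTS = {".txt", ".md"}

def collect_texts(path):
    """Yield (name, text) pairs for a corpus PATH: a single .txt/.md
    file, a folder (searched recursively), or a .zip archive."""
    p = Path(path)
    if p.is_file() and p.suffix.lower() == ".zip":
        with zipfile.ZipFile(p) as zf:
            for info in zf.infolist():
                if Path(info.filename).suffix.lower() in TEXT_EXTS:
                    # Forgiving decode so one bad byte doesn't kill a run.
                    yield info.filename, zf.read(info).decode("utf-8", errors="replace")
    elif p.is_file() and p.suffix.lower() in TEXT_EXTS:
        yield p.name, p.read_text(encoding="utf-8", errors="replace")
    elif p.is_dir():
        for f in sorted(p.rglob("*")):
            if f.suffix.lower() in TEXT_EXTS:
                yield str(f.relative_to(p)), f.read_text(encoding="utf-8", errors="replace")
```

Sorting the recursive listing keeps document order (and therefore chunk numbering) reproducible across runs.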

Run examples

1) Characterize a single document

python stylometry_run.py --task characterize --corpus TextA=./speech1.txt --output ./out_textA

2) Build a profile from many documents (single corpus)

python stylometry_run.py --task profile_build --corpus PersonX=./my_corpus --output ./out_personx

3) Compare two corpora

python stylometry_run.py --task compare --corpus A=./corpora/A --corpus B=./corpora/B --output ./out_compare

4) Use zip archives

python stylometry_run.py --task compare --corpus A=./A.zip --corpus B=./B.zip --output ./out_compare_zip

Outputs

The output folder contains:

  • manifest.json — corpus manifest (doc list + word counts + local provenance paths)
  • doc_metrics.csv — per-document metrics
  • chunk_metrics.csv — per-chunk metrics
  • ResultBundle_ArtifactExtractor.json — SAO-compatible bundle describing artifacts produced
  • run_metadata.json — parameters and reproducibility info

If 2+ corpora and enough chunks:

  • corpus_similarity_char_ngrams.csv
  • chunk_assignments_char_ngrams.csv
  • ResultBundle_Comparator.json

If matplotlib is installed and working, it also saves:

  • plot_avg_sentence_len_boxplot.png
  • plot_mattr_boxplot.png

Useful options

  • --chunk-words 1200 — set chunk size (default 1200)
  • --mattr-window 500 — MATTR window size (default 500)
  • --function-words-file path.txt — override function word list (newline-delimited)
  • --include-chunk-text — include chunk text in chunk_metrics.csv (can be large)
  • --char-analyzer char_wb|char — default char_wb (often better for stylometry)
  • --max-features 50000 and --min-df 2 — control n-gram feature size
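One plausible reading of --chunk-words, sketched below (the `chunk_words` helper is hypothetical, not the tool's code): split each document into consecutive runs of at most N whitespace-delimited tokens, with the final chunk allowed to be shorter.

```python
def chunk_words(text, chunk_size=1200):
    """Split a document into consecutive chunks of at most
    `chunk_size` whitespace-delimited tokens."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```

Smaller chunks give more data points for the per-chunk metrics and the nearest-centroid assignment, at the cost of noisier estimates per chunk.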

Notes for political/public-figure corpora

Prepared remarks and official publications can reflect speechwriters, staff editing, or transcript normalization. Use “channel-specific” corpora where possible (e.g., floor speeches vs press releases vs prepared remarks).

Troubleshooting

  • If plots aren’t produced: ensure matplotlib is installed and the output folder is writable.
  • If you hit Unicode errors: convert source files to UTF-8; otherwise the script falls back to a forgiving decode.
  • If it’s slow on huge corpora: increase --min-df, reduce --max-features, or reduce corpus size.

Download files

Source distribution

  • stylometry_cli-1.0.0.tar.gz (28.8 kB)

Built distribution

  • stylometry_cli-1.0.0-py3-none-any.whl (29.6 kB)

File details

Details for the file stylometry_cli-1.0.0.tar.gz.

File metadata

  • Download URL: stylometry_cli-1.0.0.tar.gz
  • Size: 28.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stylometry_cli-1.0.0.tar.gz:

  • SHA256: 4513f192425d9acddeabe5d34279f146f779670c0dbd729ad54b5dc0c99eee2d
  • MD5: b1ed9322b77bd69421601a09e4241500
  • BLAKE2b-256: 5d3fb87e73cbb2313e80530b59519fe22e305399b4df417c12ebefcd58ea0017

Provenance

The following attestation bundles were made for stylometry_cli-1.0.0.tar.gz:

Publisher: publish.yml on SpectreDeath/stylometry-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file stylometry_cli-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: stylometry_cli-1.0.0-py3-none-any.whl
  • Size: 29.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stylometry_cli-1.0.0-py3-none-any.whl:

  • SHA256: 541adf6e270b47d6232d97c4be124817410ea521fef4a1a1bd18cfbfc21df92a
  • MD5: b4d8330dd7def0ce7cefcf9085742590
  • BLAKE2b-256: e401066c956387cb1a719cd467d5984d5af2fbc3f258a93cc945bf1612394c26

Provenance

The following attestation bundles were made for stylometry_cli-1.0.0-py3-none-any.whl:

Publisher: publish.yml on SpectreDeath/stylometry-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
