Quantitative author fingerprinting & stylometric analysis - offline CLI tool

Stylometry CLI (local/offline) — v1.0

This small, offline Python tool extracts stylometric artifacts and patterns from text and, optionally, computes simple similarity signals between corpora using character n-grams.

It’s designed to slot into your Stylometry Orchestrator workflow by emitting SAO-style ResultBundle_*.json files plus CSV artifacts.

What it does

For each document (and each chunk of a document), it computes:

  • Lexical

    • word count, unique word count
    • average word length
    • MATTR lexical diversity (more length-robust than raw TTR)
  • Syntactic (proxy)

    • average sentence length
    • sentence length variation (population SD)
  • Habitual

    • function word frequencies (configurable list)
    • punctuation rates (commas, semicolons, etc., per 1,000 words and per sentence)
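
The per-chunk metrics listed above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: the tokenizer regex, the naive sentence splitter, and the function names (tokenize, mattr, sentence_length_stats, punctuation_per_1000_words) are all assumptions made for the example.

```python
import re
import statistics
from collections import Counter

# Hypothetical tokenizer and sentence splitter; the real tool may differ.
WORD_RE = re.compile(r"[A-Za-z0-9']+")
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def tokenize(text):
    """Lowercased word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def mattr(words, window=500):
    """Moving-average type-token ratio: mean TTR over every sliding window."""
    if not words:
        return 0.0
    if len(words) <= window:
        return len(set(words)) / len(words)  # text shorter than one window
    counts = Counter(words[:window])
    total = len(counts) / window
    for i in range(window, len(words)):
        leaving = words[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[words[i]] += 1
        total += len(counts) / window
    return total / (len(words) - window + 1)


def sentence_length_stats(text):
    """Mean sentence length in words and its population standard deviation."""
    sentences = [s for s in SENT_SPLIT_RE.split(text.strip()) if s]
    lengths = [n for n in (len(tokenize(s)) for s in sentences) if n > 0]
    if not lengths:
        return 0.0, 0.0
    return statistics.fmean(lengths), statistics.pstdev(lengths)


def punctuation_per_1000_words(text, marks=",;:"):
    """Occurrences of each punctuation mark per 1,000 words."""
    n_words = len(tokenize(text))
    if n_words == 0:
        return {m: 0.0 for m in marks}
    return {m: 1000.0 * text.count(m) / n_words for m in marks}
```

Because MATTR averages the type-token ratio over fixed-size windows, its value depends far less on total text length than a single TTR computed over the whole document.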

If two or more corpora are provided and there are enough chunks, it also computes:

  • Char n-gram TF-IDF centroid cosine similarity across corpora (corpus_similarity_char_ngrams.csv)
  • Nearest-centroid chunk assignment (chunk_assignments_char_ngrams.csv)

Note: these are signals, not definitive proof of authorship. Topic, genre, and boilerplate text can dominate the character n-gram features.
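
To make the centroid comparison concrete, here is a standard-library sketch of the idea: build a TF-IDF vector of character n-grams per chunk, average the vectors of each corpus into a centroid, compare centroids by cosine similarity, and assign every chunk to its nearest centroid. The tool itself appears to rely on scikit-learn (the quick check imports sklearn), so the function names, the smoothed IDF formula, and the fixed n-gram size below are assumptions for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams after whitespace normalisation."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(chunks, n=3):
    """One L2-normalised TF-IDF dict per chunk, using a smoothed IDF."""
    counts = [Counter(char_ngrams(c, n)) for c in chunks]
    df = Counter(g for c in counts for g in c)  # document frequency
    n_docs = len(chunks)
    vectors = []
    for c in counts:
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({g: v / norm for g, v in vec.items()})
    return vectors


def centroid(vectors):
    """Component-wise mean of sparse vectors (needs at least one vector)."""
    total = Counter()
    for vec in vectors:
        for g, v in vec.items():
            total[g] += v
    return {g: v / len(vectors) for g, v in total.items()}


def cosine(a, b):
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def compare(corpora, n=3):
    """corpora maps label -> non-empty list of chunk texts.

    Returns (pairwise centroid similarities, nearest-centroid label per chunk).
    """
    labels, texts = [], []
    for label, chunks in corpora.items():
        for chunk in chunks:
            labels.append(label)
            texts.append(chunk)
    vectors = tfidf_vectors(texts, n)
    centroids = {
        label: centroid([v for v, l in zip(vectors, labels) if l == label])
        for label in corpora
    }
    names = list(corpora)
    similarity = {
        (a, b): cosine(centroids[a], centroids[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    }
    assignments = [
        max(names, key=lambda name: cosine(v, centroids[name]))
        for v in vectors
    ]
    return similarity, assignments
```

In this sketch each chunk also contributes to its own corpus centroid, which biases the assignment toward the true label. A held-out scheme would give a stricter signal.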

Requirements

  • Windows, macOS, or Linux
  • Python 3.12+
  • Dependencies installed with pip (see requirements.txt)

Install (Windows PowerShell)

py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

Quick check:

python -c "import numpy, pandas, sklearn; print('ok')"

Input formats

You provide one or more --corpus LABEL=PATH arguments.

PATH can be:

  • a single .txt / .md file
  • a folder containing .txt / .md files (recursively)
  • a .zip archive containing .txt / .md files (recursively)
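
The three accepted PATH kinds can be handled by one small generator. The sketch below is an assumption about how such resolution might look, not the tool's implementation. The name iter_documents and the choice to yield raw bytes (leaving decoding to a later step) are hypothetical.

```python
import zipfile
from pathlib import Path

TEXT_SUFFIXES = {".txt", ".md"}


def iter_documents(path):
    """Yield (name, raw_bytes) for every .txt/.md document under path."""
    path = Path(path)
    if path.is_dir():
        # Folder: walk recursively, in a stable order.
        for p in sorted(path.rglob("*")):
            if p.is_file() and p.suffix.lower() in TEXT_SUFFIXES:
                yield p.relative_to(path).as_posix(), p.read_bytes()
    elif path.suffix.lower() == ".zip":
        # Archive: members at any depth are considered.
        with zipfile.ZipFile(path) as zf:
            for info in sorted(zf.infolist(), key=lambda i: i.filename):
                is_text = Path(info.filename).suffix.lower() in TEXT_SUFFIXES
                if not info.is_dir() and is_text:
                    yield info.filename, zf.read(info)
    elif path.suffix.lower() in TEXT_SUFFIXES:
        # Single file.
        yield path.name, path.read_bytes()
    else:
        raise ValueError(f"Unsupported corpus path: {path}")
```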

Examples of folder layouts that work:

Single corpus

my_corpus/
  speech1.txt
  speech2.txt
  speech3.txt

Multiple corpora

corpora/
  A/
    doc1.txt
    doc2.txt
  B/
    doc3.txt
    doc4.txt

You can point each corpus to its subfolder:

  • --corpus A=corpora/A --corpus B=corpora/B
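
For readers curious how a repeatable LABEL=PATH option can be parsed, here is a minimal argparse sketch. It illustrates the argument shape only. The helper names parse_corpus and build_parser are hypothetical, and the real stylometry_run.py may parse its arguments differently.

```python
import argparse


def parse_corpus(value):
    """Split 'LABEL=PATH' on the first '=' so paths may contain '='."""
    label, sep, path = value.partition("=")
    if not sep or not label or not path:
        raise argparse.ArgumentTypeError(f"expected LABEL=PATH, got {value!r}")
    return label, path


def build_parser():
    parser = argparse.ArgumentParser(prog="stylometry_run.py")
    parser.add_argument(
        "--corpus",
        action="append",      # the option may be given several times
        type=parse_corpus,
        required=True,
        metavar="LABEL=PATH",
    )
    return parser


args = build_parser().parse_args(
    ["--corpus", "A=corpora/A", "--corpus", "B=corpora/B"]
)
corpora = dict(args.corpus)  # {'A': 'corpora/A', 'B': 'corpora/B'}
```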

Run examples

1) Characterize a single document

python stylometry_run.py --task characterize --corpus TextA=./speech1.txt --output ./out_textA

2) Build a profile from many documents (single corpus)

python stylometry_run.py --task profile_build --corpus PersonX=./my_corpus --output ./out_personx

3) Compare two corpora

python stylometry_run.py --task compare --corpus A=./corpora/A --corpus B=./corpora/B --output ./out_compare

4) Use zip archives

python stylometry_run.py --task compare --corpus A=./A.zip --corpus B=./B.zip --output ./out_compare_zip

Outputs

The output folder contains:

  • manifest.json — corpus manifest (doc list + word counts + local provenance paths)
  • doc_metrics.csv — per-document metrics
  • chunk_metrics.csv — per-chunk metrics
  • ResultBundle_ArtifactExtractor.json — SAO-compatible bundle describing artifacts produced
  • run_metadata.json — parameters and reproducibility info

If two or more corpora are provided and there are enough chunks:

  • corpus_similarity_char_ngrams.csv
  • chunk_assignments_char_ngrams.csv
  • ResultBundle_Comparator.json

If matplotlib is installed and working, it also saves:

  • plot_avg_sentence_len_boxplot.png
  • plot_mattr_boxplot.png
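
After a run, a quick way to see which of the files listed above were actually produced is to check the output folder. The helper below is a convenience sketch. The function name summarize_outputs and the grouping labels are assumptions, while the file names are taken from the lists above.

```python
from pathlib import Path

CORE = [
    "manifest.json",
    "doc_metrics.csv",
    "chunk_metrics.csv",
    "ResultBundle_ArtifactExtractor.json",
    "run_metadata.json",
]
COMPARISON = [
    "corpus_similarity_char_ngrams.csv",
    "chunk_assignments_char_ngrams.csv",
    "ResultBundle_Comparator.json",
]
PLOTS = [
    "plot_avg_sentence_len_boxplot.png",
    "plot_mattr_boxplot.png",
]


def summarize_outputs(output_dir):
    """Map each group to {file name: True if present in output_dir}."""
    out = Path(output_dir)
    groups = {"core": CORE, "comparison": COMPARISON, "plots": PLOTS}
    return {
        group: {name: (out / name).is_file() for name in names}
        for group, names in groups.items()
    }
```

For example, summarize_outputs("./out_compare")["comparison"] shows at a glance whether the comparison step ran, which it only does when there are two or more corpora and enough chunks.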

Useful options

  • --chunk-words 1200 — set chunk size (default 1200)
  • --mattr-window 500 — MATTR window size (default 500)
  • --function-words-file path.txt — override function word list (newline-delimited)
  • --include-chunk-text — include chunk text in chunk_metrics.csv (can be large)
  • --char-analyzer char_wb|char — default char_wb (often better for stylometry)
  • --max-features 50000 and --min-df 2 — control n-gram feature size
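
As a rough picture of what --chunk-words and --function-words-file control, the sketch below splits a token list into fixed-size chunks and parses a newline-delimited word list. How the tool treats a short final chunk is not documented here, so keeping it is an assumption, as are the helper names chunk_words and parse_function_words.

```python
def chunk_words(words, chunk_size=1200):
    """Split a token list into consecutive chunks of chunk_size words.

    Assumption: a shorter final chunk is kept rather than dropped or merged.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def parse_function_words(text):
    """Parse a newline-delimited word list, skipping blanks and duplicates."""
    seen, result = set(), []
    for line in text.splitlines():
        word = line.strip().lower()
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```

Smaller chunks give more data points for the comparison step but noisier per-chunk metrics. A chunk shorter than the MATTR window yields only a single window for that measure.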

Notes for political/public-figure corpora

Prepared remarks and official publications can reflect speechwriters, staff editing, or transcript normalization. Use “channel-specific” corpora where possible (e.g., floor speeches vs press releases vs prepared remarks).

Troubleshooting

  • If plots aren’t produced: ensure matplotlib is installed and that you have write permission for the output folder.
  • If you see Unicode errors: convert the source files to UTF-8. Otherwise, the script falls back to more forgiving decoding, which may garble some characters.
  • If it’s slow on huge corpora: increase --min-df, reduce --max-features, or reduce corpus size.
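
The "forgiving decodes" fallback mentioned above might look something like the following. This is a sketch of the general technique only. The encoding order and the function name decode_lenient are assumptions, and the tool's actual fallback may differ.

```python
def decode_lenient(raw, encodings=("utf-8-sig", "cp1252")):
    """Try strict decodes in order, then fall back to replacement characters.

    Returns (text, description of the decoding that succeeded).
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: never fails, but undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (replace)"
```

Replacement characters alter character n-gram counts, so converting sources to UTF-8 up front keeps the metrics cleaner than relying on the fallback.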
