Quantitative author fingerprinting & stylometric analysis - offline CLI tool
Stylometry CLI (local/offline) — v1.0
This is a small, offline Python tool to extract stylometric artifacts/patterns from text and optionally compute simple similarity signals between corpora using character n-grams.
It’s designed to slot into your Stylometry Orchestrator workflow by emitting SAO-style ResultBundle_*.json files plus CSV artifacts.
What it does
For each document (and each chunk of a document), it computes:
- Lexical
  - word count, unique word count
  - average word length
  - MATTR lexical diversity (more length-robust than raw TTR)
- Syntactic (proxy)
  - average sentence length
  - sentence length variation (population SD)
- Habitual
  - function word frequencies (configurable list)
  - punctuation rates (commas, semicolons, etc., per 1000 words and per sentence)
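To make the metrics above concrete, here is a minimal sketch of how MATTR, sentence-length statistics, and a punctuation rate could be computed. The tokenizer, the sentence splitter, and all function names are assumptions for illustration; the tool's own implementation may differ.

```python
import re
import statistics

# Assumed tokenization rules (hypothetical, not taken from the tool).
WORD_RE = re.compile(r"[A-Za-z0-9']+")
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence splitter


def tokenize(text):
    """Lowercased word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def mattr(tokens, window=500):
    """Moving-average type-token ratio: mean TTR over every sliding window."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def sentence_stats(text):
    """Return (mean sentence length in words, population SD)."""
    sentences = [s for s in SENT_SPLIT_RE.split(text.strip()) if s]
    lengths = [n for n in (len(tokenize(s)) for s in sentences) if n > 0]
    if not lengths:
        return 0.0, 0.0
    return statistics.fmean(lengths), statistics.pstdev(lengths)


def punctuation_per_1000_words(text, mark=","):
    """Occurrences of one punctuation mark per 1000 words."""
    n_words = len(tokenize(text))
    return 1000.0 * text.count(mark) / n_words if n_words else 0.0
```

Because every window has the same size, MATTR stays comparable between a short chunk and a long document, which is why it is preferred over raw TTR here.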
If 2+ corpora are provided and there are enough chunks, it also computes:
- Char n-gram TF-IDF centroid cosine similarity across corpora (corpus_similarity_char_ngrams.csv)
- Nearest-centroid chunk assignment (chunk_assignments_char_ngrams.csv)
Note: these are signals, not definitive authorship proof. Topic/genre/boilerplate can dominate.
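The two comparison signals can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the tool's code: the function name, the (3, 5) n-gram range, and the decision to fit one vectorizer over all chunks are hypothetical choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def centroid_similarity(chunks, labels, analyzer="char_wb", ngram_range=(3, 5)):
    """Return (corpus names, centroid similarity matrix, nearest corpus per chunk).

    chunks: list of chunk texts; labels: corpus label for each chunk.
    """
    vectorizer = TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range)
    X = vectorizer.fit_transform(chunks)  # sparse, shape (n_chunks, n_features)

    names = sorted(set(labels))
    labels_arr = np.array(labels)

    # One centroid per corpus: the mean TF-IDF vector of its chunks.
    centroids = np.vstack([
        np.asarray(X[np.flatnonzero(labels_arr == name)].mean(axis=0)).ravel()
        for name in names
    ])

    sim = cosine_similarity(centroids)                       # corpus x corpus
    nearest = cosine_similarity(X, centroids).argmax(axis=1)  # chunk -> corpus
    assigned = [names[i] for i in nearest]
    return names, sim, assigned
```

In this sketch each chunk also contributes to its own corpus centroid, so the assignments are optimistic; holding the chunk out before computing the centroid would give a stricter signal.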
Requirements
- Windows, macOS, or Linux
- Python 3.12+
- pip, to install the dependencies listed in requirements.txt
Install (Windows PowerShell)
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
Quick check:
python -c "import numpy, pandas, sklearn; print('ok')"
Input formats
You provide one or more --corpus LABEL=PATH arguments.
PATH can be:
- a single .txt/.md file
- a folder containing .txt/.md files (recursively)
- a .zip archive containing .txt/.md files (recursively)
Examples of folder layouts that work:
Single corpus
my_corpus/
  speech1.txt
  speech2.txt
  speech3.txt
Multiple corpora
corpora/
  A/
    doc1.txt
    doc2.txt
  B/
    doc3.txt
    doc4.txt
You can point each corpus to its subfolder:
--corpus A=corpora/A --corpus B=corpora/B
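The input rules above can be sketched in a few lines of standard-library Python. The helper names parse_corpus_arg and list_documents are hypothetical, and the tool's real discovery logic (ordering, hidden files, nested archives) is not documented here, so treat this as an approximation.

```python
import zipfile
from pathlib import Path

TEXT_SUFFIXES = {".txt", ".md"}


def parse_corpus_arg(value):
    """Split 'LABEL=PATH' on the first '=' only, so the path may contain '='."""
    label, sep, path = value.partition("=")
    if not sep or not label or not path:
        raise ValueError(f"expected LABEL=PATH, got {value!r}")
    return label, Path(path)


def list_documents(path):
    """Return sorted document names found under a file, folder, or .zip."""
    path = Path(path)
    if path.is_dir():
        # Recursive walk; keep only .txt/.md files.
        return sorted(
            p.relative_to(path).as_posix()
            for p in path.rglob("*")
            if p.is_file() and p.suffix.lower() in TEXT_SUFFIXES
        )
    if path.suffix.lower() == ".zip":
        with zipfile.ZipFile(path) as zf:
            # namelist() covers nested folders; directory entries have no suffix.
            return sorted(
                name for name in zf.namelist()
                if Path(name).suffix.lower() in TEXT_SUFFIXES
            )
    if path.suffix.lower() in TEXT_SUFFIXES:
        return [path.name]
    raise ValueError(f"unsupported corpus path: {path}")
```

For example, parse_corpus_arg("A=corpora/A") yields the label "A" and the path corpora/A, which list_documents then expands into the documents of that corpus.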
Run examples
1) Characterize a single document
python stylometry_run.py --task characterize --corpus TextA=./speech1.txt --output ./out_textA
2) Build a profile from many documents (single corpus)
python stylometry_run.py --task profile_build --corpus PersonX=./my_corpus --output ./out_personx
3) Compare two corpora
python stylometry_run.py --task compare --corpus A=./corpora/A --corpus B=./corpora/B --output ./out_compare
4) Use zip archives
python stylometry_run.py --task compare --corpus A=./A.zip --corpus B=./B.zip --output ./out_compare_zip
Outputs
The output folder contains:
- manifest.json: corpus manifest (doc list + word counts + local provenance paths)
- doc_metrics.csv: per-document metrics
- chunk_metrics.csv: per-chunk metrics
- ResultBundle_ArtifactExtractor.json: SAO-compatible bundle describing artifacts produced
- run_metadata.json: parameters and reproducibility info
If 2+ corpora and enough chunks:
- corpus_similarity_char_ngrams.csv
- chunk_assignments_char_ngrams.csv
- ResultBundle_Comparator.json
If matplotlib is installed and working, it also saves:
- plot_avg_sentence_len_boxplot.png
- plot_mattr_boxplot.png
Useful options
- --chunk-words 1200: set chunk size (default 1200)
- --mattr-window 500: MATTR window size (default 500)
- --function-words-file path.txt: override function word list (newline-delimited)
- --include-chunk-text: include chunk text in chunk_metrics.csv (can be large)
- --char-analyzer char_wb|char: default char_wb (often better for stylometry)
- --max-features 50000 and --min-df 2: control n-gram feature size
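As a rough illustration of what --chunk-words and --function-words-file control, the sketch below splits a text into fixed-size word chunks and parses a newline-delimited word list. Both helper names are hypothetical, and two behaviors are assumptions: the trailing short chunk is kept, and list entries are lowercased. The tool may handle either differently.

```python
def chunk_words(text, chunk_size=1200):
    """Split text into consecutive chunks of at most chunk_size words.

    Assumption: the final, shorter chunk is kept rather than dropped or merged.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def load_function_words(lines):
    """Parse a newline-delimited function word list, skipping blank lines.

    Assumption: entries are lowercased and surrounding whitespace is stripped.
    """
    return [line.strip().lower() for line in lines if line.strip()]
```

Smaller chunks give more data points per corpus but noisier per-chunk metrics, so the chunk size is worth tuning together with the MATTR window.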
Notes for political/public-figure corpora
Prepared remarks and official publications can reflect speechwriters, staff editing, or transcript normalization. Use “channel-specific” corpora where possible (e.g., floor speeches vs press releases vs prepared remarks).
Troubleshooting
- If plots aren’t produced: ensure matplotlib is installed and you have write permission to the output folder.
- If you hit Unicode errors: convert source files to UTF-8; otherwise the script falls back to forgiving decodes.
- If it’s slow on huge corpora: increase --min-df, reduce --max-features, or reduce corpus size.
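A "forgiving decode" fallback typically looks like the sketch below. The encoding order and the function name are assumptions, since the script's actual fallback chain is not documented here.

```python
def read_text_forgiving(data, encodings=("utf-8-sig", "cp1252")):
    """Decode bytes, trying strict encodings first, then replacing bad bytes.

    Returns (text, encoding_used). 'utf-8-sig' also accepts plain UTF-8 and
    strips a leading byte-order mark if one is present.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: never fail, but mark undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (replace)"
```

A fallback like this keeps a run from crashing, but replaced or mis-decoded characters can skew character n-gram features, so converting sources to UTF-8 up front remains the safer route.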