Next-generation computational stylometry — a Python replacement for R's Stylo.
bitig ("writing, inscription, charter" — from Old Turkic) is a Python package and
interactive CLI for authorship attribution, author-group style comparison, and
forensic-linguistic analysis. It reimplements the analytical breadth of R's Stylo,
then adds a modern NLP pipeline (spaCy, transformer embeddings), a Bayesian layer
(PyMC), and a full forensic-evidential toolkit on top.
Named after the bitig, the Turkic word for writing / inscription — the kind chiselled into the 8th-century Orkhon stelae. A bitig was a recorded text bearing a writer's hand; this package looks for that hand.
Architecture
Every layer is sklearn-compatible; every Result carries full provenance (corpus hash,
feature hash, seed, spaCy version, timestamp, resolved config), so a study written as
study.yaml is reproducible to the exact random draw years later.
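The corpus-hash half of that provenance story can be sketched with the standard library. This is an illustrative sketch only — `corpus_hash` is a hypothetical name, and bitig's actual provenance fields may be computed differently — but it shows the key property a content-addressed hash needs: it depends only on the file names and contents, not on ingestion order or machine.

```python
import hashlib

def corpus_hash(texts: dict) -> str:
    """Content-addressed corpus hash (illustrative, not bitig's API).

    `texts` maps filename -> raw document text. Files are folded in
    canonical (sorted) order so the digest is order-insensitive.
    """
    h = hashlib.sha256()
    for name in sorted(texts):
        h.update(name.encode("utf-8"))
        h.update(hashlib.sha256(texts[name].encode("utf-8")).digest())
    return h.hexdigest()

a = corpus_hash({"x.txt": "hello", "y.txt": "world"})
b = corpus_hash({"y.txt": "world", "x.txt": "hello"})  # same corpus, different order
assert a == b
```

Because the digest folds in each file's own hash, changing a single byte in any document changes the corpus hash, which is what lets a Result detect that it is being re-run against different source material.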
Install
```shell
uv pip install bitig
python -m spacy download en_core_web_trf
```
Optional extras:
```shell
uv pip install "bitig[bayesian]"    # PyMC + arviz for hierarchical models
uv pip install "bitig[embeddings]"  # sentence-transformers + contextual BERT
uv pip install "bitig[viz]"         # plotly, kaleido, ete3
uv pip install "bitig[reports]"     # weasyprint for PDF export
uv pip install "bitig[turkish]"     # spacy-stanza + Stanza for Turkish pipelines
```
Quickstart
```shell
bitig init my-study
cd my-study
# drop .txt files into corpus/
# add metadata.tsv mapping filename → author, group, year, ...
bitig ingest corpus/ --metadata corpus/metadata.tsv
bitig info
bitig run study.yaml --name demo
bitig report results/demo --output results/demo/report.html
```
A complete beginner-friendly walkthrough using 9 Federalist Papers (including the disputed
No. 50) lives at examples/quickstart/. The full 85-paper analysis
reproducing the classic Mosteller & Wallace (1964) result is at
examples/federalist/.
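The feature side of a study like the Federalist one can be sketched in a few lines of plain Python: build a most-frequent-word (MFW) vocabulary over the corpus, then turn each document into a relative-frequency profile. Function names here (`top_mfw`, `mfw_profile`) are illustrative, not bitig's API.

```python
from collections import Counter
import re

def top_mfw(texts, n):
    """Most frequent words across the whole corpus (the MFW vocabulary)."""
    pool = Counter()
    for t in texts:
        pool.update(re.findall(r"[a-z']+", t.lower()))
    return [w for w, _ in pool.most_common(n)]

def mfw_profile(text, vocab):
    """Relative frequency of each MFW word in one document."""
    toks = re.findall(r"[a-z']+", text.lower())
    counts = Counter(toks)
    total = len(toks) or 1
    return [counts[w] / total for w in vocab]

corpus = ["the cat sat on the mat", "the dog and the cat", "a dog on a log"]
vocab = top_mfw(corpus, 3)
profiles = [mfw_profile(t, vocab) for t in corpus]
```

Everything downstream — Delta distances, PCA, clustering — operates on profile vectors of exactly this shape; in a real study the vocabulary is typically the top 100–1000 MFW rather than 3.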
Desktop GUI
If you'd rather click than write YAML, bitig ships a NiceGUI + pywebview desktop shell that walks the same workflow — Ingest → Study → Run → Results — plus a dedicated Forensic tab.
```shell
uv pip install "bitig[gui]"
bitig gui
```
This opens a native window with native file pickers; pass --no-native to fall back to a
browser tab. From the Study page you can pick the method (Burrows/Cosine/Argamon/…
Delta, PCA/MDS/t-SNE/UMAP, Ward/k-means/HDBSCAN, Zeta classic/Eder, bootstrap consensus,
classify, Bayesian) and the feature family (MFW, char/word n-grams, function words,
punctuation, lexical diversity, readability), set parameters, save study.yaml, run, and
view results — figures, parquet tables, and result.json scalars — in one place.
Example output
PCA on 200 MFW, trained on known-author Federalist Papers; the disputed #50 is projected into the same space and lands among the Madison cluster — matching the historical consensus.
Capabilities at a glance
| Layer | What's included |
|---|---|
| Languages | EN, TR, DE, ES, FR — first-class. Turkish via Stanford Stanza (BOUN) through spacy-stanza; the rest via official spaCy _trf pipelines. Per-language function words, readability formulas, and contextual/sentence embedding defaults |
| Corpus | .txt ingestion + TSV metadata, strict/lenient mode, filter, groupby, content-addressed hashing, language-stamped |
| Features | MFW, char n-grams, word n-grams, POS n-grams, dependency bigrams, function words, punctuation, readability (six English formulas + native TR/DE/ES/FR), sentence length, lexical diversity (eight indices), sentence + contextual embeddings |
| Methods | Burrows / Eder / Argamon / Cosine / Quadratic Delta; Zeta (classic + Eder); PCA / MDS / UMAP / t-SNE; Ward / k-means / HDBSCAN; bootstrap consensus trees; sklearn classify with stylometry-aware CV; Bayesian Mosteller–Wallace + hierarchical group comparison |
| Forensic | General Impostors verification, Unmasking, Stamatatos distortion, Sapkota char-n-gram categories, CalibratedScorer, log-LR + C_llr + AUC + c@1 + F0.5u + ECE + Brier + Tippett, PANReport, chain-of-custody Provenance, LR-framed HTML report (ENFSI / Nordgaard verbal scale) |
| Output | Uniform Result record → JSON + Parquet + figures; Jinja2 HTML / Markdown report; publication-grade matplotlib with 300-DPI colourblind palette |
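The Delta family in the table above reduces to a simple idea: z-score each MFW frequency against the reference corpus, then compare two documents by the mean absolute difference of their z-scores (Burrows 2002). A minimal sketch, not bitig's implementation:

```python
import statistics

def burrows_delta(target, candidate, corpus_profiles):
    """Burrows's Delta: mean |z(target) - z(candidate)| over MFW features.

    `corpus_profiles` holds relative MFW frequencies for the reference
    corpus; z-scores use the per-feature corpus mean and sample stdev.
    """
    diffs = []
    for j in range(len(target)):
        col = [p[j] for p in corpus_profiles]
        mu = statistics.mean(col)
        sd = statistics.stdev(col) or 1.0   # guard constant features
        diffs.append(abs((target[j] - mu) / sd - (candidate[j] - mu) / sd))
    return statistics.mean(diffs)

ref = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
d = burrows_delta([0.1, 0.2], [0.3, 0.3], ref)
```

Eder's, Argamon's, and Cosine Delta vary the weighting or the distance applied to the same z-scored vectors, which is why they share one feature pipeline.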
Multi-language support
Five first-class languages behind a single bitig.languages registry — English, Turkish,
German, Spanish, French. Language flows through Corpus.language and drives per-language
defaults for function words, readability formulas, and embedding models:
```shell
uv pip install "bitig[turkish]"
python -c "import stanza; stanza.download('tr')"
bitig init demo-tr --language tr
bitig ingest corpus/ --language tr --metadata corpus/metadata.tsv
bitig run study.yaml --name first-run
```
Turkish parsing goes through Stanford Stanza (BOUN treebank) wrapped by spacy-stanza — it
returns native spaCy Doc objects, so every bitig feature extractor works unchanged. Native
readability formulas are implemented for each language (Ateşman + Bezirci–Yılmaz for Turkish,
Flesch-Amstad + Wiener Sachtextformel for German, Fernández-Huerta + Szigriszt-Pazos for
Spanish, Kandel–Moles + LIX for French). Function-word lists are regenerated reproducibly from
Universal Dependencies treebanks via scripts/regenerate_function_words.py. See
docs/site/concepts/languages.md and the
Turkish tutorial.
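As an example of a native formula, Ateşman's Turkish readability score can be sketched using its published coefficients, 198.825 − 40.175 × (syllables/word) − 2.610 × (words/sentence). The vowel-count syllable heuristic is an assumption that holds for standard Turkish orthography (one vowel per syllable); this is an illustrative sketch, not bitig's implementation.

```python
import re

TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def atesman_readability(text):
    """Atesman (1997) Turkish readability.

    Score = 198.825 - 40.175 * (syllables/word) - 2.610 * (words/sentence).
    In Turkish each syllable contains exactly one vowel, so syllables
    are counted by counting vowels.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    syllables = sum(ch in TURKISH_VOWELS for w in words for ch in w)
    return (198.825
            - 40.175 * (syllables / len(words))
            - 2.610 * (len(words) / len(sentences)))

score = atesman_readability("Bu bir deneme. Kedi geldi.")
```

Higher scores mean easier text; the other per-language formulas (Bezirci–Yılmaz, Wiener Sachtextformel, Fernández-Huerta, LIX, …) follow the same pattern of language-specific coefficients over word, syllable, and sentence counts.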
Forensic toolkit
Forensic authorship research needs more than attribution — it needs one-class
verification, topic-invariant features, and evidential output framed as a
likelihood ratio. bitig.forensic ships these as a cohesive layer on top of the analysis
methods:
```python
from bitig.forensic import (
    GeneralImpostors, Unmasking,     # verification
    CategorizedCharNgramExtractor,   # Sapkota 2015 topic-invariant features
    distort_corpus,                  # Stamatatos 2013 content masking
    CalibratedScorer,                # Platt / isotonic calibration
    compute_pan_report,              # AUC + c@1 + F0.5u + Brier + ECE + (cllr)
)
from bitig.report import build_forensic_report  # LR-framed report template
```
Every forensic method is classifier-agnostic — pair it with any bitig feature set and any
Delta / Zeta / classify method. Every Result can carry six optional chain-of-custody
metadata fields (questioned_description, known_description, hypothesis_pair,
acquisition_notes, custody_notes, source_hashes) so a report traces back to its source
material. See src/bitig/forensic/ for the full surface.
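Two of the evaluation metrics above are compact enough to sketch directly: c@1 follows Peñas & Rodrigo (2011) and rewards leaving hard cases unanswered, while C_llr follows Brümmer & du Preez (2006) and scores likelihood-ratio output. Function names are illustrative, not bitig's API.

```python
import math

def c_at_1(decisions, truths):
    """c@1: accuracy credit for abstentions (Penas & Rodrigo 2011).

    `decisions` are 0/1 answers or None for "don't know".
    c@1 = (n_correct + n_unanswered * n_correct / n) / n
    """
    n = len(decisions)
    nc = sum(1 for d, t in zip(decisions, truths) if d is not None and d == t)
    na = sum(1 for d in decisions if d is None)
    return (nc + na * nc / n) / n

def cllr(lr_same, lr_diff):
    """Cllr: cost of log-likelihood-ratios (Brummer & du Preez 2006).

    lr_same: LRs for same-author pairs; lr_diff: for different-author pairs.
    0 is perfect; 1 is the cost of an uninformative system (LR = 1).
    """
    p1 = sum(math.log2(1 + 1 / lr) for lr in lr_same) / len(lr_same)
    p2 = sum(math.log2(1 + lr) for lr in lr_diff) / len(lr_diff)
    return 0.5 * (p1 + p2)
```

A system that always emits LR = 1 scores exactly C_llr = 1.0, and a confidently wrong LR is penalised far more than an abstention — which is why forensic reporting prefers calibrated LRs over raw classifier scores.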
Documentation
Full MkDocs Material site — https://fatihbozdag.github.io/bitig/
- Getting started — install, five-command quickstart
- Concepts — Corpus / Features / Languages / Methods / Results & provenance
- Languages — EN / TR / DE / ES / FR registry, adding a sixth language
- Forensic toolkit — verification, calibration, topic-invariance, PAN evaluation, reporting
- Turkish tutorial — end-to-end Turkish authorship walkthrough
- PAN-CLEF verification tutorial — end-to-end runnable pipeline
- Federalist tutorial — reproduce Mosteller & Wallace (1964)
- CLI + API reference
Serve locally:
```shell
uv pip install "bitig[docs]"
mkdocs serve  # http://127.0.0.1:8000
```
CI (.github/workflows/docs.yml) builds the site strictly on every push + PR and
deploys to GitHub Pages on every merge to main.
Status
- Phase 5 landed — visualisation, Jinja2 reports, declarative runner (bitig run), and a Rich-based interactive bitig shell.
- Forensic phase landed — six additions: General Impostors, LR + calibration + evaluation metrics, Sapkota categories + Stamatatos distortion, Unmasking, chain-of-custody + forensic report template, PAN harness.
- Multi-language phase landed — first-class support for English, Turkish, German, Spanish, French behind a bitig.languages registry. Turkish parses through Stanford Stanza (BOUN treebank) via spacy-stanza, returning native spaCy Doc objects so every feature extractor works unchanged. Native readability formulas per language (Ateşman + Bezirci–Yılmaz for Turkish, Flesch-Amstad + Wiener Sachtextformel for German, Fernández-Huerta + Szigriszt-Pazos for Spanish, Kandel–Moles + LIX for French). Function-word lists generated reproducibly from UD closed-class tokens.
- Docs site landed — MkDocs Material site with Concepts, Forensic toolkit, Federalist + PAN-CLEF + Turkish tutorials, and CLI/API reference. 417 tests passing.
- Docs site is multilingual — English (default) and Turkish (/tr/) launched via mkdocs-static-i18n; DE/ES/FR infrastructure ready, translation content deferred.
- Remaining — PyPI publish.
See docs/superpowers/specs/2026-04-17-bitig-stylometry-package-design.md for the full design.
License
BSD-3-Clause. See LICENSE.
Citation
If you use bitig in published work, please cite it — see CITATION.cff.
References
The forensic toolkit implements methods from the following peer-reviewed sources:
- Koppel, M., & Winter, Y. (2014). Determining if two documents are written by the same author. JASIST, 65(1), 178–187.
- Koppel, M., & Schler, J. (2004). Authorship verification as a one-class classification problem. Proceedings of ICML 2004, 489–495.
- Sapkota, U., Bethard, S., Montes-y-Gómez, M., & Solorio, T. (2015). Not all character n-grams are created equal. Proceedings of NAACL-HLT 2015, 93–102.
- Stamatatos, E. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21(2), 421–439.
- Brümmer, N., & du Preez, J. (2006). Application-independent evaluation of speaker detection. Computer Speech & Language, 20(2–3), 230–275.
- Peñas, A., & Rodrigo, A. (2011). A simple measure to assess non-response. Proceedings of ACL-HLT 2011, 1415–1424.
- Platt, J. C. (1999). Probabilistic outputs for SVMs and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 61–74.
- ENFSI (2015). Guideline for evaluative reporting in forensic science.
- Nordgaard, A., Ansell, R., Drotz, W., & Jaeger, L. (2012). Scale of conclusions for the value of evidence. Law, Probability and Risk, 11(1), 1–24.