
Next-generation computational stylometry — a Python replacement for R's Stylo.


bitig — computational stylometry



bitig ("writing, inscription, charter" — from Old Turkic) is a Python package and interactive CLI for authorship attribution, author-group style comparison, and forensic-linguistic analysis. It reimplements the analytical breadth of R's Stylo, then adds a modern NLP pipeline (spaCy, transformer embeddings), a Bayesian layer (PyMC), and a full forensic-evidential toolkit on top.

Named after the bitig, the Turkic word for writing / inscription — the kind chiselled into the 8th-century Orkhon stelae. A bitig was a recorded text bearing a writer's hand; this package looks for that hand.

Architecture

bitig architecture: corpus → features → methods → forensic → output

Every layer is sklearn-compatible; every Result carries full provenance (corpus hash, feature hash, seed, spaCy version, timestamp, resolved config), so a study written as study.yaml is reproducible to the exact random draw years later.
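The content-addressed hashing behind that provenance can be sketched with the standard library alone; the field names below are illustrative, not bitig's actual Result schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of_texts(texts):
    """Content-address a corpus: hash each text, then hash the sorted digests,
    so the result is independent of ingestion order."""
    digests = sorted(hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts)
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()

corpus = ["To the People of the State of New York ...",
          "It was shown in the last paper ..."]
provenance = {
    "corpus_hash": sha256_of_texts(corpus),   # changes iff any text changes
    "seed": 42,
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
# The same texts in any order yield the same corpus_hash:
assert sha256_of_texts(list(reversed(corpus))) == provenance["corpus_hash"]
print(json.dumps(provenance, indent=2))
```

Because the hash covers content rather than filenames or order, a re-run years later can verify it is analysing byte-identical material before replaying the seeded random draws.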

Install

uv pip install bitig
python -m spacy download en_core_web_trf

Optional extras:

uv pip install "bitig[bayesian]"    # PyMC + arviz for hierarchical models
uv pip install "bitig[embeddings]"  # sentence-transformers + contextual BERT
uv pip install "bitig[viz]"         # plotly, kaleido, ete3
uv pip install "bitig[reports]"     # weasyprint for PDF export
uv pip install "bitig[turkish]"     # spacy-stanza + Stanza for Turkish pipelines

Quickstart

bitig init my-study
cd my-study
# drop .txt files into corpus/
# add metadata.tsv mapping filename → author, group, year, ...
bitig ingest corpus/ --metadata corpus/metadata.tsv
bitig info
bitig run study.yaml --name demo
bitig report results/demo --output results/demo/report.html

A complete beginner-friendly walkthrough using 9 Federalist Papers (including the disputed No. 50) lives at examples/quickstart/. The full 85-paper analysis reproducing the classic Mosteller & Wallace (1964) result is at examples/federalist/.

Desktop GUI

If you'd rather click than write YAML, bitig ships a NiceGUI + pywebview desktop shell that walks the same workflow — Ingest → Study → Run → Results — plus a dedicated Forensic tab.

uv pip install "bitig[gui]"
bitig gui

This opens a native window with OS file pickers; pass --no-native to fall back to a browser tab. From the Study page you can pick a method (Burrows/Cosine/Argamon/… Delta, PCA/MDS/t-SNE/UMAP, Ward/k-means/HDBSCAN, Zeta classic/Eder, bootstrap consensus, classify, Bayesian) and a feature family (MFW, char/word n-grams, function words, punctuation, lexical diversity, readability), set parameters, save study.yaml, run, and view the results (figures, Parquet tables, and result.json scalars) in one place.

Example output

PCA on 200 MFW, trained on known-author Federalist Papers; the disputed #50 is projected into the same space and lands among the Madison cluster — matching the historical consensus.

PCA of Hamilton vs Madison Federalist papers; disputed paper #50 projected as "Unknown"
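The key mechanic in that figure is that the disputed paper never influences the fit: the centering and components come from the known-author texts only, and the questioned text is passed through the same transform. A plain-Python sketch of the projection step, with made-up means, components, and frequencies:

```python
def project(vec, mean, components):
    """Center a feature vector with training-set means, then take dot
    products with precomputed principal axes."""
    centered = [v - m for v, m in zip(vec, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]

mean = [0.030, 0.021, 0.012]                       # per-MFW means from the known texts
components = [[0.8, -0.5, 0.3], [0.2, 0.6, -0.7]]  # two "principal axes" (illustrative)
disputed = [0.034, 0.018, 0.011]                   # MFW frequencies of the questioned text

print(project(disputed, mean, components))  # 2-D coordinates in the fitted space
```

Where those coordinates land relative to the known-author clusters is what the plot shows.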

Capabilities at a glance

Languages: EN, TR, DE, ES, FR, all first-class. Turkish via Stanford Stanza (BOUN treebank) through spacy-stanza; the rest via official spaCy _trf pipelines. Per-language function words, readability formulas, and contextual/sentence-embedding defaults.
Corpus: .txt ingestion with TSV metadata, strict/lenient mode, filter, groupby, content-addressed hashing, language stamping.
Features: MFW, char n-grams, word n-grams, POS n-grams, dependency bigrams, function words, punctuation, readability (six English formulas plus native TR/DE/ES/FR ones), sentence length, lexical diversity (eight indices), sentence and contextual embeddings.
Methods: Burrows / Eder / Argamon / Cosine / Quadratic Delta; Zeta (classic and Eder); PCA / MDS / UMAP / t-SNE; Ward / k-means / HDBSCAN; bootstrap consensus trees; sklearn classification with stylometry-aware CV; Bayesian Mosteller–Wallace and hierarchical group comparison.
Forensic: General Impostors verification, Unmasking, Stamatatos distortion, Sapkota char-n-gram categories, CalibratedScorer, log-LR with C_llr, AUC, c@1, F0.5u, ECE, Brier, and Tippett plots, PANReport, chain-of-custody Provenance, LR-framed HTML report (ENFSI / Nordgaard verbal scale).
Output: uniform Result record → JSON, Parquet, and figures; Jinja2 HTML / Markdown reports; publication-grade matplotlib with a 300-DPI colourblind-safe palette.
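The Delta family listed above reduces to one recipe: z-score each MFW frequency against the corpus, then average the absolute differences between two texts. A minimal stdlib sketch with toy frequencies (not bitig's API):

```python
from statistics import mean, pstdev

def burrows_delta(freqs_a, freqs_b, corpus_freqs):
    """Burrows's Delta: mean absolute difference of corpus z-scores."""
    mus = [mean(col) for col in zip(*corpus_freqs)]
    sigmas = [pstdev(col) for col in zip(*corpus_freqs)]
    za = [(a - m) / s for a, m, s in zip(freqs_a, mus, sigmas)]
    zb = [(b - m) / s for b, m, s in zip(freqs_b, mus, sigmas)]
    return mean(abs(x - y) for x, y in zip(za, zb))

# Relative frequencies of three MFW ("the", "of", "to") in four known texts:
corpus = [[0.061, 0.035, 0.021],
          [0.058, 0.031, 0.025],
          [0.072, 0.040, 0.019],
          [0.066, 0.043, 0.022]]
d = burrows_delta(corpus[0], corpus[2], corpus)
print(f"delta = {d:.3f}")  # smaller = stylistically closer
```

Eder's, Argamon's, and Cosine Delta vary the weighting and the distance applied to the same z-scored matrix.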

Multi-language support

Five first-class languages behind a single bitig.languages registry — English, Turkish, German, Spanish, French. Language flows through Corpus.language and drives per-language defaults for function words, readability formulas, and embedding models:

uv pip install "bitig[turkish]"
python -c "import stanza; stanza.download('tr')"
bitig init demo-tr --language tr
bitig ingest corpus/ --language tr --metadata corpus/metadata.tsv
bitig run study.yaml --name first-run

Turkish parsing goes through Stanford Stanza (BOUN treebank) wrapped by spacy-stanza — it returns native spaCy Doc objects, so every bitig feature extractor works unchanged. Native readability formulas are implemented for each language (Ateşman + Bezirci–Yılmaz for Turkish, Flesch-Amstad + Wiener Sachtextformel for German, Fernández-Huerta + Szigriszt-Pazos for Spanish, Kandel–Moles + LIX for French). Function-word lists are regenerated reproducibly from Universal Dependencies treebanks via scripts/regenerate_function_words.py. See docs/site/concepts/languages.md and the Turkish tutorial.
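As an illustration of those per-language formulas, the Ateşman score for Turkish is 198.825 − 40.175 × (syllables/word) − 2.610 × (words/sentence); since every Turkish syllable contains exactly one vowel, counting vowels is a standard syllable heuristic. This sketch is not bitig's implementation:

```python
# Naive Ateşman (1997) Turkish readability sketch: higher = easier to read.
VOWELS = set("aeıioöuüAEIİOÖUÜ")

def atesman(text):
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = text.split()
    syllables = sum(ch in VOWELS for ch in text)  # one vowel per syllable
    return (198.825
            - 40.175 * (syllables / len(words))
            - 2.610 * (len(words) / len(sentences)))

print(round(atesman("Bu metin kolay okunur. Kısa cümleler kullanır."), 1))
```

A production version would use proper sentence segmentation (e.g. the spaCy/Stanza pipeline) rather than splitting on punctuation.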

Forensic toolkit

Forensic authorship research needs more than attribution — it needs one-class verification, topic-invariant features, and evidential output framed as a likelihood ratio. bitig.forensic ships these as a cohesive layer on top of the analysis methods:

from bitig.forensic import (
    GeneralImpostors, Unmasking,        # verification
    CategorizedCharNgramExtractor,      # Sapkota 2015 topic-invariant features
    distort_corpus,                     # Stamatatos 2013 content masking
    CalibratedScorer,                   # Platt / isotonic calibration
    compute_pan_report,                 # AUC + c@1 + F0.5u + Brier + ECE + (cllr)
)
from bitig.report import build_forensic_report  # LR-framed report template
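Text distortion in the Stamatatos (2013) style masks topic-bearing words while keeping function words, word lengths, and punctuation, so only style-bearing material survives. A minimal sketch; the word list here is illustrative, not bitig's per-language list:

```python
import re

FUNCTION_WORDS = {"the", "of", "to", "and", "a", "in", "that", "it", "is", "was"}

def distort(text):
    """Replace each non-function word with '*' per character, preserving
    word lengths and punctuation."""
    def mask(m):
        w = m.group(0)
        return w if w.lower() in FUNCTION_WORDS else "*" * len(w)
    return re.sub(r"[A-Za-z]+", mask, text)

print(distort("The powers of the judiciary, in a free government."))
# prints "The ****** of the *********, in a **** **********."
```

Features extracted from distorted text (e.g. char n-grams) are far less sensitive to topic, which matters when questioned and known documents discuss different subjects.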

Every forensic method is classifier-agnostic — pair it with any bitig feature set and any Delta / Zeta / classify method. Every Result can carry six optional chain-of-custody metadata fields (questioned_description, known_description, hypothesis_pair, acquisition_notes, custody_notes, source_hashes) so a report traces back to its source material. See src/bitig/forensic/ for the full surface.
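Among the evaluation metrics, C_llr (Brümmer & du Preez 2006) scores the likelihood ratios themselves rather than hard decisions. A stdlib sketch of the standard formula:

```python
from math import log2

def cllr(same_source_lrs, diff_source_lrs):
    """Log-likelihood-ratio cost: penalises same-author LRs below 1 and
    different-author LRs above 1. 0 = perfect; 1 = a system that always
    outputs the uninformative LR = 1."""
    p_ss = sum(log2(1 + 1 / lr) for lr in same_source_lrs) / len(same_source_lrs)
    p_ds = sum(log2(1 + lr) for lr in diff_source_lrs) / len(diff_source_lrs)
    return 0.5 * (p_ss + p_ds)

# An uninformative system scores exactly 1.0:
print(cllr([1.0, 1.0], [1.0, 1.0]))  # prints 1.0
# A well-calibrated, discriminating system scores much lower:
print(round(cllr([20.0, 8.0, 50.0], [0.05, 0.2, 0.01]), 3))
```

Confident-but-wrong LRs are punished heavily, which is why calibration (e.g. CalibratedScorer) is applied before reporting.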

Documentation

Full MkDocs Material site — https://fatihbozdag.github.io/bitig/

Serve locally:

uv pip install "bitig[docs]"
mkdocs serve             # http://127.0.0.1:8000

CI (.github/workflows/docs.yml) builds the site strictly on every push + PR and deploys to GitHub Pages on every merge to main.

Status

Phase 5 landed — visualisation, Jinja2 reports, declarative runner (bitig run), and a Rich-based interactive bitig shell.

Forensic phase landed — six additions (General Impostors, LR + calibration + evaluation metrics, Sapkota categories + Stamatatos distortion, Unmasking, chain-of-custody + forensic report template, PAN harness).

Multi-language phase landed — first-class support for English, Turkish, German, Spanish, French behind a bitig.languages registry. Turkish parses through Stanford Stanza (BOUN treebank) via spacy-stanza, returning native spaCy Doc objects so every feature extractor works unchanged. Native readability formulas per language (Ateşman + Bezirci–Yılmaz for Turkish, Flesch-Amstad + Wiener Sachtextformel for German, Fernández-Huerta + Szigriszt-Pazos for Spanish, Kandel–Moles + LIX for French). Function-word lists generated reproducibly from UD closed-class tokens.

Docs site landed — MkDocs Material site with Concepts, Forensic toolkit, Federalist + PAN-CLEF + Turkish tutorials, and CLI/API reference. 417 tests passing.

Docs site is multilingual — English (default) and Turkish (/tr/) launched via mkdocs-static-i18n; DE/ES/FR infrastructure ready, translation content deferred.

Remaining — PyPI publish.

See docs/superpowers/specs/2026-04-17-bitig-stylometry-package-design.md for the full design.

License

BSD-3-Clause. See LICENSE.

Citation

If you use bitig in published work, please cite it — see CITATION.cff.

References

The forensic toolkit implements methods from the following peer-reviewed sources:

  • Koppel, M., & Winter, Y. (2014). Determining if two documents are written by the same author. JASIST, 65(1), 178–187.
  • Koppel, M., & Schler, J. (2004). Authorship verification as a one-class classification problem. Proceedings of ICML 2004, 489–495.
  • Sapkota, U., Bethard, S., Montes-y-Gómez, M., & Solorio, T. (2015). Not all character n-grams are created equal. Proceedings of NAACL-HLT 2015, 93–102.
  • Stamatatos, E. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21(2), 421–439.
  • Brümmer, N., & du Preez, J. (2006). Application-independent evaluation of speaker detection. Computer Speech & Language, 20(2–3), 230–275.
  • Peñas, A., & Rodrigo, A. (2011). A simple measure to assess non-response. Proceedings of ACL-HLT 2011, 1415–1424.
  • Platt, J. C. (1999). Probabilistic outputs for SVMs and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 61–74.
  • ENFSI (2015). Guideline for evaluative reporting in forensic science.
  • Nordgaard, A., Ansell, R., Drotz, W., & Jaeger, L. (2012). Scale of conclusions for the value of evidence. Law, Probability and Risk, 11(1), 1–24.
