Python library for calculating the Burrows Delta.

Fast Stylometry Python library

Fast Stylometry - Burrows Delta

Developed by Fast Data Science, https://fastdatascience.com

Source code at https://github.com/fastdatascience/faststylometry

Tutorial at https://fastdatascience.com/fast-stylometry-python-library/

This is a lightweight Python library for calculating Burrows' Delta.

Burrows' Delta is an algorithm for measuring how similar the writing styles of documents are, a task known as forensic stylometry.

A useful explanation of the maths and thinking behind Burrows' Delta and how it works
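As a rough illustration of the idea, here is a minimal sketch that assumes the standard formulation of Burrows' Delta: the mean absolute difference between the z-scored relative frequencies of the most frequent words. The function names, the tiny vocabulary size and the toy token lists are illustrative assumptions and are not part of the faststylometry API, which may differ in details such as tokenisation and the number of words used.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Return the share of `tokens` taken up by each word in `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def burrows_delta(known_texts, unknown_tokens, vocab_size=3):
    """Toy Burrows' Delta; `known_texts` maps an author name to a token list."""
    # 1. Vocabulary: the most frequent words across all known texts.
    all_counts = Counter(tok for toks in known_texts.values() for tok in toks)
    vocabulary = [word for word, _ in all_counts.most_common(vocab_size)]

    # 2. Relative frequency of each vocabulary word for each author.
    freqs = {a: relative_frequencies(t, vocabulary) for a, t in known_texts.items()}

    # 3. Mean and (population) standard deviation of each word across authors.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread

    def z_scores(row):
        return [(f - m) / s for f, m, s in zip(row, means, stdevs)]

    # 4. Delta: average absolute z-score difference (lower = more similar).
    z_unknown = z_scores(relative_frequencies(unknown_tokens, vocabulary))
    return {
        author: mean(abs(zu - za) for zu, za in zip(z_unknown, z_scores(row)))
        for author, row in freqs.items()
    }


# Toy input, for illustration only.
known = {
    "author_a": "the cat and the dog and the bird".split(),
    "author_b": "a cat a dog a bird of a feather".split(),
}
scores = burrows_delta(known, known["author_a"])
# The unknown text is identical to author_a's text, so that delta is 0.0.
```

In practice the vocabulary is much larger (typically the top few dozen to few hundred words) and the texts are whole books, so the z-scores are far more stable than in this toy case.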

Installing Fast Stylometry Python package

You can install from PyPI.

pip install faststylometry

Usage examples

Demonstration of Burrows' Delta on a small corpus downloaded from Project Gutenberg.

We will test the Burrows' Delta code on two "unknown" texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Bronte. Both authors are in our training corpus.

You can get the training corpus by cloning https://github.com/woodthom2/faststylometry; the data is in faststylometry/data.

Create a corpus

To create a corpus and add books, the pattern is as follows:

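from faststylometry.corpus import Corpus

corpus = Corpus()
# The third argument is the whole text of the book as a single string.
corpus.add_book("Jane Austen", "Pride and Prejudice", whole_book_text)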

Here is the pattern for creating a corpus and adding books from a directory on your system. You can also use the method util.load_corpus_from_folder(folder, pattern).

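import os
import re

from faststylometry.corpus import Corpus

# Directory containing files named <author>_-_<title>.txt
folder = "faststylometry/data/train"

corpus = Corpus()
for root, _, files in os.walk(folder):
    for filename in files:
        # Skip anything that cannot be split into an author and a title.
        if filename.endswith(".txt") and "_-_" in filename:
            with open(os.path.join(root, filename), "r", encoding="utf-8") as f:
                text = f.read()
            # Strip the extension, then split once on the "_-_" separator.
            author, book = re.sub(r"\.txt$", "", filename).split("_-_", 1)

            corpus.add_book(author, book, text)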
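The filename convention that the loop relies on can be checked in isolation. The sketch below assumes files are named with the author, then the separator "_-_", then the title, followed by ".txt"; the helper name parse_author_and_title and the example filename are hypothetical and not part of the library.

```python
import re


def parse_author_and_title(filename):
    """Split '<author>_-_<title>.txt' into an (author, title) tuple."""
    stem = re.sub(r"\.txt$", "", filename)  # drop only a trailing ".txt"
    # Split on the first separator so a title containing "_-_" stays whole.
    author, title = stem.split("_-_", 1)
    return author, title


# Hypothetical filename, for illustration only.
author, title = parse_author_and_title("austen_-_pride_and_prejudice.txt")
```

A filename without the separator raises a ValueError when unpacking, which is why the loop filters such files out first.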

Example 1

Load a corpus and calculate Burrows' Delta

from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

train_corpus = load_corpus_from_folder("faststylometry/data/train")

train_corpus.tokenise(tokenise_remove_pronouns_en)

test_corpus_sense_and_sensibility = load_corpus_from_folder("faststylometry/data/test", pattern="sense")

test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)

calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)

This returns a pandas DataFrame of Burrows' Delta scores. A lower score indicates a closer stylistic match.

Example 2

Using the probability calibration functionality, you can calculate the probability of two books being by the same author.

from faststylometry.probability import predict_proba, calibrate
calibrate(train_corpus)
predict_proba(train_corpus, test_corpus_sense_and_sensibility)

This outputs a pandas DataFrame of probabilities.
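To give an intuition for what calibration means here, the sketch below shows one common way to turn a distance such as Burrows' Delta into a probability: fit a one-variable logistic curve to delta values from pairs of texts whose authorship is known. This is an assumption about the general technique, not a description of how calibrate and predict_proba are implemented; the function names and the toy numbers are invented for illustration.

```python
import math


def fit_logistic(deltas, same_author, steps=5000, lr=0.1):
    """Fit p = sigmoid(w * delta + b) by plain gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(deltas)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(deltas, same_author):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def probability_same_author(delta, w, b):
    """Map a delta value to a probability using the fitted curve."""
    return 1.0 / (1.0 + math.exp(-(w * delta + b)))


# Toy calibration set (made-up values): small deltas labelled same author (1),
# large deltas labelled different authors (0).
toy_deltas = [0.4, 0.6, 0.7, 1.3, 1.5, 1.8]
toy_labels = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(toy_deltas, toy_labels)

p_close = probability_same_author(0.5, w, b)  # small delta: above 0.5
p_far = probability_same_author(1.6, w, b)    # large delta: below 0.5
```

The fitted slope w is negative, reflecting that a larger delta makes shared authorship less likely.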

Who to contact

Thomas Wood at Fast Data Science

Contributing to the project

If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our GitHub repository. You can also raise an issue.

Developing the library

Automated tests

Test code is in the tests/ folder and uses unittest.

The testing tool tox is used in the automation with GitHub Actions CI/CD.

Use tox locally

Install tox and run it:

pip install tox
tox

In our configuration, tox runs a check of the source distribution using check-manifest, setuptools's check, and the unit tests using pytest. check-manifest requires your repository to be initialised with git (git init) and its files to be added (git add .) at least. You don't need to install check-manifest or pytest yourself; tox installs them in a separate environment.

The automated tests are run against several Python versions, but on your machine you might have only one. If that is Python 3.9, run:

tox -e py39

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally.

Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package:

uses GitHub Actions for both testing and publishing
is tested on every push to the master or main branch, and is published when a release is created
includes test files in the source distribution
uses setup.cfg for version single-sourcing (setuptools 46.4.0+)

Re-releasing the package manually

The code to re-release Fast Stylometry on PyPI is as follows:

source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*

Who worked on the Fast Stylometry library?

The tool was developed by:

Thomas Wood (Fast Data Science)

License of Fast Stylometry library

Citing the Fast Stylometry library

Wood, T.A., Fast Stylometry [Computer software], Version 1.0.0, accessed at https://fastdatascience.com/fast-stylometry-python-library, Fast Data Science Ltd (2023)

@unpublished{faststylometry,
    AUTHOR = {Wood, T.A.},
    TITLE  = {Fast Stylometry (Computer software), Version 1.0.0},
    YEAR   = {2023},
    Note   = {To appear},
}

Release history

Version	Release date
1.0.4	Sep 15, 2023
1.0.3	Sep 15, 2023
1.0.2	Sep 15, 2023
1.0.1	Aug 1, 2023
1.0.0 (this version)	Aug 1, 2023
0.5	Jan 29, 2021
0.4	Jan 28, 2021
0.3	Jan 27, 2021
0.2	Jan 27, 2021
0.1	Jan 26, 2021

Download files

Source Distribution

faststylometry-1.0.0.tar.gz (12.8 kB), uploaded Aug 1, 2023 (Source)

Built Distribution

faststylometry-1.0.0-py3-none-any.whl (15.1 kB), uploaded Aug 1, 2023 (Python 3)

Hashes for faststylometry-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`ce10d71afcc674f4b953ca593b9cd19d9716c852ed04a5bc7985e9f2fe68b105`
MD5	`c91350e96b14c235443f33bb7ff8b26c`
BLAKE2b-256	`c1009113793f3d40c8bda12b6f0dfc64ddba0e85ee3e378a26f789cf22f55009`

Hashes for faststylometry-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1f70edfa53294da1dbac7537683bc86ead7dbd9177ecc07a992030c0d083a8ea`
MD5	`75a936e502068f3e4f6b1213ed40dbe0`
BLAKE2b-256	`ea6aac0c23015034b577ee55c42b9b8b361e5099650be7b270eb556d3f443533`