
PyDistintoX

[Screenshots: Zeta SD2, NFC, Heatmap]

This is a reimplementation of fundamental functionalities of "Pydistinto", a software project developed at the [TCDH](https://tcdh.uni-trier.de/). The original script can be found here. The following major changes are worth mentioning:

  • using Gensim and NumPy instead of Pandas to construct term-document matrices; Gensim can stream corpora that do not fit into RAM, and NumPy/SciPy handle sparse matrices with lower memory requirements,

  • using Gensim's built-in tf-idf functions to add more tf-idf-related measures of distinctiveness (see below),

This is a lite version of the original Pydistinto: we offer fewer features, e.g. no randomization of texts. However, if your corpus reaches a size that the original Pydistinto struggles with, it may be worth trying our solution.

PyDistintoX was developed by Leon Glüsing and Stefan Heßbrüggen-Walter. Original development was funded by the Deutsche Forschungsgemeinschaft via the project "Prodatphil: Science and Logic", project no. 537184692.


Overview

PyDistintoX can be used in two ways: as a standalone command-line application or as a Python library (see the corresponding sections below).

Furthermore, we support installation via uv as well as via pip.

Prerequisites

Choose either:

| uv (recommended) | Pure Python |
|---|---|
| install here | >= 3.11.1, < 3.13 |

Note: Replace python in the commands below with your specific version if needed.

Clone the Repository

Open a terminal and clone the repository:

git clone https://gitlab.com/leongluesing/pydistintox.git
cd pydistintox

Alternatively, you may download the code directly and navigate to the project directory in a terminal.

Standalone Application

Run analyses directly from the command line.


Installation

Download the spaCy model of your choice. You will need the model name later when using the CLI.

uv (recommended)

Install dependencies and activate the virtual environment:

uv venv
source .venv/bin/activate
uv pip install -e .

(use .\.venv\Scripts\activate on Windows to activate the environment)

Download spacy model of your choice:

uv pip install $(spacy info <MODEL_NAME> --url) 

On Windows (PowerShell), use uv pip install (spacy info <MODEL_NAME> --url) instead.

pip

Create a virtual environment (if it doesn't already exist):

python -m venv .venv
source .venv/bin/activate

On Windows, use .\.venv\Scripts\activate to activate environment

Now install PyDistintoX and the spaCy model of your choice

pip install -e .
pip install $(spacy info <MODEL_NAME> --url)

Quickstart

uv run pydistintox --example

or, equivalently,

python -m pydistintox --example

runs the application with texts by Arthur Conan Doyle that are included in the installation. You can find them in data/texts/example.


Usage

In the example data, Doyle's detective novels are compared to his other novels in order to identify words that are distinctive for this genre. You can either:

  • place your target corpus files in data/texts/tar and your reference corpus files in data/texts/ref, or
  • specify custom directories via command line options (see Command Line Options below).

The application processes your texts in three steps:

  1. NLP Processing: Texts are tokenized and lemmatized. If the --save-nlp flag is set, the results are saved as JSON in data/interim/json.
  2. Statistical Analysis: Various distinctiveness measures are calculated.
  3. Results Visualization: The results are automatically displayed in your default browser.

If you have already created the JSON files (e.g. from a previous run), you can skip the first step by setting the --load-nlp flag (see Command Line Options below).
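For example, a full run followed by a rerun that reuses the saved NLP results might look like this (the directory paths are the defaults named in this section; your own corpora and model may differ):

```shell
# full run: tokenize and lemmatize with spaCy, keep the JSON output
uv run pydistintox --input-tar data/texts/tar --input-ref data/texts/ref \
    --model en_core_web_sm --save-nlp

# later run: skip the NLP step and load the saved results
uv run pydistintox --load-nlp data/interim/json
```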

You can run the application via either:

| uv (recommended) | pip |
|---|---|
| uv run pydistintox | python -m pydistintox |

Command Line Options

  • --debug changes logging mode to verbose.
  • --load-nlp path/to/json/dir skips tokenization and lemmatization and loads NLP results from the given path. Make sure path/to/json/dir contains the folders tar and ref. If no path is specified, the default data/interim/json is used.
  • --save-nlp saves the NLP results as JSON in data/interim/json (see above).
  • --example uses the example data included with the installation to explore the program.
  • --input-tar path/to/target/corpus specifies the directory for the target corpus. Give the directory, not a path to a file! (Escape spaces in the path if necessary.)
  • --input-ref path/to/reference/corpus specifies the directory for the reference corpus. Give the directory, not a path to a file! (Escape spaces in the path if necessary.)
  • --model spacy_model_name specifies the spaCy model name in the format {lang}_core_{dataset}_{size}. Example: en_core_web_sm (lang=en, dataset=web, size=sm). Available sizes: sm, md, lg, trf. Find models here.
  • --raw-scores disables scaling of scores to [-1, 1].

Python library

Import functions for custom workflows.


Installation

  • If you are not using a virtual environment yet, create one with either:

    • uv venv (uv)
    • python -m venv .venv (pure Python)

    Activate it with:

    • source .venv/bin/activate (Linux/macOS)
    • .\.venv\Scripts\activate (Windows)

Then run either:

| | uv (recommended) | pip |
|---|---|---|
| Install | uv pip install . | pip install . |
| Download model | uv pip install $(spacy info <MODEL_NAME> --url) | pip install $(spacy info <MODEL_NAME> --url) |
| Check installation | uv pip show pydistintox | pip show pydistintox |

Quickstart

Installation

# Assuming you do not have an active virtual environment
python3.11 -m venv .venv
source .venv/bin/activate
# and that you are in the root directory of the
# pydistintox project. You may install it via
pip install .
# You can check your installation via
pip show pydistintox

# download a spaCy model
pip install $(spacy info en_core_web_sm --url)

Usage

from pathlib import Path
from pydistintox import *

config = Config(
    skip_nlp=False,
    save_nlp=JSON,
    spacy_model_name='en_core_web_sm',
    target=INPUT_TAR,
    reference=INPUT_REF,
    debug=True,
    source='YOUR COMMENT HERE',
    measures_to_calculate=[],  # tf-idf measures only; selecting non-tf-idf measures here is not yet implemented
    scaling=True,
)
non_tf_idf_measures_to_calculate = ['zeta_sd2']  # only calculate zeta_sd2

...

For a complete example, see the script: demo.py

Available Import Functions

Key functions for programmatic use (import via from pydistintox import ...). Full documentation: FUNCTIONS.md


Remove PyDistintoX

| | uv | pip |
|---|---|---|
| Uninstall | uv pip uninstall pydistintox | pip uninstall pydistintox |
| Clean cache | uv cache clean | pip cache purge |
| Remove model | uv pip uninstall en_core_web_sm | pip uninstall en_core_web_sm |

You may also delete the virtual environment entirely, which removes all dependencies and downloaded models as well: first deactivate the virtual environment (deactivate), then delete its folder (rm -r .venv, or rmdir /s .venv on Windows).


Troubleshooting

Installation of spaCy Model

If command not found: spacy pops up, pydistintox is not yet correctly installed in your virtual environment (or the virtual environment is not active). Try:

source .venv/bin/activate
(uv) pip install .
(uv) pip install $(spacy info <MODEL_NAME> --url) 

If the installation of a spaCy model via (uv) pip install <MODEL_NAME> still fails, you may download the model by some other means and run (uv) pip install path/to/model instead.


Technical Reference

The distinctiveness measures provided are scaled to the range [-1, 1] by default for visualization, but this scaling does not make them statistically comparable. Each measure rests on different underlying assumptions, mathematical formulations, and distributions (e.g., frequency-based vs. dispersion-based). Direct comparison of values across measures is therefore not meaningful unless one explicitly accounts for the differing mathematical foundations.

Non-TF-IDF Measures

Using the Gensim tf-idf model as a helper function, PyDistintoX calculates absolute and binary frequencies. This is tf-idf with the idf component set to 1, i.e. disregarding document frequency, so strictly speaking tf-no-idf. Based on these data, the following distinctiveness measures are calculated.

The following table is adapted from the foundational article on which this software is based: "Evaluation of Measures of Distinctiveness", by Keli Du, Julia Dudar, Christof Schöch (Wayback Machine). For more information see here (Wayback Machine)

| Name | Type of measure | References | Evaluated in | Implementation Key |
|---|---|---|---|---|
| TF-IDF | Term weighting | Luhn 1957; Spärck Jones 1972 | Salton and Buckley 1988 | |
| Ratio of relative frequencies (RRF) | Frequency-based | Damerau 1993 | Gries 2010 | rrf_dr0 |
| Chi-squared test (χ²) | Frequency-based | Dunning 1993 | Lijffijt et al. 2014 | chi_square_value |
| Log-likelihood ratio test (LLR) | Frequency-based | Dunning 1993 | Egbert and Biber 2019; Paquot and Bestgen 2009; Lijffijt et al. 2014 | LLR_value |
| Welch's t-test (Welch) | Distribution-based | Welch 1947 | Paquot and Bestgen 2009; Lijffijt et al. 2014 | welch_t_value |
| Wilcoxon rank sum test (Wilcoxon) | Dispersion-based | Wilcoxon 1945; Mann and Whitney 1947 | Paquot and Bestgen 2009; Lijffijt et al. 2014 | ranksumtest_value |
| Burrows Zeta (Zeta_orig) | Dispersion-based | Burrows 2007; Craig and Kinney 2009 | Schöch 2018 | zeta_sd0 |
| Logarithmic Zeta (Zeta_log) | Dispersion-based | Schöch 2018 | Schöch 2018; Du et al. 2021 | zeta_sd2 |
| Eta | Dispersion-based | Du et al. 2021 | Du et al. 2021 | eta_sg0 |
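As a numeric sketch of the dispersion-based idea behind Burrows Zeta (zeta_sd0): the score is the proportion of target segments containing a word minus the proportion of reference segments containing it. This illustrates the formula on made-up data and is not PyDistintoX's implementation:

```python
import numpy as np

# Binary term-segment matrices: rows = words, columns = segments,
# 1 = the word occurs at least once in that segment.
tar = np.array([[1, 1, 0],      # word 0 occurs in 2 of 3 target segments
                [0, 1, 1]])     # word 1 occurs in 2 of 3 target segments
ref = np.array([[0, 1, 0, 0],   # word 0 occurs in 1 of 4 reference segments
                [1, 1, 1, 1]])  # word 1 occurs in all reference segments

zeta = tar.mean(axis=1) - ref.mean(axis=1)  # Burrows Zeta per word
print(zeta)  # word 0 ≈ 0.417 (distinctive for target), word 1 ≈ -0.333
```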

TF-IDF Measures

PyDistintoX uses the Gensim tf-idf model to calculate distinctiveness scores. The following smartirs parameters are used:

  • 'nfn',
  • 'nfc',
  • 'bfn',
  • 'afn',
  • 'lfn',
  • 'ltc'

For more information on the meaning of these parameters, see Wikipedia (Wayback Machine) or the following paragraph:

Explanation from the Gensim Documentation

From the gensim documentation:

smartirs (str, optional) –

SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for representing a combination of weights takes the form XYZ, for example ‘ntc’, ‘bpn’ and so on, where the letters represents the term weighting of the document vector.

Term frequency weighing:

  • b - binary,
  • t or n - raw,
  • a - augmented,
  • l - logarithm,
  • d - double logarithm,
  • L - log average.

Document frequency weighting:

  • x or n - none,
  • f - idf,
  • t - zero-corrected idf,
  • p - probabilistic idf.

Document normalization:

  • x or n - none,
  • c - cosine,
  • u - pivoted unique,
  • b - pivoted character length.

Default is ‘nfc’. For more information visit SMART Information Retrieval System.

SpaCy and Gensim

While Gensim features its own rule-based tokenizer, we provide the opportunity to plug in spaCy models for tokenization and lemmatization. The results of this process are saved as JSON files in spaCy's own format. We then extract the lemmas from these data and recast them as Gensim input documents (basically lists of words forming sentences, which in turn are combined into a list to form a document).
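A sketch of this recasting step (the to_gensim_doc helper is illustrative, not part of the package; the field names follow spaCy's Doc.to_json() output):

```python
# Recast spaCy JSON output (one dict per text) into a Gensim input
# document: a list of sentences, each a list of lemmas.
def to_gensim_doc(spacy_json):
    tokens = spacy_json["tokens"]
    doc = []
    for sent in spacy_json.get("sents", []):
        # collect lemmas of tokens whose character offset falls in this sentence
        doc.append([t["lemma"] for t in tokens
                    if sent["start"] <= t["start"] < sent["end"]])
    return doc

# Minimal stand-in for a spaCy Doc.to_json() result:
sample = {
    "tokens": [
        {"start": 0, "end": 6, "lemma": "holmes"},
        {"start": 7, "end": 15, "lemma": "examine"},
    ],
    "sents": [{"start": 0, "end": 15}],
}
print(to_gensim_doc(sample))  # [['holmes', 'examine']]
```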
