A Python package for measuring the composition of complex datasets

sentropy: A Python package for measuring the composition of complex datasets

About

sentropy calculates similarity-sensitive entropies (S-entropies). The traditional Rényi entropies, of which Shannon entropy is the best known, are included as special cases.

  • Shannon entropy, $-\sum_i p_i \log p_i$, is the average surprise ($-\log p_i$) of the unique elements in a system (e.g. a dataset), weighted by their relative frequencies $p_i$.
  • Rényi entropies generalize Shannon entropy by allowing for different weightings (viewpoint parameter q).
  • S-entropy generalizes Rényi entropies by incorporating elements' similarities and differences via a similarity matrix (often constructed using a similarity function).
  • Exponentiating entropy yields effective-number/D-number forms, which put entropies in the same, natural units—effective numbers—among other advantages.
  • sentropy calculates multiple S-entropic measures, including $\alpha$, $\beta/\rho$, and $\gamma$, at both the subset (class) level and the overall (dataset) level.
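
The definitions in the list above can be sketched in a few lines of plain Python. This is an illustration of the Leinster-Cobbold formula $D_q^Z(p) = \left(\sum_i p_i (Zp)_i^{q-1}\right)^{1/(1-q)}$, not sentropy's implementation; the function name `similarity_sensitive_diversity` is made up for this example. With the identity matrix as $Z$, the logarithm of the q=1 value is Shannon entropy.

```python
import math

def similarity_sensitive_diversity(p, Z, q):
    """Effective number D_q^Z(p) for relative frequencies p and similarity matrix Z.

    Hypothetical helper for illustration only; not part of sentropy.
    """
    n = len(p)
    # (Zp)_i = sum_j Z[i][j] * p[j]: how "ordinary" element i is in the dataset
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    support = [i for i in range(n) if p[i] > 0]  # absent elements contribute nothing
    if q == 1:
        # limit q -> 1: exponential of the similarity-sensitive Shannon entropy
        return math.exp(-sum(p[i] * math.log(zp[i]) for i in support))
    if q == math.inf:
        # limit q -> infinity: only the most ordinary element matters
        return 1.0 / max(zp[i] for i in support)
    return sum(p[i] * zp[i] ** (q - 1) for i in support) ** (1.0 / (1.0 - q))

identity = [[1.0, 0.0], [0.0, 1.0]]   # no similarity between distinct elements
p = [0.7, 0.3]

shannon = -sum(x * math.log(x) for x in p)          # traditional Shannon entropy (nats)
d1 = similarity_sensitive_diversity(p, identity, 1)  # its effective-number form
profile = {q: similarity_sensitive_diversity(p, identity, q)
           for q in (0, 1, 2, math.inf)}

print(f"log(D1) = {math.log(d1):.4f}, Shannon = {shannon:.4f}")
for q, d in profile.items():
    print(f"q={q}: D={d:.3f}")
```

Larger q puts more weight on common elements, so the effective number falls from 2 (the count of unique elements, at q=0) towards 1/0.7 (at q=$\infty$).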

For more background, see Leinster 2020 and references therein.

Installation

pip install sentropy

Basic usage

The workhorse function is sentropy.sentropy:

from sentropy import sentropy

sentropy's only required argument is a list-like object (e.g. a list, a numpy array) of relative frequencies P.

The most important optional arguments are:

  • similarity, which can be passed as a matrix or a function; the default is the identity matrix $I$
  • q, the viewpoint parameter; default is q=1.
  • measure, which can be alpha, beta, gamma, or others in the Leinster-Cobbold-Reeve (LCR) framework; the default is alpha
  • level, which can be overall (a.k.a. dataset) or subset (a.k.a. class); the default is overall

Vanilla Shannon entropy

When the similarity matrix is the identity matrix (sentropy's default for similarity), there is no similarity between elements $i\neq j$ and S-entropy reduces to traditional (Rényi) entropy. At the default q=1, this is Shannon entropy. Therefore, passing sentropy only P yields Shannon entropy in effective-number form.

from sentropy import sentropy
import numpy as np

P = np.array([0.7, 0.3])      # two unique elements comprising 70% and 30% of the dataset, respectively
D1 = sentropy(P)              # S-entropy *without* similarity at default q (q=1) = Shannon entropy.
print(f"D1: {D1:.1f}")        # Note defaults: level="overall", measure="alpha", q=1.

H1 = sentropy(P, eff_no=False)  # traditional form (as an entropy, not an effective number)
print(f"H1: {H1:.1f}")

Shannon-type (i.e. q=1) S-entropy

Passing a non-$I$ similarity results in S-entropy.

from sentropy import sentropy
import numpy as np

P = np.array([0.7, 0.3])                      # same dataset as above
S = np.array([                                # similarity matrix
  [1. , 0.2],                                 # 20% similar to each other
  [0.2, 1. ],
  ])
D1Z = sentropy(P, similarity=S)               # D-number form (preferred). Note defaults: level="overall", measure="alpha", q=1.
print(f"D1Z: {D1Z:.1f}")

H1Z = sentropy(P, similarity=S, eff_no=False) # traditional form
print(f"H1Z: {H1Z:.1f}")
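
As a cross-check, the q=1 values for this two-element dataset can be worked out by hand. The sketch below uses only the standard library and assumes natural logarithms (consistent with the use of np.log elsewhere in these examples). It illustrates the arithmetic only; sentropy's own output has not been verified here.

```python
import math

P = [0.7, 0.3]
S = [[1.0, 0.2],
     [0.2, 1.0]]

# Without similarity: plain Shannon entropy and its effective number
H1 = -sum(p * math.log(p) for p in P)   # about 0.61 nats
D1 = math.exp(H1)                       # about 1.84 effective elements

# With similarity: replace each p_i inside the log by (SP)_i = sum_j S[i][j] * P[j]
SP = [sum(S[i][j] * P[j] for j in range(2)) for i in range(2)]   # [0.76, 0.44]
H1Z = -sum(P[i] * math.log(SP[i]) for i in range(2))             # about 0.44 nats
D1Z = math.exp(H1Z)                                              # about 1.55

print(f"D1:  {D1:.2f}   H1:  {H1:.2f}")
print(f"D1Z: {D1Z:.2f}   H1Z: {H1Z:.2f}")
```

Accounting for the 20% similarity lowers the effective number from about 1.84 to about 1.55: the two elements are partly redundant, so the dataset counts as less diverse.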

S-entropy with multiple measures and viewpoint parameters

To get results for multiple q (e.g. 0, 1, and $\infty$), multiple measures (e.g. alpha and beta), and/or both levels (overall and subset), pass a list-like object to the relevant argument. In that case sentropy() returns a callable object; call it with a viewpoint parameter, a measure, and a level to retrieve each value.

from sentropy import sentropy
import numpy as np

P = np.array([0.7, 0.3])                      # same dataset as above
S = np.array([                                # same similarity matrix as above
  [1. , 0.2],
  [0.2, 1. ],
  ])
qs = [0., 1., 2., np.inf]                     # multiple viewpoint parameters
ms = ["alpha", "beta", "gamma"]               # multiple measures
DZ = sentropy(P, similarity=S,                # S-entropy...
              qs=qs,                          #   ...at multiple qs...
              ms=ms)                          #   ...for multiple measures
for q in qs:
  for m in ms:
    DqZ = DZ(q=q, m=m, which='overall')       # D-number form (preferred)
    HqZ = np.log(DqZ)                         # traditional form
    print(f"D{q}Z {m}: {DqZ:.1f}")
    print(f"H{q}Z {m}: {HqZ:.1f}")

Similarity on the fly

When the similarity matrix would be too large to hold in memory, a function can be passed to similarity.

from sentropy import sentropy
import numpy as np

# define a dataset consisting of two amino-acid sequences
elements = np.array(['CARDYW', 'CTRDYW'])
P = np.array([10, 1])                                   # the first is present 10 times; the second is present once

# define a similarity function where similarity decreases with edit distance between the sequences
from polyleven import levenshtein as edit_distance
def similarity_function(i, j):                          # i, j members of elements
    return 0.3**edit_distance(i, j)

# calculate dataset sentropy (at the defaults measure="alpha" and q=1)
D1Z = sentropy(P, similarity=similarity_function,
               sfargs=elements)                         # sfargs contains arguments needed by the similarity_function
H1Z = np.log(D1Z)                                       # traditional form
print(f"D1Z: {D1Z:.1f}")
print(f"H1Z: {H1Z:.1f}")
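
What "on the fly" means can be sketched without sentropy or polyleven: each row of the similarity matrix is evaluated when needed, multiplied into the frequencies, and discarded, so the full matrix is never stored. The `edit_distance` below is a plain-Python stand-in for polyleven's levenshtein, written for this illustration. How sentropy itself schedules the computation is not shown here.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution (free on a match)
        prev = curr
    return prev[-1]

def similarity_function(a, b):
    return 0.3 ** edit_distance(a, b)

elements = ["CARDYW", "CTRDYW"]
counts = [10, 1]
total = sum(counts)
p = [c / total for c in counts]   # relative frequencies

# One row at a time: (Zp)_a = sum_b s(a, b) * p_b, with no matrix kept in memory
ordinariness = [
    sum(similarity_function(a, b) * p_b for b, p_b in zip(elements, p))
    for a in elements
]

H1Z = -sum(p_a * math.log(o) for p_a, o in zip(p, ordinariness))
D1Z = math.exp(H1Z)
print(f"D1Z: {D1Z:.2f}")   # about 1.16: two sequences, but one rare and similar to the other
```

Memory use then grows with the number of elements rather than its square, at the cost of calling the similarity function for every pair.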

How well each of two classes represents the whole dataset

Suppose you have a dataset of fruits that has two classes, apples and oranges, and you want to know how representative each class is of the whole dataset. Representativeness ($\rho$) is the reciprocal of beta diversity, which measures distinctiveness.

from sentropy import sentropy
import numpy as np

# a dataset with two classes, "apples" and "oranges"
C1 = np.array([5, 3, 0, 0])                   # apples; e.g. 5 McIntosh and 3 gala
C2 = np.array([0, 0, 6, 2])                   # oranges; e.g. 6 navel and 2 cara cara
P  = {"apples": C1, "oranges": C2}            # package the classes as P
S = np.array([                                # similarities of all elements, including between classes
  [1.,  0.8, 0.2, 0.1],                       #    note here the non-zero similarity between apples and oranges
  [0.8, 1.,  0.1, 0.3],
  [0.2, 0.1, 1.,  0.9],
  [0.1, 0.3, 0.9, 1. ],
  ])

D1Z = sentropy(P, similarity=S, level="subset",            # level="class" is identical; an alias/synonym
               ms="normalized_rho")
R1 = D1Z(which="apples")                                   # note, no need to pass a measure to "m" or a viewpoint to "q"
R2 = D1Z(which="oranges")                                  # because D1Z only computed 1 measure and 1 viewpoint anyway
print(f"Normalized rho of class 1: {R1:.2f}")
print(f"Normalized rho of class 2: {R2:.2f}")
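
For intuition, normalized $\rho$ at q=1 can be computed by hand. The sketch below assumes the definition in Reeve et al.: the weighted geometric mean, over the elements of a class, of $(Zp)_i / (Z\bar{P}_{\cdot j})_i$, where $p$ holds the pooled relative frequencies and $\bar{P}_{\cdot j}$ the relative frequencies within class $j$. The helper name `normalized_rho_q1` is made up for this example, and the result has not been checked against sentropy's output.

```python
import math

C1 = [5, 3, 0, 0]   # apples
C2 = [0, 0, 6, 2]   # oranges
S = [[1.0, 0.8, 0.2, 0.1],
     [0.8, 1.0, 0.1, 0.3],
     [0.2, 0.1, 1.0, 0.9],
     [0.1, 0.3, 0.9, 1.0]]

def normalized_rho_q1(subset_counts, all_counts, Z):
    """Normalized representativeness of one class at q=1 (hypothetical helper)."""
    n = len(all_counts)
    p = [c / sum(all_counts) for c in all_counts]              # pooled frequencies
    p_sub = [c / sum(subset_counts) for c in subset_counts]    # within-class frequencies
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    zp_sub = [sum(Z[i][j] * p_sub[j] for j in range(n)) for i in range(n)]
    # weighted geometric mean over the elements present in the class
    log_rho = sum(p_sub[i] * math.log(zp[i] / zp_sub[i])
                  for i in range(n) if p_sub[i] > 0)
    return math.exp(log_rho)

pooled = [a + b for a, b in zip(C1, C2)]
rho_apples = normalized_rho_q1(C1, pooled, S)
rho_oranges = normalized_rho_q1(C2, pooled, S)
print(f"apples: {rho_apples:.2f}, oranges: {rho_oranges:.2f}")   # both about 0.59
```

If apples and oranges shared no similarity at all, each class would score exactly 0.5, its share of the dataset. The cross-class similarities in S raise both values to about 0.59.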

Relative S-entropies between two classes as a pandas DataFrame

This uses the same dataset as above, except that the results are now returned as a dataframe. The off-diagonal entries are relative S-entropies: the similarity-sensitive version of traditional relative entropy at q=1 (a.k.a. Kullback-Leibler divergence, information divergence, etc.).

from sentropy import sentropy
import numpy as np

# a dataset with two classes, "apples" and "oranges"
C1 = np.array([5, 3, 0, 0])                   # apples; e.g. 5 McIntosh and 3 gala
C2 = np.array([0, 0, 6, 2])                   # oranges; e.g. 6 navel and 2 cara cara
P  = {"apples": C1, "oranges": C2}            # package the classes as P
S = np.array([                                # similarities of all elements, including between classes
  [1.,  0.8, 0.2, 0.1],                       #    note here the non-zero similarity between apples and oranges
  [0.8, 1.,  0.1, 0.3],
  [0.2, 0.1, 1.,  0.9],
  [0.1, 0.3, 0.9, 1. ],
  ])

D1Z = sentropy(P, similarity=S,
               return_dataframe=True)

display(D1Z)                              # in a Jupyter notebook; use print(D1Z) in a plain script
                                          # S-entropies on the diagonal; relative S-entropies off the diagonal

Ordinariness

Suppose you have two datasets of animals. The first dataset consists of fish (vertebrates) and ladybugs (invertebrates). The second dataset consists of bees, butterflies, and lobsters, all of which are invertebrates. The two datasets are disjoint (there are no fish or ladybugs in the second dataset), but genetically speaking there are similarities. Suppose you want a measure of how similar each element of the first dataset is to the second dataset, that is, how much each element would "belong" in the second dataset. This is measured by ordinariness: ladybugs would be more "ordinary" in the second dataset, since they are invertebrates. Strictly speaking, this can be calculated without sentropy, but sentropy provides speedups (see documentation).

import numpy as np
P = np.array([5000, 2000, 3000])             # frequencies of a dataset of bees, butterflies, and lobsters, respectively
S_fish    = np.array([0.22, 0.27, 0.28])     # fish's genetic similarities to bee, butterfly, and lobster
S_ladybug = np.array([0.60, 0.55, 0.45])     # ladybug's genetic similarities to each of these
S = np.stack([S_fish, S_ladybug])
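ordinariness = S @ (P / P.sum())             # ordinariness of fish and ladybugs in the bees/butterflies/lobsters dataset
print(ordinariness)                          # approximately [0.248, 0.545]: ladybugs are the more ordinary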

Availability and installation

sentropy is available on GitHub at https://github.com/ArnaoutLab/sentropy. It can be installed by running

pip install sentropy

from the command line. The test suite runs successfully on Macintosh, Windows, and Unix systems. After installation, the unit tests (including a coverage report) can be run with

pip install 'sentropy[tests]'
pytest --pyargs sentropy --cov sentropy

How to cite this work

If you use this package, please cite it as:

Nguyen et al., sentropy: A Python Package for Measuring The Composition of Complex Datasets. https://github.com/ArnaoutLab/diversity

Applications

For applications of the sentropy package to various fields (immunomics, metagenomics, medical imaging and pathology), we refer to the Jupyter notebooks below:

The examples in the Basic usage section are also made available as a notebook here. For more information, please see our preprint.

Alternatives

To date, we know of no other Python package that implements the partitioned frequency- and similarity-sensitive diversity measures defined by Reeve et al. However, there is an R package and a Julia package.
