A Python package for measuring the composition of complex datasets
Project description
sentropy: A Python package for measuring the composition of complex datasets
About
sentropy calculates similarity-sensitive entropies (S-entropy), plus traditional Shannon entropy and the other Rényi entropies (of which Shannon entropy is the best known) as special cases.
- Shannon entropy is a weighted sum of the relative probabilities of unique elements in a system (e.g. a dataset).
- Rényi entropies generalize Shannon entropy by allowing for different weightings (viewpoint parameter q).
- S-entropy generalizes Rényi entropies by incorporating elements' similarities and differences via a similarity matrix (often constructed using a similarity function).
- Exponentiating entropy yields effective-number/D-number forms, which put entropies in the same, natural units—effective numbers—among other advantages.
sentropycalculates multiple S-entropic measures, including $\alpha, \beta/\rho, \gamma$ at both the subset (classes) level and for the overall (data)set
For more background, see Leinster 2020 and references therein.
Installation
pip install sentropy
Basic usage
The workhorse function is sentropy.sentropy:
from sentropy import sentropy
sentropy's only required argument is a list-like object (e.g. a list, a numpy array) of relative frequencies P.
The most important optional arguments are:
similarity, which can be passed as a matrix or a function; the default is the identity matrix $I$q, the viewpoint parameter; default isq=1.measure, which can bealpha,beta,gamma, or others in the Leinster-Cobbold-Reeve (LCR) framework; the default isalphalevel, which can beoverall(a.k.a.dataset) orsubset(a.k.a.class); the default isoverall
Vanilla Shannon entropy
When the similarity matrix is the identity matrix---sentropy's default for similarity---there is no similarity between elements $i\neq j$ and S-entropy reduces to traditional (Rényi) entropy. At the default q=1, this is Shannon entropy. Therefore passing sentropy only a P yields Shannon entropy, in effective-number form.
from sentropy import sentropy
import numpy as np
P = np.array([0.7, 0.3]) # two unique elements comprising 70% and 30% of the dataset, respectively
D1 = sentropy(P) # S-entropy *without* similarity at default q (q=1) = Shannon entropy.
print(f"D1: {D1:.1f}") # Note defaults: level="both", measure="alpha", q=1.
H1 = sentropy(P, eff_no=False)# traditional form (as an entropy, not an effective number)
print(f"H1: {H1:.1f}")
Shannon-type (i.e. q=1) S-entropy
Passing a non-$I$ similarity results in S-entropy.
from sentropy import sentropy
import numpy as np
P = np.array([0.7, 0.3]) # same dataset as above
S = np.array([ # similarity matrix
[1. , 0.2], # 20% similar to each other
[0.2, 1. ],
])
D1Z = sentropy(P, similarity=S) # D-number form (preferred). Note defaults: level="both", measure="alpha", q=1.
print(f"D1Z: {D1Z:.1f}")
H1Z = sentropy(P, similarity=S, eff_no=False) # traditional form
print(f"H1Z: {H1Z:.1f}")
S-entropy with multiple measures and viewpoint parameters
To get results for multiple q (e.g. 0, 1, and $\infty$), multiple measures (e.g. alpha and beta), and/or both levels (overall and subset), pass a list-like object to the relevant argument; sentropy() returns an object with relevant values.
from sentropy import sentropy
import numpy as np
P = np.array([0.7, 0.3]) # same dataset as above
S = np.array([ # same similarity matrix as above
[1. , 0.2],
[0.2, 1. ],
])
qs = [0., 1., 2., np.inf] # multiple viewpoint parameters
ms = ["alpha", "beta", "gamma"] # multiple measures
DZ = sentropy(P, similarity=S, # S-entropy...
qs=qs, # ...at multple qs...
ms=ms) # ...for multiple measures
for q in qs:
for m in ms:
DqZ = DZ(q=q, m=m, which='overall') # D-number form (preferred)
HqZ = np.log(DqZ) # traditional form
print(f"D{q}Z {m}: {DqZ:.1f}")
print(f"H{q}Z {m}: {HqZ:.1f}")
Similarity on the fly
When the similarity matrix would be too large to hold in memory, a function can be passed to similarity.
from sentropy import sentropy
import numpy as np
# define a dataset consisting of two amino-acid sequences
elements = np.array(['CARDYW', 'CTRDYW'])
P = np.array([10, 1]) # the first is present 10 times; the second is present once
# define a similarity function where similarity decreases with edit distance between the sequences
from polyleven import levenshtein as edit_distance
def similarity_function(i, j): # i, j members of elements
return 0.3**edit_distance(i, j)
# calculate datset sentropy (at the defaults meausure="alpha" and q=1.)
D1Z = sentropy(P, similarity=similarity_function,
sfargs=elements) # sfargs contains arguments needed by the similarity_function
H1Z = np.log(D1Z) # traditional form
print(f"D1Z: {D1Z:.1f}")
print(f"H1Z: {H1Z:.1f}")
How well each of two classes represents the whole dataset
Suppose you have a dataset of fruits that has two classes, apples and oranges, and you want to know how representative each class is of the whole dataset. Representativeness ($\rho$) is the reciprocal of beta diversity, which measures distinctiveness.
from sentropy import sentropy
import numpy as np
# a dataset with two classes, "apples" and "oranges"
C1 = np.array([5, 3, 0, 0]) # apples; e.g. 5 McIntosh and 3 gala
C2 = np.array([0, 0, 6, 2]) # oranges; e.g. 6 navel and 2 cara cara
P = {"apples": C1, "oranges": C2} # package the classes as P
S = np.array([ # similarities of all elements, including between classes
[1., 0.8, 0.2, 0.1], # note here the non-zero similarity between apples and oranges
[0.8, 1., 0.1, 0.3],
[0.2, 0.1, 1., 0.9],
[0.1, 0.3, 0.9, 1. ],
])
D1Z = sentropy(P, similarity=S, level="subset", # level="subset" is identical; an alias/synonym
ms="normalized_rho")
R1 = D1Z(which="apples") # note, no need to pass a measure to "m" or a viewpoint to "q"
R2 = D1Z(which="oranges") # because D1Z only computed 1 measure and 1 viewpoint anyway
print(f"Normalized rho of class 1: {R1:.2f}")
print(f"Normalized rho of class 2: {R2:.2f}")
Relative S-entropies between two classes as a pandas DataFrame
Same dataset as above, except now results are returned as a dataframe. The similarity-sensitive version of traditional relative entropy at q=1 (a.k.a. Kullback-Leibler divergence, information divergence, etc.).
from sentropy import sentropy
import numpy as np
# a dataset with two classes, "apples" and "oranges"
C1 = np.array([5, 3, 0, 0]) # apples; e.g. 5 McIntosh and 3 gala
C2 = np.array([0, 0, 6, 2]) # oranges; e.g. 6 navel and 2 cara cara
P = {"apples": C1, "oranges": C2} # package the classes as P
S = np.array([ # similarities of all elements, including between classes
[1., 0.8, 0.2, 0.1], # note here the non-zero similarity between apples and oranges
[0.8, 1., 0.1, 0.3],
[0.2, 0.1, 1., 0.9],
[0.1, 0.3, 0.9, 1. ],
])
D1Z = sentropy(P, similarity=S,
return_dataframe=True)
display(D1Z) # S-entropies on the diagonals; relative S-entropies on the off-diagonals
Ordinariness
Suppose you have two datasets of animals. The first dataset consists of fish (a vertebrate) and ladybugs (an invertebrate). The second dataset consists of bees, butterflies, and lobsters—all invertebrates. The two datasets are disjoint—there are no fish or ladybugs in the second dataset—but genetically speaking, there are similarities. Suppose you want some measure of how similar each element of the first dataset is, to the second dataset: how much would each element "belong" in the second dataset. This is measured by ordinariness: ladybugs would be more "ordinary" in the second dataset, since it is an invertebrate. Strictly speaking this can be calculated without sentropy, but sentropy provides speedups (see documentation).
import numpy as np
P = np.array([5000, 2000, 3000]) # frequencies of a dataset of bees, butterflies, and lobsters, respectively
S_fish = np.array([0.22, 0.27, 0.28]) # fish's genetic similarities to bee, butterfly, and lobster
S_ladybug = np.array([0.60, 0.55, 0.45]) # ladybug's genetic similarities to each of these
S = np.stack([S_fish, S_ladybug])
S @ (P/P.sum()) # ordinariness of fish and ladybugs in the bees/butterflies/lobsters dataset
Availability and installation
sentropy is available on GitHub at https://github.com/ArnaoutLab/sentropy. It can be installed by running
pip install sentropy
from the command-line interface. The test suite runs successfully on Macintosh, Windows, and Unix systems. The unit tests (including a coverage report) can be run after installation by
pip install 'sentropy[tests]'
pytest --pyargs sentropy --cov sentropy
How to cite this work
If you use this package, please cite it as:
Nguyen et al., sentropy: A Python Package for Measuring The Composition of Complex Datasets. https://github.com/ArnaoutLab/diversity
Applications
For applications of the sentropy package to various fields (immunomics, metagenomics, medical imaging and pathology), we refer to the Jupyter notebooks below:
The examples in the Basic usage section are also made available as a notebook here. For more information, please see our preprint.
Alternatives
To date, we know of no other python package that implements the partitioned frequency- and similarity-sensitive diversity measures defined by Reeve at al.. However, there is a R package and a Julia package.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sentropy-0.1.1.tar.gz.
File metadata
- Download URL: sentropy-0.1.1.tar.gz
- Upload date:
- Size: 832.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1fac68ed2e892ed3b0606516322940b413022810d099f8e7832b87c91f170f82
|
|
| MD5 |
154eb9534022c6abb8fd53f0642b927a
|
|
| BLAKE2b-256 |
e28f5851232c2c407075132c90c25710ff89edfe59f52ace8ada78fc5bd068f6
|
File details
Details for the file sentropy-0.1.1-py3-none-any.whl.
File metadata
- Download URL: sentropy-0.1.1-py3-none-any.whl
- Upload date:
- Size: 51.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
70a9a0e0b73dadeacbbd6447157c3c5a7cf3cd2edd9f505169466b176aed050c
|
|
| MD5 |
a5db2e178f27cb11760b0e9a13b0d1ce
|
|
| BLAKE2b-256 |
c46e35a36735784ffe4a6283e1c3481e6e68a05118a2181e026ff688ac3fec78
|