No project description provided
Project description
Entropy-conVergence Metric
The following is a full implementation of the convergence-entropy measurement framework (here referred to as Entropy-conVergence Metric, or EVM) as described in Rosen and Dale 2023. We've taken strides to make this package as easy to implement as possible.
At its core there are two things that researchers need to define for the most basic version of the EVM algorithm. Those are:
- A word vector model (we suggest those implemented in the HuggingFace transformer library)
- An entropy object
Demonstrating how this would look in code:
from EVM import EVM, languageModelLayers
from transformers import AutoTokenizer, AutoModel
wv = languageModelLayers('roberta-base', layers=[7])
mod = EVM(
word_vector_model = wv
)
mod('No political action can be taken without first consulting the populace', 'it is the prerogative of the political class to make decisions'), \
mod('No political action can be taken without first consulting the populace', 'Ultimately the populace has to have the right to decide their own fate')
The above returns the entropy for the first sentence upon having read the second sentence, consistent with the equations and formulation listed in Rosen and Dale 2023.
End-to-End data exploration using Convergence-Entropy
The Convergence Entropy Data Analysis package (CEDA) is an end-to-end shell designed to facilitate GPU assisted, fast analyses. Implementationally, it is quite simple.
from CEDA import ceda_model
GRAPH = ceda_model(
sigma=1.,
device='cuda',
wv_model='roberta-base',
wv_layers=[7]
)
Where wv_model
is parameter for the string name of any model available from the HuggingFace library of language models, and level selects which hidden layers to attenuate to when producing a representation of lexical units.
On run-time, one can pass to the model a list of strings for both utterances $x$ and $y$ (see Rosen & Dale 2023 for terminology).
x_sentences = ['a list of', 'sentences']
y_sentences = ['another list', 'of more sentences']
GRAPH.fit(x_sentences, y_sentences)
GRAPH.add_meta_data(
meta_data = [
{'a': 1, 'b': 2}, # a records oriented json object for metadata for each comparison in the graph.
# Not required.
]
)
If there are repeated sentences in x_sentence
or y_sentences
it is worthwhile to sort the lists so as to cluster repetitions. This is because the model is designed to optimize for repeated examples by only generating vector representations for an utterance once in either the $x$ or $y$ variable so long as that element is repeated. Thus if you have a list of comparisons
x_sentences = ['Sentence A', 'Sentence A', 'Sentence B']
y_sentences = ['sentence c', 'sentence d', 'sentence e']
ceda_model
will generate vectors for 'Sentence A' once, until it reaches 'Sentence B' in x_sentences
.
This creates a graph that can be used for any number of convergence-entropy based analyses, complete with metadata.
Finally, one can save a checkpoint for the graph using the checkpoint function
CKPT_PATH = 'path/to/checkpoint.pt'
GRAPH.checkpoint(CKPT_PATH)
Saving a graph using the .checkpoint()
has the added benefit, too, of allowing you to load a CEDA graph object from that checkpoint later.
CKPT_PATH = 'path/to/checkpoint.pt'
GRAPH.load_from_checkpoint(CKPT_PATH)
You can also create a dataframe for the graph to be saved as a .csv or other format.
CKPT_PATH = 'path/to/checkpoint.csv'
df = GRAPH.graph_df()
df.to_csv(CKPT_PATH, index=False, encoding='utf-8')
A number of built-in visualization tools are included as well in the EDA
package. We will expand on that documentation at a later date.
Citation for usage
Any use of this package should cite it using the following:
@article{rosen_dale_berts_2023,
title = {{BERTs} of a feather: {Studying} inter- and intra-group communication via information theory and language models},
shorttitle = {{BERTs} of a feather},
url = {https://link.springer.com/10.3758/s13428-023-02267-2},
doi = {10.3758/s13428-023-02267-2},
journal = {Behavior Research Methods},
author = {Rosen, Zachary P and Dale, Rick},
year = {2023},
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file convergence_entropy_metric-1.0.2.tar.gz
.
File metadata
- Download URL: convergence_entropy_metric-1.0.2.tar.gz
- Upload date:
- Size: 21.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7b3a2a0c8de1d93c471795abde5f5ee9b842b9e1283aaf10c4c065e7b10eadb9 |
|
MD5 | 08747bf5bfd5223c0424883f56eec747 |
|
BLAKE2b-256 | 4e0f8fdac222db2af15c78a580571a1a6719ab1fc55417ed713b4ebeb85b4777 |
File details
Details for the file convergence_entropy_metric-1.0.2-py3-none-any.whl
.
File metadata
- Download URL: convergence_entropy_metric-1.0.2-py3-none-any.whl
- Upload date:
- Size: 28.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 261830a7114b3852749f9c2a40558c5291adc8bcae492648b08afbc9652b27c8 |
|
MD5 | 48572860c60807fcad167ed2508f0919 |
|
BLAKE2b-256 | 016e3576f8ccfa94fb65d2a4bf4f71ae0bc7a8e6c185860e46a67275690b638f |