Same Text, Different Context: A package for visualizing contextual word embeddings

ST-DC: Same Text, Different Context

Same Text, Different Context is a Python package for visualizing contextual word embeddings and their neighbors across different contexts. It helps you understand how the meaning of a word shifts depending on its surrounding text. This is achieved by leveraging pretrained language models (like BERT) to extract embeddings for focus words and their context-specific neighbors, and visualizing them in 2D or 3D.


How It Works

Here's a quick explanation of how ST-DC works:

  1. Contextual Embedding Extraction:

    • The package uses a pretrained language model (e.g., BERT) to generate contextual embeddings for a focus word (e.g., pool) in various sentences (contexts).
    • For example:
      • Sentence 1: "The pool is open for swimming."
      • Sentence 2: "The pool of candidates is very competitive."
    • In each sentence, the focus word's embedding represents its meaning in that specific context.
  2. Neighbor Prediction:

    • The tool predicts words most semantically related to the focus word in each context using the model's [MASK] functionality.
    • For example:
      • In the sentence "The [MASK] is open for swimming," neighbors might include water, swim, and diving.
      • In the sentence "The [MASK] of candidates is very competitive," neighbors might include selection, group, and list.
  3. Dimensionality Reduction:

    • Embeddings are originally high-dimensional (e.g., 768 dimensions for BERT).
    • To make visualization possible, these embeddings are reduced to 2D or 3D using techniques like:
      • PCA: Preserves as much variance as possible.
      • t-SNE: Captures local relationships between points.
  4. Visualization:

    • Embeddings for the focus word and its neighbors are plotted in 2D or 3D space:
      • Focus Word Embeddings: Each context is represented as a unique point, color-coded for distinction.
      • Neighbor Words: Words semantically related to the focus word in each context are plotted near the corresponding focus word.

This visualization helps you explore how the meaning of the focus word varies between contexts.
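Steps 1–3 can be sketched with the Hugging Face transformers library (which this package uses under the hood) and scikit-learn. Note that the helper names below (focus_embedding, mask_neighbors) are illustrative, not ST-DC's actual API:

```python
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained(
    "bert-base-uncased", output_hidden_states=True
)
model.eval()

def focus_embedding(sentence: str, focus_word: str) -> torch.Tensor:
    """Step 1: contextual embedding of the focus word in this sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1][0]  # (seq_len, 768)
    focus_id = tokenizer.convert_tokens_to_ids(focus_word)
    position = (inputs["input_ids"][0] == focus_id).nonzero()[0, 0]
    return hidden[position]

def mask_neighbors(sentence: str, focus_word: str, k: int = 5) -> list[str]:
    """Step 2: top-k predictions when the focus word is masked out."""
    masked = sentence.replace(focus_word, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # (seq_len, vocab_size)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    top_ids = logits[mask_pos].topk(k).indices
    return [tokenizer.decode([int(i)]).strip() for i in top_ids]

sentences = [
    "The pool is open for swimming.",
    "The pool of candidates is very competitive.",
]
embeddings = torch.stack([focus_embedding(s, "pool") for s in sentences])
# Step 3: reduce the 768-dimensional vectors to 2D so they can be plotted.
points_2d = PCA(n_components=2).fit_transform(embeddings.numpy())
```

Each row of `points_2d` is one context's focus-word embedding projected into the plane; step 4 plots these points (plus the neighbors from `mask_neighbors`) with plotly.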

Features

  • Extract embeddings: Analyze how the meaning of a word changes in different contexts.
  • Visualize embeddings: Explore embeddings in 2D or 3D space using dimensionality reduction.
  • Neighbor words: Display words closely related to the focus word for each context.

Installation

To install ST-DC, run:

pip install st-dc

Requirements

  • Python 3.11

Usage

Example 1: Visualizing Word "Pool"

from context_explorer import viz

# Define the focus word and its contexts
focus_word = "pool"
sentences = [
    "The pool is open for swimming.",
    "The pool of candidates is very competitive.",
]

# Visualize embeddings
viz(
    word=focus_word,
    sentences=sentences,
    dim_technique="pca",  # Dimensionality reduction technique (e.g., "pca", "tsne")
    num_neighbors=5,      # Number of neighbors to display
    plot_type="3D"        # Visualization type: "2D" or "3D"
)

Example ST-DC Viz

The image above showcases the resulting visualization. It features the following:

  • An interactive 3D visualization
  • Two distinct colors to represent the different contexts
  • The embedding of the focus word (pool) in the different contexts
  • The n closest words for the focus word in each context
  • A legend to the side showing the color-context mapping

Example 2: Switching to t-SNE and 2D Visualization

focus_word = "king"
sentences = [
    "The king ruled the land wisely.",
    "The chess king was captured during the game.",
]

viz(
    word=focus_word,
    sentences=sentences,
    dim_technique="tsne",
    num_neighbors=5,
    plot_type="2D"
)

FAQ

1. What pretrained models are supported?

Currently, the package uses bert-base-uncased as the default pretrained model. Support for additional models (like RoBERTa or DistilBERT) can be added by modifying the EmbeddingExtractor class.
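The real EmbeddingExtractor interface is not documented on this page, so the class below is only a hypothetical stand-in. It illustrates the kind of change involved: pointing the tokenizer and masked-LM model at a different Hugging Face checkpoint.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

class EmbeddingExtractor:
    """Illustrative stand-in for the package's extractor class."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        # Any masked-LM checkpoint on the Hugging Face Hub works here,
        # as long as it exposes a [MASK]-style token.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForMaskedLM.from_pretrained(
            model_name, output_hidden_states=True
        )
        self.model.eval()

# Swapping in DistilBERT instead of the default BERT checkpoint:
extractor = EmbeddingExtractor("distilbert-base-uncased")
```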

2. Can I use my own sentences?

Yes, the sentences parameter in the viz function accepts any list of sentences.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • This package leverages the Hugging Face transformers library for pretrained language models.
  • Visualization is powered by plotly.
