
Aligned Neural Topic Model for Exploring Evolving Topics



ANTM

ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics


Dynamic topic models are effective methods that primarily focus on studying the evolution of topics present in a collection of documents. These models are widely used for understanding trends, exploring public opinion in social networks, or tracking research progress and discoveries in scientific archives. Since topics are defined as clusters of semantically similar documents, it is necessary to observe the changes in the content or themes of these clusters in order to understand how topics evolve as new knowledge is discovered over time. Here, we introduce a dynamic neural topic model called ANTM, which uses document embeddings (data2vec) to compute clusters of semantically similar documents at different periods, and aligns document clusters to represent their evolution. This alignment procedure preserves the temporal similarity of document clusters over time and captures the semantic change of words characterized by their context within different periods. Experiments on four different datasets show that ANTM outperforms probabilistic dynamic topic models (e.g. DTM, DETM) and significantly improves topic coherence and diversity over other existing dynamic neural topic models (e.g. BERTopic).
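To make the alignment idea concrete, here is a minimal, purely illustrative sketch (not the ANTM implementation): documents are grouped into overlapping time windows, clustered within each window, and clusters of consecutive windows are linked when they share documents from the overlap. All names below are hypothetical, and window_size is assumed to be larger than overlap.

def make_windows(times, window_size, overlap):
    #split the sorted list of unique time values into overlapping windows
    step = window_size - overlap
    return [times[i:i + window_size] for i in range(0, len(times) - overlap, step)]

def align_clusters(prev_clusters, next_clusters, min_shared=1):
    #link clusters of consecutive windows that share documents from the overlapping period
    links = []
    for prev_id, prev_docs in prev_clusters.items():
        for next_id, next_docs in next_clusters.items():
            shared = len(set(prev_docs) & set(next_docs))
            if shared >= min_shared:
                links.append((prev_id, next_id, shared))
    return links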

Installation

ANTM can be installed from PyPI with:

pip install antm

Quick Start

As implemented in the example notebook, you can quickly start extracting evolving topics from the DBLP dataset of computer science articles.

To Fit and Save a Model

from antm import ANTM
import pandas as pd

# load data
df=pd.read_parquet("./data/dblpFullSchema_2000_2020_extract_big_data_2K.parquet")
df=df[["abstract","year"]].rename(columns={"abstract":"content","year":"time"})
df=df.dropna().sort_values("time").reset_index(drop=True).reset_index()

# choose the window size and overlap length for the time frames
window_size = 6
overlap = 2

#initialize model
model=ANTM(df,overlap,window_size,umap_n_neighbors=10, partioned_clusttering_size=5,mode="data2vec",num_words=10,path="./saved_data")

#fit the model and save it
topics_per_period=model.fit(save=True)
#the output is a list of time frames, each containing the topics associated with that period
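
The returned structure can be inspected directly. A minimal sketch, assuming one list entry per time frame as described in the comment above:

#inspect the output: one entry per time frame (exact inner structure may vary)
for period_idx, topics in enumerate(topics_per_period):
    print("time frame", period_idx, "->", len(topics), "topics")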

To Load a Model

from antm import ANTM
import pandas as pd

# load data
df=pd.read_parquet("./data/dblpFullSchema_2000_2020_extract_big_data_2K.parquet")
df=df[["abstract","year"]].rename(columns={"abstract":"content","year":"time"})
df=df.dropna().sort_values("time").reset_index(drop=True).reset_index()

# choose the window size and overlap length for the time frames
window_size = 6
overlap = 2
#initialize model
model=ANTM(df,overlap,window_size,mode="data2vec",num_words=10,path="./saved_data")
topics_per_period=model.load()

Plug-and-Play Functions

#saves plots of all the evolving topics
model.save_evolution_topics_plots(display=False)

#plots a random evolving topic with 2-dimensional document representations
model.random_evolution_topic()

#plots partitioned clusters for each time frame
model.plot_clusters_over_time()

#plots all the evolving topics
model.plot_evolving_topics()

Topic Quality Metrics

#returns pairwise Jaccard diversity for each period
model.get_periodwise_pairwise_jaccard_diversity()

#returns proportion of unique words (PUW) diversity for each period
model.get_periodwise_puw_diversity()

#returns topic coherence for each period
model.get_periodwise_topic_coherence(model="c_v")
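
To summarize a run with a single number per metric, the per-period scores can be averaged. A minimal sketch, assuming each getter returns an iterable with one score per period:

#average the per-period scores (assumes each call returns one numeric score per period)
coherence_scores = model.get_periodwise_topic_coherence(model="c_v")
diversity_scores = model.get_periodwise_puw_diversity()
print("mean c_v coherence:", sum(coherence_scores) / len(coherence_scores))
print("mean PUW diversity:", sum(diversity_scores) / len(diversity_scores))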

Datasets

arXiv articles

DBLP articles

Elon Musk's Tweets

New York Times News

Experiments

You can use the notebooks provided in "./experiments" to run ANTM on other sequential datasets, as sketched below.
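
For a new corpus, the main requirement is the same DataFrame layout used in the quick start: a "content" column holding the documents and a numeric "time" column, sorted by time and re-indexed. A minimal sketch (the file name and original column names are hypothetical):

from antm import ANTM
import pandas as pd

#load your own corpus; "text" and "year" are placeholder column names
df = pd.read_csv("my_corpus.csv")
df = df[["text", "year"]].rename(columns={"text": "content", "year": "time"})
df = df.dropna().sort_values("time").reset_index(drop=True).reset_index()

#overlap=2 and window_size=6, passed positionally as in the quick start
model = ANTM(df, 2, 6, mode="data2vec", num_words=10, path="./saved_data")
topics_per_period = model.fit(save=True)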

Citation

To cite ANTM, please use the following BibTeX entry:

@misc{rahimi2023antm,
      title={ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics}, 
      author={Hamed Rahimi and Hubert Naacke and Camelia Constantin and Bernd Amann},
      year={2023},
      eprint={2302.01501},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

