Text Dataset Categories/Clusters Visualization with Word Clouds

Project description

Faced with a collection of texts, sorted into the categories “C1”-“C23” and no idea what those could be? Got a dump of text documents and need to figure out what they are about and which of those you should have a closer look at? Code is here to help!

This repository contains tools that give a quick overview of a text dataset by creating word clouds of the relevant words for each class or identified cluster, as well as code to highlight these words in the individual texts, e.g. to better understand classifier decisions. Further details can be found in the corresponding paper (short and long).

If any of this code was helpful for your research, please consider citing it:

@article{horn2017exploring,
  title     = {Exploring text datasets by visualizing relevant words},
  author    = {Horn, Franziska and Arras, Leila and Montavon, Gr{\'e}goire and M{\"u}ller, Klaus-Robert and Samek, Wojciech},
  journal   = {arXiv preprint arXiv:1707.05261},
  year      = {2017}
}

or

@article{horn2017discovering,
  title     = {Discovering topics in text datasets by visualizing relevant words},
  author    = {Horn, Franziska and Arras, Leila and Montavon, Gr{\'e}goire and M{\"u}ller, Klaus-Robert and Samek, Wojciech},
  journal   = {arXiv preprint arXiv:1707.06100},
  year      = {2017}
}

The code is intended for research purposes. It was written for Python 2.7 but should, in principle, also run on newer Python 3 versions; no guarantees on this, though (please open an issue if you find a bug)!

quick start

To install, either download the code from here and include the textcatvis folder in your $PYTHONPATH or install (the library components only) via pip:

$ pip install textcatvis

If your text data is a collection of .txt files, either in a single folder or in multiple folders (for texts already sorted into categories), you can call the script analyze_relevantwords.py with the path to the folder (or to the parent directory of the category folders) to load the data and create word clouds for it.
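To make the expected folder layout concrete, here is a minimal sketch (Python 3, standard library only) that reads such a directory into two dictionaries. The function name load_txt_dataset and the returned format are assumptions made for illustration; they are not the textcatvis API, and the loader in data_utils.py may organize the data differently.

```python
import os
import tempfile


def load_txt_dataset(root):
    """Read every .txt file under `root` into two dicts keyed by document id.

    Subfolders of `root` are treated as categories; .txt files lying directly
    in `root` get the category None (an unsorted collection).
    (Hypothetical helper, not part of textcatvis.)
    """
    textdict, doccat = {}, {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        category = None if rel == "." else rel.split(os.sep)[0]
        for name in sorted(filenames):
            if not name.endswith(".txt"):
                continue
            # Document id = path relative to the dataset root
            docid = os.path.normpath(os.path.join(rel, name))
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                textdict[docid] = f.read()
            doccat[docid] = category
    return textdict, doccat


# Build a tiny throwaway dataset with one subfolder per category.
root = tempfile.mkdtemp()
for category, name, text in [
    ("C1", "a.txt", "tumor cells and therapy"),
    ("C2", "b.txt", "election results and polls"),
]:
    os.makedirs(os.path.join(root, category))
    with open(os.path.join(root, category, name), "w", encoding="utf-8") as f:
        f.write(text)

textdict, doccat = load_txt_dataset(root)
print(sorted(doccat.items()))
```

With a single flat folder of .txt files, the same function returns None as the category of every document, which corresponds to the unsorted case described above.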

textcatvis library components

dependencies: numpy, scipy, matplotlib, sklearn, wordcloud, nlputils

  • data_utils.py: contains a function to load a text dataset (organized in a folder with subdirectories for each class containing .txt documents) in the form required by the other functions.

  • cluster.py: contains a function to cluster a collection of text documents with the DBSCAN algorithm from sklearn.

  • check_query.py: contains functions to formulate queries and check how often a term occurs in texts of a given category.

  • vis_utils.py: contains functions to create the word clouds and highlight relevant words in individual texts.

  • distinctive_words.py: contains code to examine a text dataset and identify “distinctive words” by comparing how often a word occurs in one category with how often it occurs in all others.

  • visualize_relevantwords.py: contains three functions to generate word clouds and highlight words in individual documents, based on tf-idf features, distinctive words, or the classification scores obtained with a linear SVM.
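The idea behind “distinctive words” can be illustrated with a small, self-contained sketch (Python 3). It is a simplification written for this description, not the scoring implemented in distinctive_words.py: a word's score for a category is assumed here to be the share of that category's documents containing the word minus the highest such share in any other category. The function name, the regular-expression tokenizer and the top_n parameter are likewise hypothetical.

```python
import re
from collections import defaultdict


def distinctive_words(textdict, doccat, top_n=3):
    """Rank words per category by (own document share - best other share).

    Simplified illustration only; not the textcatvis implementation.
    """
    categories = sorted(set(doccat.values()))
    n_docs = defaultdict(int)                         # category -> number of documents
    doc_freq = defaultdict(lambda: defaultdict(int))  # word -> category -> document count
    for docid, text in textdict.items():
        cat = doccat[docid]
        n_docs[cat] += 1
        # Count each word at most once per document
        for word in set(re.findall(r"[a-z]+", text.lower())):
            doc_freq[word][cat] += 1

    result = {}
    for cat in categories:
        scores = {}
        for word, counts in doc_freq.items():
            own = counts.get(cat, 0) / n_docs[cat]
            other = max(
                (counts.get(c, 0) / n_docs[c] for c in categories if c != cat),
                default=0.0,
            )
            if own > other:  # keep only words that favor this category
                scores[word] = own - other
        # Highest score first; ties broken alphabetically
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[cat] = ranked[:top_n]
    return result


textdict = {
    "d1": "the tumor cells respond to therapy",
    "d2": "therapy reduces the tumor",
    "d3": "the election polls are close",
    "d4": "voters decide the election",
}
doccat = {"d1": "cancer", "d2": "cancer", "d3": "politics", "d4": "politics"}

result = distinctive_words(textdict, doccat)
for cat, words in result.items():
    print(cat, words)
```

A word such as “the”, which appears equally often in both categories, receives no score, while “tumor” and “election” come out on top for their respective categories. Scores of this kind are what a word cloud would use as word sizes.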

examples

  • analyze_relevantwords.py: can be called with a path to a dataset to carry out the analysis for this dataset, i.e. create word clouds for different classes etc.

  • experiments_cancer.py: tests the above-mentioned tools on the cancer papers dataset to create the results reported in the paper. (You need to download this dataset first.)

  • experiments_nytimes.py: tests the above-mentioned tools on articles downloaded with the NYTimes API. (Make sure you have an API key stored in nytimes_apikey.txt.)

If you have any questions, please don’t hesitate to send me an email. And of course, if you find any bugs or want to contribute other improvements, pull requests are very welcome!
