scattertext

An NLP library to help find interesting terms in small to medium-sized corpora.

# Scattertext

A tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in a sexy, interactive scatter plot with non-overlapping term labels. Exploratory data analysis just got more fun.

## Installation

### Minimal

$ pip install scattertext

### Full

$ pip install scattertext spacy mpld3

To use `ScatterChart.draw`, matplotlib and mpld3 must be installed. spaCy is highly recommended for parsing.
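To see which of these optional packages are importable in the current environment, a small check like the one below may help. It is a minimal sketch that uses only the standard library; the helper name `optional_dependency_status` is hypothetical and is not part of Scattertext.

```python
from importlib.util import find_spec

# Optional packages named in the install notes above.
OPTIONAL_PACKAGES = ("matplotlib", "mpld3", "spacy")


def optional_dependency_status(packages=OPTIONAL_PACKAGES):
    """Map each top-level package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, available in optional_dependency_status().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

Any package reported as missing can be added with the `pip install` command from the Full section.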

## Quickstart

See the Jupyter [notebook](https://jasonkessler.github.io/Subjective%2Bvs.%2BObjective.html) for a tutorial on using Scattertext to find language that distinguishes subjective and objective descriptions of movies.

## Introduction

This tool is intended for visualizing which words and phrases are more characteristic of one category than of others.
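To make "more characteristic of a category" concrete, here is a minimal sketch that scores terms with a smoothed log-odds ratio between two categories. This is an illustrative stand-in, not one of Scattertext's own term importance algorithms; the function names and the toy documents are invented for the example.

```python
import math
from collections import Counter


def term_counts(documents):
    """Count lowercase whitespace-separated tokens across a list of strings."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def log_odds_scores(docs_a, docs_b, smoothing=1.0):
    """Score each term: positive leans toward category A, negative toward B.

    Illustrative smoothed log-odds ratio; not Scattertext's own metric.
    """
    counts_a, counts_b = term_counts(docs_a), term_counts(docs_b)
    vocab = set(counts_a) | set(counts_b)
    # Add-k smoothing so terms unseen in one category keep a nonzero probability.
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    scores = {}
    for term in vocab:
        p_a = (counts_a[term] + smoothing) / total_a
        p_b = (counts_b[term] + smoothing) / total_b
        scores[term] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# Toy documents invented for this example.
category_a = ["jobs jobs economy", "economy growth jobs"]
category_b = ["health care economy", "care for care"]

scores = log_odds_scores(category_a, category_b)
ranked = sorted(scores, key=scores.get, reverse=True)
print("most characteristic of A:", ranked[0])
print("most characteristic of B:", ranked[-1])
```

Terms shared by both categories, such as "economy" here, end up with scores near zero, while terms concentrated in one category move toward the extremes.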

The documentation (including this readme) is a work in progress. Do look through the Jupyter [notebook](https://jasonkessler.github.io/Subjective%2Bvs.%2BObjective.html) and the test suite for instructions on how to use this package.

There are term importance algorithms that have been implemented in this library that are not available anywhere else. Feel free to poke around, make suggestions, and ask any questions while I figure out the docs.

In the meantime, here’s an example of one of the things the tool can do: a scatter chart showing language differences between Democratic and Republican speakers at the 2012 American political conventions. Click [here](https://jasonkessler.github.io/fig.html) for an interactive version, and check out a [pure-D3 version](https://jasonkessler.github.io/demo.html) with fancy non-overlapping word annotations.

Please see the Jupyter [notebook](https://jasonkessler.github.io/Subjective%2Bvs.%2BObjective.html) for a tutorial on using Scattertext to find language that distinguishes subjective and objective descriptions of movies.

![Differences in 2012 American Political Convention Speeches](https://raw.githubusercontent.com/JasonKessler/text-to-ideas/master/2012_conventions.png)
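As a rough picture of how a chart like this can be laid out, the sketch below assigns each term an (x, y) position from its frequency rank in each of two categories. It assumes each axis corresponds to one category and that positions are dense ranks scaled to [0, 1]; these are assumptions for illustration, and the function names and toy tokens are hypothetical rather than taken from Scattertext.

```python
from collections import Counter


def scaled_ranks(counts, vocab):
    """Map each term to a dense rank of its count, scaled to [0, 1]."""
    distinct = sorted({counts[term] for term in vocab})
    if len(distinct) == 1:
        # Every term has the same count, so place them all mid-axis.
        return {term: 0.5 for term in vocab}
    position = {count: i / (len(distinct) - 1) for i, count in enumerate(distinct)}
    return {term: position[counts[term]] for term in vocab}


def scatter_coordinates(tokens_x, tokens_y):
    """Return {term: (x, y)} from two token lists, one per category."""
    counts_x, counts_y = Counter(tokens_x), Counter(tokens_y)
    vocab = sorted(set(counts_x) | set(counts_y))
    x, y = scaled_ranks(counts_x, vocab), scaled_ranks(counts_y, vocab)
    return {term: (x[term], y[term]) for term in vocab}


# Toy token lists invented for this example.
coords = scatter_coordinates(
    "jobs jobs jobs economy economy growth".split(),
    "care care care economy health".split(),
)
for term, (x, y) in coords.items():
    print(f"{term:8s} x={x:.2f} y={y:.2f}")
```

Under this layout, terms near the lower-right or upper-left corner are used mostly by one category, and terms near the diagonal are used at similar rates by both.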

## Technical Underpinnings

Please see this [deck](https://www.slideshare.net/JasonKessler/turning-unstructured-content-into-kernels-of-ideas) for an introduction to the metrics and algorithms used.

## Sources

* Political data: scraped from [here](http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html?_r=0)
* count_1w: Peter Norvig assembled this file (downloaded from [norvig.com](http://norvig.com/ngrams/count_1w.txt)). See http://norvig.com/ngrams/ for an explanation of how it was gathered from a very large corpus.
* hamlet.txt: William Shakespeare. From [shakespeare.mit.edu](http://shakespeare.mit.edu/hamlet/full.html)
* Inspiration for text scatter plots: Rudder, C. (2014). Dataclysm: Who we are when we think no one’s looking.

This version: 0.0.1.8.3, released Aug 18, 2016

Download files

Source Distribution

scattertext-0.0.1.8.3.tar.gz (5.0 MB)

Uploaded Aug 18, 2016 Source

Built Distributions

scattertext-0.0.1.8.3-py3-none-any.whl (4.9 MB)

Uploaded Aug 18, 2016 Python 3

scattertext-0.0.1.8.3-py2-none-any.whl (4.9 MB)

Uploaded Aug 18, 2016 Python 2

Hashes for scattertext-0.0.1.8.3.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `0f820b2f7904f218cc43dcb08b55e992150cdea97c2b74d5a4a9217c60943017` |
| MD5 | `db8041ac70aaed359cb7f1f8c7abfc97` |
| BLAKE2b-256 | `6c8ab07c7ad6cb6ed2bdb8abcd445f897e7cc4cd7ffa830b64d24ff96d0a3027` |

Hashes for scattertext-0.0.1.8.3-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `d80fc84d3b8b0765d229f37defe7d5364bdde15fd1c2d41d2c87c8fac3676025` |
| MD5 | `d55c2f2fe8a445b89ca7bc99318129f0` |
| BLAKE2b-256 | `97994b88de0058eb4a69f1b1d209f4746e0a18a8d75cb8aaf6b69b649f44c7b2` |

Hashes for scattertext-0.0.1.8.3-py2-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4c2d42b3c9770cc32b76852429793db24cc96969cf02e1cbd0f98cb38646943c` |
| MD5 | `9d1ba57c6a43b9e386753153d4ea6766` |
| BLAKE2b-256 | `9eb4c2f2d6ac42c99cf2c00477125e63068d84d4f9f61d689d53421526f9f3c0` |
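
The SHA256 digests listed above can be compared against a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the file path passed in is whatever local path the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling `matches_published_digest` with the path of the downloaded `scattertext-0.0.1.8.3.tar.gz` and the SHA256 value listed for that file should return `True` when the download is intact.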