Explicit Semantic Analysis
ESA-Wiki
Explicit Semantic Analysis based on Wikipedia
This is a Python library containing code to 1) construct a semantic interpreter from Wikipedia data and 2) apply that interpreter to various kinds of text.
To construct an interpreter, first obtain a Wikipedia XML dump from http://dumps.wikimedia.org/enwiki/
Then run
python3 -m esa_wiki.xml_parse <file>
with the downloaded file as its argument. This outputs some temporary files containing information on the words, links and articles encountered.
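For readers unfamiliar with the dump format, here is a minimal sketch of how a MediaWiki XML export can be read one page at a time with the standard library. It is an illustration only, not the code in esa_wiki.xml_parse; the names iter_articles, tokenize and SAMPLE are hypothetical, and the simple lowercase-letters tokenizer is an assumption.

```python
import io
import re
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield (title, text) pairs from a MediaWiki XML export, one page at a time."""
    title, text = None, ""
    for _, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix, if any
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # release the finished page so memory use stays bounded


def tokenize(text):
    """Assumed tokenizer: lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Cat</title><revision><text>Cats chase mice.</text></revision></page>
  <page><title>Dog</title><revision><text>Dogs chase cats.</text></revision></page>
</mediawiki>"""

pages = [(title, tokenize(text)) for title, text in iter_articles(io.BytesIO(SAMPLE))]
```

Streaming with iterparse matters here because a full English Wikipedia dump is far too large to load as a single tree.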
Next, run
python3 -m esa_wiki.generate_indices
to generate lists of indices corresponding to the unique words and articles encountered.
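The idea of an index list can be sketched as follows: every distinct word and every article gets an integer, which later serves as a row or column number. This is an assumed illustration, not the behaviour of esa_wiki.generate_indices; build_index and the toy docs are hypothetical.

```python
def build_index(items):
    """Map each distinct item to an integer, in order of first appearance."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))  # existing items keep their number
    return index


# Hypothetical toy corpus: article title -> list of tokens.
docs = {
    "Cat": ["cats", "chase", "mice"],
    "Dog": ["dogs", "chase", "cats"],
}

article_index = build_index(docs)
word_index = build_index(word for words in docs.values() for word in words)
```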
Finally, run
python3 -m esa_wiki.matrix_builder
to construct a very large sparse interpretation matrix. Each row corresponds to a unique word, each column to a 'concept', i.e. a Wikipedia article, and each entry is the TF-IDF score for word i in article j. The matrix is saved in separate chunks to conserve memory.
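To make the matrix layout concrete, the sketch below computes TF-IDF entries for a toy corpus and stores only the non-zero ones, keyed by (word row, article column). The weighting used here, raw term count times log(N / document frequency), is one common variant and an assumption; the library may weight differently. The function tfidf_entries and the toy docs are hypothetical, and chunked saving is not shown.

```python
import math
from collections import Counter


def tfidf_entries(docs, word_index, article_index):
    """Return {(word_row, article_col): score} for the non-zero entries only."""
    n_docs = len(docs)
    # Document frequency: the number of articles in which each word occurs.
    df = Counter(word for words in docs.values() for word in set(words))
    entries = {}
    for title, words in docs.items():
        j = article_index[title]
        for word, count in Counter(words).items():
            score = count * math.log(n_docs / df[word])
            if score:  # a word found in every article scores 0 and is not stored
                entries[(word_index[word], j)] = score
    return entries


# Hypothetical toy corpus: article title -> list of tokens.
docs = {
    "Cat": ["cats", "chase", "mice"],
    "Dog": ["dogs", "chase", "cats"],
    "Mouse": ["mice", "eat", "cheese"],
}
vocabulary = sorted({word for words in docs.values() for word in words})
word_index = {word: i for i, word in enumerate(vocabulary)}
article_index = {title: j for j, title in enumerate(docs)}

entries = tfidf_entries(docs, word_index, article_index)
```

Keeping only non-zero entries is what makes the matrix sparse: most words never appear in most articles, so the vast majority of cells are empty.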
medium_wiki.xml can be used as an example file for demonstration/testing purposes, as it contains only the first 100 or so Wikipedia articles.
cunning_linguistics.py contains classes to perform text analysis and to harvest tweets for analysis.
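As a rough picture of what applying the interpreter to a text means in Explicit Semantic Analysis, the sketch below sums the matrix rows of the words in a text to get a concept vector, then compares two texts by cosine similarity. It does not use the classes in cunning_linguistics.py; ROWS, interpret and cosine are hypothetical names, and the weights are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy word -> {concept: weight} rows; the weights are invented for illustration.
ROWS = {
    "cats": {"Cat": 0.9, "Dog": 0.2},
    "mice": {"Cat": 0.4, "Mouse": 0.8},
    "dogs": {"Dog": 0.9},
    "cheese": {"Mouse": 0.7},
}


def interpret(text):
    """Sum the rows of the words in `text`, weighted by how often each occurs."""
    vector = defaultdict(float)
    for word, count in Counter(text.lower().split()).items():
        for concept, weight in ROWS.get(word, {}).items():  # unknown words add nothing
            vector[concept] += count * weight
    return dict(vector)


def cosine(u, v):
    """Cosine similarity of two sparse concept vectors; 0.0 if either is empty."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


related = cosine(interpret("cats"), interpret("cats mice"))
unrelated = cosine(interpret("cats"), interpret("cheese"))
```

Two texts that share no words can still score above zero when their words map to overlapping concepts, which is the main appeal of comparing texts in concept space rather than word space.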