# extra-model

TODO: update badges

Code to run the Extra algorithm for unsupervised topic extraction.
## Quick start

IMPORTANT:
- When running Extra inside a Docker container, make sure that the Docker process has enough resources. For example, on Mac/Windows it should have at least 8 GB of RAM available to it.
- The GitHub repo does not come with GloVe embeddings. See the next section for how to download the required embeddings.
## Downloading embeddings

This package does not come with the required GloVe embeddings; they must be downloaded before use.

To download the required embeddings, run the following command:

```bash
docker-compose run --rm setup
```

The embeddings will be downloaded, unzipped, and formatted into a space-efficient format. Files will be saved in the `embeddings/` directory in the root of the project directory. If the process fails, it can be safely restarted. If you want to restart the process with new files, delete all files except `README.md` in the `embeddings/` directory.
## Using docker-compose

First, build the image:

```bash
docker-compose build
```

Then running `extra-model` is as simple as:

```bash
docker-compose run extra-model /package/tests/resources/100_comments.csv
```

NOTE: when using this approach, the input file should be mounted inside the container. By default, everything from the `extra-model` folder is mounted to the `/package/` folder. This can be changed in `docker-compose.yaml`.

This will produce a `result.csv` file in the `/io/` folder (the default setting). The location of the output can be changed by supplying a second path, e.g.:

```bash
docker-compose run extra-model /package/tests/resources/100_comments.csv /io/another_folder
```
## Using the command line

TODO: add this section.
## extra-model input

The input to `extra-model` is a `.csv` file with 2 columns: `CommentId` and `Comments`. Both must be present and named exactly that way.
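For illustration, a minimal input file could look like this (the comment texts below are invented examples, not taken from the package's test data):

```csv
CommentId,Comments
1,"The bathrooms were very clean."
2,"Nice place, and it was beautifully decorated."
```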
## extra-model output

After `extra-model` finishes its calculations, it'll produce a `.csv` file with the following structure:

```csv
AdCluster,Aspect,AspectCount,CommentId,Descriptor,Position,SentimentBinary,SentimentCompound,Topic,TopicCount,TopicImportance,TopicSentimentBinary,TopicSentimentCompound,WordnetNode
only,downside,1,321,only,9,0.0,0.0,downside.n.01,1,0.005572645018795278,0.0,0.0,downside.n.01
more,nothing,1,74,more,54,0.0,0.0,nothing.n.01,1,0.005572645018795278,0.0,0.0,nothing.n.01
clean,bathrooms,1,146,clean,4,1.0,0.4019,toilet.n.01,1,0.005572645018795278,1.0,0.4019,toilet.n.01
decorated,place,5,146,decorated,32,0.0,0.0,home.n.01,6,0.03343587011277168,0.0,-0.01131666666666666,home.n.01
```
Columns have the following meaning:

Column | Description
---|---
AdCluster | Adjectives are clustered together and this indicates the "center" of a cluster (e.g., "awesome", "fantastic", "great" descriptors might produce "great" as `AdCluster`)
Aspect | Identified aspect - the actual word that a person wrote in the text
AspectCount | How often this aspect was found in all of the input
CommentId | ID of an input. Since one input may produce multiple aspects, the ID column must always be present
Descriptor | Identified adjective (not clustered) - the actual word that a person wrote in the text
Position | Character position where the aspect was found (e.g., "nice shirt" will have aspect "shirt" and Position 6)
SentimentBinary | Binary sentiment for the aspect
SentimentCompound | Compound sentiment for the aspect
Topic | Collection of aspects
TopicCount | How often the topic was found in the input
TopicImportance | Importance of the topic
TopicSentimentBinary | Similar to the aspect-level value, but on the topic level
TopicSentimentCompound | Similar to the aspect-level value, but on the topic level
WordnetNode | Mapping to a wordnet node. Identifiers of the form `.n.01` mean the first meaning of the noun in wordnet
## Extra workflow

The workflow follows the algorithm suggested in the paper and has the following stages:

### Filtering (`_filter.py`)

Gets rid of cruft in the input data:
- removes empty text fields
- requires at least 20 characters of text
- removes unprintable unicode characters
- filters for English language using Google's `cld2` tool
### Generate aspects (`_aspects.py`)

Extracts promising phrases (i.e., nouns described by adjectives) using `spacy`.
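The extraction itself lives in `_aspects.py`; as a rough, simplified sketch of the idea (not the package's actual code — the model name and output format here are illustrative), pulling adjective-noun pairs out of a sentence with `spacy` might look like:

```python
import spacy

# Illustrative sketch only - the real logic in _aspects.py is more involved.
nlp = spacy.load("en_core_web_sm")

def extract_aspects(text):
    """Yield (aspect, descriptor, character position) for nouns modified by adjectives."""
    doc = nlp(text)
    for token in doc:
        # An adjectival modifier ("amod") attached to a noun gives a candidate pair.
        if token.dep_ == "amod" and token.head.pos_ == "NOUN":
            yield token.head.text, token.text, token.head.idx

print(list(extract_aspects("The hotel had clean bathrooms and a nice pool.")))
# e.g. [('bathrooms', 'clean', 20), ('pool', 'nice', 41)]
```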
### Aggregate aspects into topics (`_topics.py`)

Takes the output of the phrase extraction, maps the phrases to `wordnet` (via `_disambiguate.py`), and produces the list of clustered aspects.

Important dependencies:
- `sklearn` for clustering
- `nltk` for `wordnet`
- `networkx` for the semantic tree
- pretrained word-vectors (via `_vectorizer.py`)
- `vaderSentiment` for sentiment analysis
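To make the mapping concrete, here is a deliberately naive sketch of what looking up a wordnet node and scoring a descriptor's sentiment could look like with these dependencies. The real `_disambiguate.py` performs proper sense disambiguation, which this sketch skips by taking the first noun synset, and the binarization rule shown is an assumption, not necessarily the package's:

```python
import nltk
from nltk.corpus import wordnet as wn
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nltk.download("wordnet")  # one-time corpus download

def map_to_wordnet(aspect):
    """Naive stand-in for _disambiguate.py: just take the first noun synset."""
    synsets = wn.synsets(aspect, pos=wn.NOUN)
    return synsets[0].name() if synsets else None

analyzer = SentimentIntensityAnalyzer()

def score_sentiment(text):
    """Compound score from VADER, plus an assumed sign-based binarization."""
    compound = analyzer.polarity_scores(text)["compound"]
    binary = 1.0 if compound > 0 else -1.0 if compound < 0 else 0.0
    return binary, compound

print(map_to_wordnet("bathroom"))  # 'bathroom.n.01'
print(score_sentiment("clean"))    # (1.0, 0.4019), matching the example output above
```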
### Analyze descriptors (`_adjectives.py`)

Clusters the associated adjectives using constant-radius clustering.
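The package's exact routine isn't reproduced here, but the core idea of constant-radius clustering can be sketched greedily: each unassigned vector seeds a cluster and absorbs everything within a fixed radius. The radius value and toy vectors below are made-up illustrations:

```python
import numpy as np

def constant_radius_clusters(vectors, radius=0.5):
    """Greedy sketch of constant-radius clustering (illustrative, not the package's code)."""
    labels = [-1] * len(vectors)  # -1 means "not yet assigned"
    cluster = 0
    for i, seed in enumerate(vectors):
        if labels[i] != -1:
            continue
        labels[i] = cluster  # this vector becomes the "center" of a new cluster
        for j in range(i + 1, len(vectors)):
            if labels[j] == -1 and np.linalg.norm(vectors[j] - seed) <= radius:
                labels[j] = cluster
        cluster += 1
    return labels

# Toy 1-D "embeddings" for three descriptors; the first two are close together.
vecs = np.array([[0.0], [0.1], [3.0]])
print(constant_radius_clusters(vecs))  # [0, 0, 1]
```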
### Link information (`_summarize.py`)

To make the output more useful, we want to link the topics back to the original texts and vice versa. The whole pipeline produces one csv file.
## CI

TODO: update this section

This project comes with a GitHub Actions pipeline definition.
## Develop

First, please install Docker on your computer. Docker must be running correctly for these commands to work.

*If you are using Windows, please make sure your editor writes files with linefeed (`\n`) line endings.*

Next, clone the repo:

TODO: update this

```bash
git clone
cd extra-model
```

Then run the test suite to see if Docker is set up correctly:

```bash
docker-compose run test
```
## Testing

You'll be unable to merge code unless the linting and tests pass. You can run these in your container via `docker-compose run test`.

The tests, linting, and code coverage are run automatically via CI, and you'll see the output on your pull requests.

Generally, we should endeavor to write tests for every feature. Every new feature branch should increase the test coverage rather than decrease it.

We use pytest as our testing framework. To test/lint your project, you can run `docker-compose run test`.
## Stages

TODO: update this

To customize/override a specific testing stage, please read the documentation specific to that tool:
- pytest: https://docs.pytest.org/en/latest/contents.html
- Black: https://black.readthedocs.io/en/stable/
- Flake8: http://flake8.pycqa.org/en/latest/
- Bandit: https://bandit.readthedocs.io/en/latest/
- isort: https://pycqa.github.io/isort/
- pydocstyle: http://www.pydocstyle.org/en/stable/
## Documentation

TODO: change to sphinx site:

Check out the project documentation.

## Authors

`extra-model` was written by mbalyasin@wayfair.com and mmozer@wayfair.com.