A framework for performing topic modelling
Topic Modeling API
This API is built to dynamically perform training, inference, and evaluation for different topic modeling techniques. The API provides common interfaces and commands for accessing the different models, making it easier to compare them.
A demo is available at http://hyperted.eurecom.fr/topic.
Models
In this repository, we provide:
- Code to perform training, inference, and evaluation for 9 Topic Modeling packages:
- A set of pre-trained models, downloadable from here. **NOTE: Newly trained models are by default stored in .\models, replacing the old ones, unless a new model path is given**
- Data files containing pre-processed corpora:
  - 20ng.txt and 20ng_labels.txt, with 11314 news items from the 20 NewsGroup dataset
  - ted.txt with 51898 subtitles of TED Talks
  - test.txt and test_labels.txt, an extraction of 30 documents from 20_ng.txt, used for testing purposes
Each model exposes the following functions:
Training the model
m.train(data, num_topics, preprocessing) # => 'success'
Print the list of computed topics
for i, x in enumerate(m.topics):
    print(f'Topic {i}')
    for word, weight in zip(x['words'], x['weights']):
        print(f'- {word} => {weight}')
Access the info about a specific topic
x = m.topic(0)
words = x['words']
weights = x['weights']
Access the predictions computed on the training corpus
for i, p in enumerate(m.get_corpus_predictions(topn=3)): # predictions for each document
    print(f'Predictions on document {i}')
    for topic, confidence in p:
        print(f'- Topic {topic} with confidence {confidence}')
        # - Topic 21 with confidence 0.03927058187976461
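As a rough illustration of how these predictions might be consumed, the sketch below counts how many documents have each topic as their top prediction. The `corpus_predictions` list is made-up sample data standing in for the output of `m.get_corpus_predictions(topn=3)`, and `top_topic_counts` is a hypothetical helper, not part of the library. It only assumes that each document yields `(topic, confidence)` pairs, as in the loop above.

```python
from collections import Counter

# Made-up sample data: one list of (topic, confidence) pairs per document.
corpus_predictions = [
    [(21, 0.39), (3, 0.12), (7, 0.05)],
    [(3, 0.41), (21, 0.20), (7, 0.02)],
    [(21, 0.33), (7, 0.31), (3, 0.10)],
]


def top_topic_counts(predictions):
    """Count how many documents have each topic as best prediction.

    Hypothetical helper, not part of tomodapi.
    """
    counts = Counter()
    for doc_predictions in predictions:
        if not doc_predictions:
            continue  # skip documents without any prediction
        # Take the most confident pair without assuming the list is sorted.
        best_topic, _ = max(doc_predictions, key=lambda pair: pair[1])
        counts[best_topic] += 1
    return counts


counts = top_topic_counts(corpus_predictions)
print(counts.most_common())
```

Using `max` rather than the first element keeps the helper correct even if the pairs are not returned in descending order of confidence.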
Predict the topic of a new text
pred = m.predict(text, topn=3)
for topic, confidence in pred:
    print(f'- Topic {topic} with confidence {confidence}')
    # - Topic 21 with confidence 0.03927058187976461
Computing the coherence against a corpus
# coherence: Type of coherence to compute, among <c_v, c_npmi, c_uci, u_mass>. See https://radimrehurek.com/gensim/models/coherencemodel.html#gensim.models.coherencemodel.CoherenceModel
pred = m.coherence(mycorpus, metric='c_v')
print(pred)
#{
# "c_v": 0.5186710138972105,
# "c_v_std": 0.1810477961008996,
# "c_v_per_topic": [
# 0.5845048872767505,
# 0.30693460230781777,
# 0.2611738203246824,
# ...
# ]
#}
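The per-topic scores can help spot weak topics. The following is a minimal sketch, assuming the returned dictionary uses the `<metric>_per_topic` key pattern seen in the `c_v` output above. `rank_topics` is a hypothetical helper, and the sample dictionary contains only the three per-topic values shown in that output.

```python
# Sample result limited to the three per-topic values shown in the output above.
result = {
    "c_v_per_topic": [
        0.5845048872767505,
        0.30693460230781777,
        0.2611738203246824,
    ],
}


def rank_topics(result, metric="c_v"):
    """Return topic indices ordered from least to most coherent.

    Hypothetical helper; assumes the '<metric>_per_topic' key pattern.
    """
    per_topic = result[f"{metric}_per_topic"]
    return sorted(range(len(per_topic)), key=lambda i: per_topic[i])


weakest_first = rank_topics(result)
print(weakest_first)
```

Topics at the start of the list are candidates for inspection with `m.topic(index)`.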
Evaluating against a ground truth
# metric: Metric for computing the evaluation, among <purity, homogeneity, completeness, v-measure, nmi>.
res = m.get_corpus_predictions(topn=1)
v = m.evaluate(res, ground_truth_labels, metric='purity')
# 0.7825333630516738
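To give an intuition for the `purity` metric, here is a standalone sketch of the standard definition: each predicted topic is credited with its most frequent ground-truth label, and the credited documents are divided by the total. This is not the library's implementation, and the `purity` function and the sample labels are illustrative assumptions.

```python
from collections import Counter, defaultdict


def purity(predicted_topics, true_labels):
    """Standard purity: share of documents matching the majority label
    of the topic they were assigned to (illustrative, not tomodapi code)."""
    if len(predicted_topics) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    labels_per_topic = defaultdict(Counter)
    for topic, label in zip(predicted_topics, true_labels):
        labels_per_topic[topic][label] += 1
    # For each topic, count the documents carrying its most common label.
    majority_total = sum(
        label_counts.most_common(1)[0][1]
        for label_counts in labels_per_topic.values()
    )
    return majority_total / len(true_labels)


# Made-up example: six documents, three predicted topics.
predicted = [0, 0, 0, 1, 1, 2]
truth = ["sport", "sport", "tech", "tech", "tech", "space"]
score = purity(predicted, truth)
print(score)
```

In this example only one document (the `tech` item in topic 0) disagrees with the majority label of its topic, so five out of six documents are credited.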
The possible parameters can differ depending on the model.
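Since the models share these functions, a comparison can be written as a plain loop. The sketch below is only an illustration: `compare_models` and `StubModel` are hypothetical names, the stub stands in for real tomodapi models so the snippet runs without any trained model, and passing a list of strings as the corpus is an assumption. It relies only on `train` and `coherence` as described above.

```python
def compare_models(models, corpus, num_topics=10, metric="c_v"):
    """Train each model and rank them by overall coherence (hypothetical helper)."""
    scores = {}
    for name, model in models.items():
        model.train(corpus, num_topics=num_topics, preprocessing=False)
        # coherence() is assumed to return a dict keyed by the metric name.
        scores[name] = model.coherence(corpus, metric=metric)[metric]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


class StubModel:
    """Stand-in exposing the same train/coherence calls, for illustration only."""

    def __init__(self, score):
        self.score = score

    def train(self, data, num_topics, preprocessing):
        return "success"

    def coherence(self, corpus, metric="c_v"):
        return {metric: self.score}


ranking = compare_models(
    {"model_a": StubModel(0.4), "model_b": StubModel(0.6)},
    corpus=["some text", "another text"],
)
print(ranking)
```

With real models, model-specific training parameters would need to be handled per model rather than passed uniformly.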
Use in a Python environment
Install this package
pip install tomodapi
Use it in a Python script
from tomodapi import LdaModel
# init the model
m = LdaModel(model_path=path_location)
# train on a corpus
m.train(my_corpus, preprocessing=False, num_topics=10)
# infer topic of a sentence
best_topics = m.predict("In the time since the industrial revolution the climate has increasingly been affected by human activities that are causing global warming and climate change")
topic, confidence = best_topics[0]
# get top words for a given topic
print(m.topic(topic))
If the model_path is not specified, the library will load/save the model from/under models/<model_name>.
Web API
A web API is provided for accessing the library as a service.
Install dependencies
You should install 2 dependencies:
- mallet 2.0.8 to be placed in app\builtin
- glove.6B.50d.txt to be placed in app\builtin\glove
Under UNIX, you can use the download_dep.sh script.
sh download_dep.sh
Start the server
python server.py
Docker
Alternatively, you can run a docker container with
docker-compose -f docker-compose.yml up
The container uses mounted volumes so that you can easily update and access the computed models and the data files.
Manual Docker installation
docker build -t hyperted/topic .
docker run -p 27020:5000 --env APP_BASE_PATH=http://hyperted.eurecom.fr/topic/api -d -v /home/semantic/hyperted/tomodapi/models:/models -v /home/semantic/hyperted/tomodapi/data:/data --name hyperted_topic hyperted/topic
# Uninstall
docker stop hyperted_topic
docker rm hyperted_topic
docker rmi hyperted/topic
Publications
If you find this library or API useful in your research, please consider citing our paper:
@inproceedings{Lisena:NLPOSS2020,
author = {Pasquale Lisena and Ismail Harrando and Oussama Kandakji and Raphael Troncy},
title = {{ToModAPI: A Topic Modeling API to Train, Use and Compare Topic Models}},
booktitle = {2$^{nd}$ International Workshop for Natural Language Processing Open Source Software (NLP-OSS)},
year = {2020}
}