A framework for performing topic modelling

Topic Modeling API

This API is built to dynamically perform training, inference, and evaluation for different topic modeling techniques. It provides common interfaces and commands for accessing the different models, making it easier to compare them.

A demo is available at http://hyperted.eurecom.fr/topic.

Models

In this repository, we provide:

Each model exposes the following functions:

Training the model
m.train(data, num_topics, preprocessing) # => 'success'
Print the list of computed topics
for i, x in enumerate(m.topics):
    print(f'Topic {i}')
    for word, weight in zip(x['words'], x['weights']):
        print(f'- {word} => {weight}')
Access the info about a specific topic
x = m.topic(0)
words = x['words']
weights = x['weights']
Access the predictions computed on the training corpus
for i, p in enumerate(m.get_corpus_predictions(topn=3)): # predictions for each document
    print(f'Predictions on document {i}')
    for topic, confidence in p:
        print(f'- Topic {topic} with confidence {confidence}')
        # - Topic 21 with confidence 0.03927058187976461
Predict the topic of a new text
pred = m.predict(text, topn=3)
for topic, confidence in pred:
    print(f'- Topic {topic} with confidence {confidence}')
    # - Topic 21 with confidence 0.03927058187976461
Computing the coherence against a corpus
# coherence: Type of coherence to compute, among <c_v, c_npmi, c_uci, u_mass>. See https://radimrehurek.com/gensim/models/coherencemodel.html#gensim.models.coherencemodel.CoherenceModel
pred = m.coherence(mycorpus, metric='c_v')
print(pred)
#{
#  "c_v": 0.5186710138972105,
#  "c_v_std": 0.1810477961008996,
#  "c_v_per_topic": [
#    0.5845048872767505,
#    0.30693460230781777,
#    0.2611738203246824,
#    ...
#  ]
#}
Evaluating against a ground truth
# metric: Metric for computing the evaluation, among <purity, homogeneity, completeness, v-measure, nmi>.
res = m.get_corpus_predictions(topn=1)
v = m.evaluate(res, ground_truth_labels, metric='purity')
# 0.7825333630516738
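
For readers unfamiliar with the purity metric named above, here is a minimal sketch of how it is commonly defined: each predicted topic is assigned its most frequent ground-truth label, and purity is the fraction of documents matching that label. This illustrates the metric and is not necessarily how `m.evaluate` computes it. The function name `purity`, the flat label lists, and the toy data are assumptions made for the example.

```python
from collections import Counter, defaultdict


def purity(predicted_topics, ground_truth_labels):
    """Fraction of documents whose label is the majority label of their predicted topic."""
    if len(predicted_topics) != len(ground_truth_labels):
        raise ValueError("Both sequences must have the same length")
    if not predicted_topics:
        raise ValueError("At least one document is required")

    # Group the ground-truth labels by the topic predicted for each document.
    labels_by_topic = defaultdict(Counter)
    for topic, label in zip(predicted_topics, ground_truth_labels):
        labels_by_topic[topic][label] += 1

    # For each topic, count the documents that carry its most frequent label.
    majority_total = sum(max(counts.values()) for counts in labels_by_topic.values())
    return majority_total / len(predicted_topics)


# Hypothetical toy data: 6 documents, 2 predicted topics, 2 true labels.
predicted = [0, 0, 0, 1, 1, 1]
truth = ["sport", "sport", "politics", "politics", "politics", "sport"]
score = purity(predicted, truth)
print(score)
```

A purity of 1.0 means every predicted topic contains documents of a single label. The score tends to rise with the number of topics, so it is best compared between models trained with the same `num_topics`.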

The possible parameters can differ depending on the model.
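
Because every model exposes the same `coherence` method, comparing models can be as simple as looping over them. The sketch below assumes the result dictionary has the shape shown above, with the overall score stored under a key named after the metric. `compare_coherence` and the `StubModel` stand-in are hypothetical and exist only to make the example runnable without trained models.

```python
def compare_coherence(models, corpus, metric="c_v"):
    """Return (name, score) pairs sorted from most to least coherent.

    `models` maps a display name to any object exposing
    `coherence(corpus, metric=...)` as described above.
    """
    scores = []
    for name, model in models.items():
        result = model.coherence(corpus, metric=metric)
        # Assumption: the overall score is stored under the metric name.
        scores.append((name, result[metric]))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


class StubModel:
    """Hypothetical stand-in for a trained model; returns a fixed score."""

    def __init__(self, score):
        self._score = score

    def coherence(self, corpus, metric="c_v"):
        return {metric: self._score}


ranking = compare_coherence(
    {"model_a": StubModel(0.41), "model_b": StubModel(0.52)},
    corpus=["a toy document", "another toy document"],
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In practice the stubs would be replaced by trained models from this package, such as `LdaModel`, all evaluated on the same corpus.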

Use in a Python environment

Install this package

pip install tomodapi

Use it in a Python script

from tomodapi import LdaModel

# init the model
m = LdaModel(model_path=path_location)
# train on a corpus
m.train(my_corpus, preprocessing=False, num_topics=10)
# infer topic of a sentence
best_topics = m.predict("In the time since the industrial revolution the climate has increasingly been affected by human activities that are causing global warming and climate change")
topic, confidence = best_topics[0]
# get top words for a given topic
print(m.topic(topic))

If the model_path is not specified, the library will load/save the model from/under models/<model_name>.
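
The script above indexes `best_topics[0]` directly. A slightly more defensive way to pick the best prediction is sketched below, assuming predictions are a list of `(topic, confidence)` pairs as in the earlier examples. The helper name `top_prediction` and the `min_confidence` threshold are hypothetical and not part of the library.

```python
def top_prediction(predictions, min_confidence=0.0):
    """Return the most confident (topic, confidence) pair, or None.

    None is returned when the list is empty or when the best
    confidence is below `min_confidence`.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda pair: pair[1])
    return best if best[1] >= min_confidence else None


# Hypothetical predictions in the (topic, confidence) shape used above.
sample = [(21, 0.039), (4, 0.012), (7, 0.008)]
print(top_prediction(sample))
print(top_prediction(sample, min_confidence=0.5))
```

Returning `None` instead of raising lets the caller decide how to handle texts for which no topic is confident enough.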

Web API

A web API is provided for accessing the library as a service.

Install dependencies

You should install 2 dependencies:

Under UNIX, you can use the download_dep.sh script.

sh download_dep.sh
Start the server
python server.py

Docker

Alternatively, you can run a docker container with

docker-compose -f docker-compose.yml up

The container uses mounted volumes so that you can easily update and access the computed models and the data files.

Manual Docker installation

docker build -t hyperted/topic .
docker run -p 27020:5000 --env APP_BASE_PATH=http://hyperted.eurecom.fr/topic/api -d -v /home/semantic/hyperted/tomodapi/models:/models -v /home/semantic/hyperted/tomodapi/data:/data --name hyperted_topic hyperted/topic

# Uninstall
docker stop hyperted_topic
docker rm hyperted_topic
docker rmi hyperted/topic

Publications

If you find this library or API useful in your research, please consider citing our paper:

@inproceedings{Lisena:NLPOSS2020,
   author = {Pasquale Lisena and Ismail Harrando and Oussama Kandakji and Raphael Troncy},
   title =  {{ToModAPI: A Topic Modeling API to Train, Use and Compare Topic Models}},
   booktitle = {2$^{nd}$ International Workshop for Natural Language Processing Open Source Software (NLP-OSS)},
   year =   {2020}
}
