Topic Modeling Evaluation

A toolkit to quickly evaluate topic model goodness over the number of topics

Metrics

The following coherence measures are supported:

  • 'u_mass' is the fastest method; 'c_uci' is also known as c_pmi.

  • For 'u_mass', a corpus should be provided; if texts are provided instead, they are converted to a corpus using the dictionary.

  • For 'c_v', 'c_uci', and 'c_npmi', texts should be provided (a corpus isn't needed).
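
For intuition, the 'c_npmi' measure averages normalized pointwise mutual information (NPMI) over pairs of a topic's top words. Below is a minimal, self-contained sketch of that idea only; it is not the toolkit's or gensim's actual implementation, and the document-co-occurrence estimate of the probabilities is an assumption:

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words,
    estimating probabilities from document co-occurrence counts."""
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]

    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))  # normalize to [-1, 1]
    return sum(scores) / len(scores)

docs = [["fever", "cough"],
        ["fever", "cough", "fatigue"],
        ["fatigue", "headache"]]
# "fever" and "cough" always co-occur, so their NPMI is 1.0
print(npmi_coherence(["fever", "cough"], docs))
```

A coherence score near 1 means the topic's top words tend to appear in the same documents; scores near -1 mean they almost never do.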

Examples

Example 1: Estimate metrics for one topic model with a specific number of topics

from tm_eval import *
# load a pickled dictionary mapping each document key to its term list joined by ','
input_file = "datasets/covid19_symptoms.pickle"
output_folder = "outputs"
model_name = "symptom"
num_topics = 10
# run
results = evaluate_all_metrics_from_lda_model(input_file=input_file, 
                                              output_folder=output_folder,
                                              model_name=model_name, 
                                              num_topics=num_topics)
print(results)
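
The exact pickle schema is not documented here; based on the comment above, a plausible input file might be built as follows (the dict layout, keys, and file name are assumptions, not the toolkit's documented format):

```python
import pickle

# Hypothetical input: a dict mapping each document id to its
# terms joined by ',' (the exact schema is an assumption).
docs = {
    "patient_001": "fever,cough,fatigue",
    "patient_002": "headache,fever",
    "patient_003": "cough,sore throat,fatigue",
}

with open("covid19_symptoms.pickle", "wb") as f:
    pickle.dump(docs, f)

# Reading it back, each value splits into that document's term list.
with open("covid19_symptoms.pickle", "rb") as f:
    loaded = pickle.load(f)
print(loaded["patient_001"].split(","))  # -> ['fever', 'cough', 'fatigue']
```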

Example 2: Track how model goodness changes over the number of topics

from tm_eval import *
if __name__ == "__main__":
    # start configure
    # load a pickled dictionary mapping each document id to its term list joined by ','
    input_file = "datasets/covid19_symptoms.pickle"
    output_folder = "outputs"
    model_name = "symptom"
    start = 2
    end = 5
    # end configure
    # run and explore

    list_results = explore_topic_model_metrics(input_file=input_file, 
                                               output_folder=output_folder,
                                               model_name=model_name,
                                               start=start,
                                               end=end)
    # summarize results
    show_topic_model_metric_change(list_results, save=True,
                                   save_path=f"{output_folder}/metrics.csv")

    # plot metric changes
    plot_tm_metric_change(csv_path=f"{output_folder}/metrics.csv",
                          save=True, save_folder=output_folder)
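
Once metrics.csv exists, a simple way to choose the number of topics is to take the row with the highest c_v score. The sketch below uses only the standard library; the column names and values are assumptions for illustration, not the toolkit's documented output:

```python
import csv
import io

# Hypothetical contents of outputs/metrics.csv (column names assumed).
metrics_csv = """num_topics,c_v,u_mass,c_npmi,c_uci
2,0.41,-2.10,0.02,-1.50
3,0.48,-1.95,0.05,-1.20
4,0.52,-2.30,0.04,-1.35
5,0.47,-2.60,0.03,-1.60
"""

rows = list(csv.DictReader(io.StringIO(metrics_csv)))
# Higher c_v generally indicates more interpretable topics, so take its max.
best = max(rows, key=lambda r: float(r["c_v"]))
print(best["num_topics"])  # -> 4
```

In practice you would pass the real `{output_folder}/metrics.csv` path to `csv.DictReader` and inspect all four metrics, since they do not always agree on the best model.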

Output results

Plots of each coherence metric over the number of topics: c_v, u_mass, c_npmi, and c_uci.

License

The tm-eval toolkit is provided by Donghua Chen under the MIT License.
