
Topic Context Model (TCM)

Project description

Topic Context Model (TCM)

Calculates the surprisal of a word given a context.
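
Surprisal here is the information-theoretic quantity, i.e. the negative log-probability of a word given its context. For a topic model such as LDA, that probability is presumably obtained by marginalising over the inferred topics; the following is a sketch of the usual formulation in LaTeX notation, not necessarily the exact computation performed by tcm.py:

    S(w \mid c) = -\log P(w \mid c), \qquad
    P(w \mid c) = \sum_{t} P(w \mid t)\, P(t \mid c)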


Requirements

  • Python >= 3.10
  • scipy
  • scikit-learn
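
A minimal install, assuming the PyPI distribution name matches the sdist listed under "Download files" below:

$ pip install topic-context-model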

Usage

$ python tcm.py -h
usage: tcm [-h] [-V] [-m {lda,lsa}] [--model-file MODEL_FILE] [--data DATA [DATA ...]]
           [--fields FIELDS [FIELDS ...]] [--words WORDS] [--n-components N_COMPONENTS]
           [--doc-topic-prior DOC_TOPIC_PRIOR] [--topic-word-prior TOPIC_WORD_PRIOR]
           [--learning-method LEARNING_METHOD] [--learning-decay LEARNING_DECAY] [--learning-offset LEARNING_OFFSET]
           [--max-iter MAX_ITER] [--batch-size BATCH_SIZE] [--evaluate-every EVALUATE_EVERY] [--perp-tol PERP_TOL]
           [--mean-change-tol MEAN_CHANGE_TOL] [--max-doc-update-iter MAX_DOC_UPDATE_ITER] [--n-jobs N_JOBS]
           [--random-state RANDOM_STATE] [-v] [--log-format LOG_FORMAT] [--log-file LOG_FILE]
           [--log-file-format LOG_FILE_FORMAT]
           {train,surprisal} [{train,surprisal} ...]

positional arguments:
  {train,surprisal}     what to do: train the LDA/LSA model or calculate surprisal.

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -m {lda,lsa}, --model {lda,lsa}
                        which model to use. (default: lda)
  --model-file MODEL_FILE
                        file to load model from or save to, if path exists tries to load model. (default: lda.jl.z)
  --data DATA [DATA ...]
                        file(s) to load texts from, either txt or csv optionally gzip compressed. (default: None)
  --fields FIELDS [FIELDS ...]
                        field(s) to load texts when using csv data. (default: None)
  --words WORDS         file to load words from and/or save to, either txt or json optionally gzip compressed. (default: words.txt.gz)
  -v, --verbose         verbosity level; repeat the flag to increase the level, up to a maximum of 3 for debugging. (default: 0)
  --log-format LOG_FORMAT
                        set logging format. (default: %(message)s)
  --log-file LOG_FILE   log output to a file. (default: None)
  --log-file-format LOG_FILE_FORMAT
                        set logging format for log file. (default: [%(levelname)s] %(message)s)

LDA config:
  --n-components N_COMPONENTS
                        number of topics. (default: 10)
  --doc-topic-prior DOC_TOPIC_PRIOR
                        prior of document topic distribution `theta`. If the value is None, defaults to `1 / n_components`. (default: None)
  --topic-word-prior TOPIC_WORD_PRIOR
                        prior of topic word distribution `beta`. If the value is None, defaults to `1 / n_components`. (default: None)
  --learning-method LEARNING_METHOD
                        method used to update `_component`. (default: batch)
  --learning-decay LEARNING_DECAY
                        a parameter that controls the learning rate in the online learning method. The value should be set between (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size is `n_samples`, the update method is the same as batch learning. In the literature, this is called kappa. (default: 0.7)
  --learning-offset LEARNING_OFFSET
                        a (positive) parameter that downweights early iterations in online learning.  It should be greater than 1.0. In the literature, this is called tau_0. (default: 10.0)
  --max-iter MAX_ITER   the maximum number of passes over the training data (aka epochs). (default: 10)
  --batch-size BATCH_SIZE
                        number of documents to use in each EM iteration. Only used in online learning. (default: 128)
  --evaluate-every EVALUATE_EVERY
                        how often to evaluate perplexity. Set it to 0 or negative number to not evaluate perplexity in training at all. Evaluating perplexity can help you check convergence in training process, but it will also increase total training time. Evaluating perplexity in every iteration might increase training time up to two-fold. (default: -1)
  --perp-tol PERP_TOL   perplexity tolerance in batch learning. Only used when `evaluate_every` is greater than 0. (default: 0.1)
  --mean-change-tol MEAN_CHANGE_TOL
                        stopping tolerance for updating document topic distribution in E-step. (default: 0.001)
  --max-doc-update-iter MAX_DOC_UPDATE_ITER
                        max number of iterations for updating document topic distribution in the E-step. (default: 100)
  --n-jobs N_JOBS       the number of jobs to use in the E-step. `None` means 1. `-1` means using all processors. (default: None)
  --random-state RANDOM_STATE
                        pass an int for reproducible results across multiple function calls. (default: None)
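
All of the flags above come straight from the tool's help output. As a rough sketch (file names such as corpus.txt.gz and the hyperparameter values are illustrative placeholders, not part of the documentation), a session that first trains an LDA model and then computes surprisal might look like this:

$ # train an LDA model on a gzipped corpus and save it to lda.jl.z
$ python tcm.py train -m lda --data corpus.txt.gz --model-file lda.jl.z --n-components 20 --random-state 42
$ # reuse the saved model and word list to compute surprisal values
$ python tcm.py surprisal -m lda --model-file lda.jl.z --data corpus.txt.gz --words words.txt.gz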


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

topic_context_model-0.1.5.tar.gz (26.2 kB, Source)

Built Distribution

topic_context_model-0.1.5-py3-none-any.whl (28.4 kB, Python 3)

File details

Details for the file topic_context_model-0.1.5.tar.gz.

File metadata

  • Download URL: topic_context_model-0.1.5.tar.gz
  • Size: 26.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for topic_context_model-0.1.5.tar.gz
  • SHA256: 98fa61eb6f3287595b6c674d6b34ac7f07a62bc2d3d3e70967674cb7e75e28ca
  • MD5: 5a512c48116af337c148d4322d2282d9
  • BLAKE2b-256: de9a9267cddac67a8afe5ccce5d343c5791e5d4b2b66f52ec32af9f5b2cfaf94

See more details on using hashes here.
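
For example, the SHA256 digest above can be compared against a locally downloaded copy of the archive (illustrative shell session):

$ sha256sum topic_context_model-0.1.5.tar.gz
98fa61eb6f3287595b6c674d6b34ac7f07a62bc2d3d3e70967674cb7e75e28ca  topic_context_model-0.1.5.tar.gz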

File details

Details for the file topic_context_model-0.1.5-py3-none-any.whl.

File metadata

File hashes

Hashes for topic_context_model-0.1.5-py3-none-any.whl
  • SHA256: f9e92d9d7de2e172ef3d36bc324c9384810841d705e0514cc5d69f68e4fbf0f5
  • MD5: 2d1310fdf4c6f1ad6ace9b54a2a2aa80
  • BLAKE2b-256: aac2abda15e472a0b8349c27122a4a498123b3ab3b50301bf7214ed6478658e3

See more details on using hashes here.
