Latent Semantic Analysis

Pipeline for training LSA models using Scikit-Learn.

Usage

Instead of writing custom code for latent semantic analysis, you only need to:

  1. install the pipeline:
pip install latent-semantic-analysis
  2. run the pipeline:
  • either in the terminal:
lsa-train --path_to_config config.yaml
  • or in Python:
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

NOTE: the config file is described in the Config section.

No data preparation is needed: the only input is a CSV file with a column of raw text (the column can have any name).
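As a minimal sketch of the expected input, the snippet below writes a small CSV file with one raw text column, using only the standard library. The sample sentences, the temporary folder, and the column name `text` are assumptions for illustration; the column name only has to match `text_column` in the config.

```python
import csv
import os
import tempfile

# Hypothetical sample documents; any raw, unprocessed text works.
texts = [
    "Latent semantic analysis maps documents to topics.",
    "TF-IDF weighs terms by how rare they are.",
    "Truncated SVD reduces the TF-IDF matrix.",
]

# Written to a temporary folder here; in practice this would be the
# location given as data_path in the config.
data_path = os.path.join(tempfile.mkdtemp(), "data.csv")

with open(data_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])  # header row: the name of the text column
    writer.writerows([t] for t in texts)  # one document per row

# Read the file back to check that the text column is present.
with open(data_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

The `csv` module quotes fields that contain the separator, so documents with commas stay in a single column.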

Config

The user interface consists of a single file:

  • config.yaml - general configuration with sklearn TF-IDF and SVD parameters

Edit config.yaml to create the desired configuration, then train the LSA model with the following command:

  • terminal:
lsa-train --path_to_config config.yaml
  • python:
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack

NOTE: the tf-idf and svd sections hold the parameters of sklearn's TfidfVectorizer and TruncatedSVD, respectively, so you can configure instances of these classes however you want.
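To illustrate how these two config sections relate to sklearn, here is a rough sketch that passes them as keyword arguments to TfidfVectorizer and TruncatedSVD inside an sklearn Pipeline. It is not the package's actual code. The toy corpus, the step names, the use of `seed` as `random_state`, and the `ast.literal_eval` conversion of `ngram_range` are all assumptions.

```python
import ast

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# The tf-idf and svd sections of the default config, written as plain dicts.
# n_components is lowered from 10 to 2 because the toy corpus below is tiny
# (arpack needs n_components < min(n_documents, n_terms)).
tfidf_params = {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1}
svd_params = {"n_components": 2, "algorithm": "arpack"}
seed = 42

# A YAML loader reads "(1, 1)" as a string; turning it into a tuple like this
# is an assumption, not necessarily how the package handles it.
tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])

pipeline = Pipeline(
    [
        ("tf-idf", TfidfVectorizer(**tfidf_params)),
        ("svd", TruncatedSVD(random_state=seed, **svd_params)),
    ]
)

# Hypothetical toy corpus standing in for the text column of the CSV file.
docs = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the dog chased the cat around the garden",
    "central banks raised interest rates",
]

# One row per document, one column per SVD component.
doc_embeddings = pipeline.fit_transform(docs)
```

Any other keyword argument accepted by the two classes could be added to the dicts in the same way.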

Output

After training the model, the pipeline produces the following files:

  • model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
  • config.yaml - config that was used to train the model
  • logging.txt - logging file
  • doc2topic.json - document embeddings
  • term2topic.json - term embeddings
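Since model.joblib is an sklearn pipeline, it can presumably be loaded with joblib and applied to new text. The sketch below is self-contained: it first fits and saves a small stand-in pipeline, then loads it the way a trained model would be loaded. The corpus, the step names, and the temporary path are hypothetical. The layout of doc2topic.json and term2topic.json is not specified here, so the example does not read them. Taking term embeddings from the transposed SVD components is an assumption about what term2topic.json corresponds to.

```python
import os
import tempfile

import joblib
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus used only to build a stand-in model.
docs = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Stand-in for a trained model.joblib: a tiny TF-IDF + SVD pipeline.
stand_in = Pipeline(
    [
        ("tf-idf", TfidfVectorizer()),
        ("svd", TruncatedSVD(n_components=2, random_state=42)),
    ]
).fit(docs)

model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(stand_in, model_path)

# Load the saved pipeline and embed a new, unseen document.
model = joblib.load(model_path)
new_embeddings = model.transform(["cats and dogs in the garden"])

# One row per vocabulary term, one column per SVD component.
term_embeddings = model.named_steps["svd"].components_.T
vocabulary = model.named_steps["tf-idf"].vocabulary_  # term -> row index
```

Loading a joblib file executes pickled code, so only files from a trusted source should be loaded this way.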

Requirements

Python >= 3.6

Citation

If you use latent-semantic-analysis in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}
