Basic Latent Semantic Analysis library

A Latent Semantic Analysis Library
==================================

Author(s): Keith Murray

Contact: kmurrayis@gmail.com

Requirements:
=============
Python 2.7.6

Standard libraries: os, sys, and math

Added libraries: numpy, sklearn

Installation:
=============
```shell
pip install lsalib
```


Usage:
======
This library provides a single class, termDocMatrix.
It was built to follow the thesis of Sam Way, available at http://digitalcommons.unl.edu/elecengtheses/42/

```python
>>> import lsalib
# To use this, initialize a variable:
>>> lsa = lsalib.termDocMatrix()

# After this, you can add documents to the matrix. This can be done in a number of ways.
# With Strings:
>>> lsa.add("HELLO WORLD! THE WIND RISES, WE MUST TRY TO LIVE")

# With Dictionaries (a key:count relationship)
>>> lsa.add({"tree":5, "apple":3, "WORLD":8, "planes":2})

# With lists of strings:
>>> lsa.add(["apples", "oranges", "apples", "WORLD", "HELLO"])

# With lists of dictionaries, each following the key:count relationship
# (D1 through D4 stand for dictionaries like the one above):
>>> lsa.add([D1, D2, D3, D4])
```

Note that no processing is done on any of the inputs.
Inputs are therefore case sensitive, and any symbol attached to a word, such as a comma,
is included in the term list.
As a result, "Apples", "apples", and "apples," are all treated as unique words.
If this is undesirable, preprocess the strings before calling lsa.add().
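
As an illustration of such preprocessing, here is a minimal sketch. The `preprocess` helper is hypothetical (it is not part of lsalib), and the comparison assumes terms are separated by whitespace, which this README does not confirm.

```python
import string
from collections import Counter

def preprocess(text):
    """Lowercase the text and strip punctuation from the edges of each token.

    Hypothetical helper, not part of lsalib.
    """
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return " ".join(tok for tok in tokens if tok)

raw = "Apples, apples and APPLES!"

# Without preprocessing, a whitespace split sees "Apples,", "apples"
# and "APPLES!" as three different terms.
raw_counts = Counter(raw.split())

# After preprocessing, they collapse into the single term "apples".
clean_counts = Counter(preprocess(raw).split())

# Possible usage with the library (assumes lsa = lsalib.termDocMatrix()):
# lsa.add(preprocess(raw))
```

Stripping only the edges of each token keeps internal punctuation such as the apostrophe in "don't"; a different cleaning rule may suit other corpora better.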

As each document is added to the matrix, a term frequency weighting is applied.
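
The README does not state which term frequency formula the library applies. The sketch below assumes one common choice, the raw count of a term divided by the total number of terms in the document; the `term_frequencies` function is illustrative only and is not lsalib's implementation.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: count / total_terms} for one document.

    Assumed formula for illustration; lsalib may weight differently.
    """
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = float(sum(counts.values()))
    return {term: count / total for term, count in counts.items()}

tf = term_frequencies("the wind rises the wind falls".split())
# "wind" occurs 2 times out of 6 tokens, so its weight is 2/6.
```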




```python
# Once all documents are added to the matrix, the inverse document frequency weighting
# can be called:
>>> lsa.weight_idf()

# Once that has completed, use lsa.nmf to reduce the weighted term-document matrix
# to its k basis components for the terms and documents:
>>> P, Q = lsa.nmf(5)

# P is the basis vector set for the terms, with dimensions terms x k.
# Q is the basis vector set for the documents, with dimensions docs x k.
# P x Q.T yields an approximation of the original term-document matrix, with some error.

# This error is stored in lsa.er
>>> print lsa.er
```
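
To make the two steps above concrete, here is a plain-Python sketch on a tiny 3 terms x 2 documents matrix. It assumes the textbook idf weight log(N / document frequency) and measures the approximation error with the Frobenius norm; the README confirms neither choice for lsalib, and the function names and the factors P and Q are hand-picked for illustration rather than produced by lsa.nmf.

```python
import math

def idf_weights(matrix):
    """Return one idf weight per term row, assuming log(n_docs / doc_freq)."""
    n_docs = len(matrix[0])
    weights = []
    for row in matrix:
        doc_freq = sum(1 for value in row if value > 0)
        weights.append(math.log(float(n_docs) / doc_freq) if doc_freq else 0.0)
    return weights

def reconstruct(P, Q):
    """Return P x Q.T for P (terms x k) and Q (docs x k)."""
    return [[sum(p * q for p, q in zip(p_row, q_row)) for q_row in Q]
            for p_row in P]

def frobenius_error(A, B):
    """Return the Frobenius norm of A - B (an assumed error measure)."""
    return math.sqrt(sum((a - b) ** 2
                         for a_row, b_row in zip(A, B)
                         for a, b in zip(a_row, b_row)))

# Term-frequency matrix: rows are terms, columns are documents.
tf = [[0.5, 0.5],   # term 0 appears in both documents
      [0.5, 0.0],   # term 1 appears only in document 0
      [0.0, 0.5]]   # term 2 appears only in document 1

weights = idf_weights(tf)  # term 0 gets log(2/2) = 0, the others log(2/1)
weighted = [[value * w for value in row] for row, w in zip(tf, weights)]

# Hand-picked factors with k = 2: P is terms x k, Q is docs x k.
P = [[0.0, 0.0], [0.3, 0.0], [0.0, 0.3]]
Q = [[1.0, 0.0], [0.0, 1.0]]

approx = reconstruct(P, Q)                 # terms x docs, same shape as weighted
error = frobenius_error(weighted, approx)  # nonzero: P and Q are only approximate
```

A term that appears in every document receives an idf weight of zero under this formula, which is why the first row of `weighted` vanishes.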
