
Run stable random projections.


pySRP

Python module implementing Stable Random Projections.

These create interchangeable, data-agnostic vectorized representations of text suitable for a variety of contexts.

You may want to use them in concert with the pre-distributed Hathi SRP features.
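To make the idea concrete, here is a minimal sketch of a hash-based random projection in plain Python. It is not pySRP's implementation: the helper names `word_signs` and `project` are hypothetical, and the hashing scheme and the way counts are weighted are assumptions made for illustration. What it shows is why such vectors are data-agnostic: each word's contribution depends only on the word itself, so no vocabulary has to be shared in advance.

```python
import hashlib


def word_signs(word, dims):
    """Deterministically map a word to a list of +1/-1 values (illustrative only)."""
    signs = []
    counter = 0
    while len(signs) < dims:
        # Hash the word with a counter so we can draw as many bits as needed.
        digest = hashlib.sha1(f"{word}_{counter}".encode("utf-8")).digest()
        for byte in digest:
            for bit in range(8):
                signs.append(1 if (byte >> bit) & 1 else -1)
        counter += 1
    return signs[:dims]


def project(words, counts, dims=16):
    """Sum each word's sign vector, weighted by its count."""
    total = [0.0] * dims
    for word, count in zip(words, counts):
        for i, sign in enumerate(word_signs(word, dims)):
            total[i] += sign * count
    return total


vector = project(["foo", "bar"], [1, 2])
```

Because the signs come from a hash rather than from a fitted model, two people projecting the same text with the same scheme get the same vector without exchanging any data.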

Installation

Python 3

pip3 install git+https://github.com/bmschmidt/pySRP.git

Python 2

pip install git+https://github.com/bmschmidt/pySRP.git

Usage

Examples

See the docs folder for some IPython notebooks with worked examples.

Basic Usage

Use the SRP class to build an object to perform transformations.

SRP is a class rather than a function, so that each instance can build a cache of previously seen words.

import SRP
# initialize with desired number of dimensions
hasher = SRP.SRP(640)

The most important method is 'stable_transform'.

This can tokenize and then compute the SRP.

hasher.stable_transform(words = "foo bar bar")

If counts are already computed, word and count vectors can be passed separately.

hasher.stable_transform(words = ["foo", "bar"], counts = [1, 2])
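If you only have raw text, one way to produce the word and count vectors yourself is with `collections.Counter`. This is a sketch under assumptions: the helper `words_and_counts` is hypothetical, and its lowercasing and `\w+` tokenization are illustrative choices that may not match pySRP's own tokenizer, so the result is not guaranteed to equal what you get from passing the string directly.

```python
import re
from collections import Counter


def words_and_counts(text):
    """Split text into tokens and return parallel lists of words and counts."""
    # Assumption: lowercase and split on runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    tally = Counter(tokens)
    words = list(tally)
    counts = [tally[word] for word in words]
    return words, counts


words, counts = words_and_counts("foo bar bar")
# With pySRP installed, these could then be passed as shown above:
# hasher.stable_transform(words = words, counts = counts)
```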

Read/write tools

SRP files are stored in a binary format to save space. This is the same format used by binary word2vec files.

file = SRP.Vector_file("hathivectors.bin")

for (key,vector) in file:
  pass
  # 'key' is a unique identifier for a document in a corpus
  # 'vector' is a `numpy.array` of type `<f4`.
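For readers unfamiliar with the word2vec binary layout mentioned above, the following sketch writes and reads it using only the standard library. It assumes the common layout: a text header with the row count and dimensions, then for each row the key, a space, and the vector as little-endian 32-bit floats (the `<f4` type noted above), followed by a newline. The function names are hypothetical, and pySRP's files may differ in details, so treat this as an illustration rather than a replacement for `SRP.Vector_file`.

```python
import io
import struct


def write_w2v_binary(stream, rows, dims):
    """Write (key, vector) pairs in a word2vec-style binary layout."""
    stream.write(f"{len(rows)} {dims}\n".encode("utf-8"))
    for key, vector in rows:
        stream.write(key.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dims}f", *vector))
        stream.write(b"\n")


def read_w2v_binary(stream):
    """Yield (key, vector) pairs from a word2vec-style binary stream."""
    n_rows, dims = (int(x) for x in stream.readline().split())
    for _ in range(n_rows):
        key = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a key")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous row
                key.extend(ch)
        vector = struct.unpack(f"<{dims}f", stream.read(4 * dims))
        yield key.decode("utf-8"), vector


# Round-trip two small rows through an in-memory buffer.
buffer = io.BytesIO()
write_w2v_binary(buffer, [("doc1", [1.0, -2.0]), ("doc2", [0.5, 0.25])], dims=2)
buffer.seek(0)
rows = list(read_w2v_binary(buffer))
```

Note that this layout relies on keys containing no spaces, since the space marks where the key ends and the floats begin.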

There are two other methods. One lets you read an entire matrix in at once. This may require lots of memory. It returns a dictionary with two keys: 'matrix' (a numpy array) and 'names' (the row names).

# Named 'vectors' rather than 'all' to avoid shadowing the Python builtin.
vectors = SRP.Vector_file("hathivectors.bin").to_matrix()
vectors['matrix'][:5]
vectors['names'][:5]

The other lets you treat the file as a dictionary of keys. The first lookup may take a very long time; subsequent lookups will be fast without requiring you to load the vectors into memory. To get a 1-dimensional representation of a book:

vectors = SRP.Vector_file("hathivectors.bin")
vectors['gri.ark:/13960/t3032jj3n']

Thanks to Peter Organisciak, you can also access multiple vectors at once by passing a list of identifiers. For a 160-dimensional representation, this returns a matrix with shape (2, 160).

vectors[['gri.ark:/13960/t3032jj3n', 'hvd.1234532123']]
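A slow first lookup followed by fast ones is the behaviour you would expect from an index of byte offsets that is built lazily. The sketch below illustrates that pattern on a tiny in-memory file. It is not pySRP's code: the class `LazyVectorIndex` is hypothetical, and both the indexing strategy and the word2vec-style layout it parses are assumptions.

```python
import io
import struct


class LazyVectorIndex:
    """Hypothetical dict-like reader: scans once for offsets, then seeks."""

    def __init__(self, stream):
        self.stream = stream
        self.offsets = None  # built on the first lookup
        self.dims = None

    def _build_index(self):
        # One full pass over the file, recording where each vector starts.
        self.stream.seek(0)
        n_rows, self.dims = (int(x) for x in self.stream.readline().split())
        self.offsets = {}
        for _ in range(n_rows):
            key = bytearray()
            while True:
                ch = self.stream.read(1)
                if not ch:
                    raise EOFError("file ended in the middle of a key")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline that ends the previous row
                    key.extend(ch)
            self.offsets[key.decode("utf-8")] = self.stream.tell()
            self.stream.seek(4 * self.dims, io.SEEK_CUR)  # skip the floats

    def __getitem__(self, key):
        if self.offsets is None:
            self._build_index()
        self.stream.seek(self.offsets[key])
        return struct.unpack(f"<{self.dims}f", self.stream.read(4 * self.dims))


# Build a tiny two-row file in memory to exercise the index.
buf = io.BytesIO()
buf.write(b"2 2\n")
for name, vec in [("doc1", (1.0, -2.0)), ("doc2", (0.5, 0.25))]:
    buf.write(name.encode("utf-8") + b" " + struct.pack("<2f", *vec) + b"\n")

index = LazyVectorIndex(buf)
second = index["doc2"]  # first lookup triggers the scan
first = index["doc1"]   # later lookups only seek and read
```

Only the keys and offsets are held in memory; each vector is read from disk when it is requested.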

Writing to SRP files

You can build your own files row by row.

# Note: the dimensions of the file and the hasher should be equal.
output = SRP.Vector_file("new_vectors.bin", dims=640, mode="w")
hasher = SRP.SRP(640)

# a, b, c, and d are placeholders for paths to your own text files.
for filename in [a, b, c, d]:
    # Read the whole file and close it before hashing.
    with open(filename) as f:
        vector = hasher.stable_transform(f.read())
    output.add_row(filename, vector)

# Files must be closed.
output.close()
