Run stable random projections.
Project description
pySRP
Python module implementing Stable Random Projections.
These create interchangeable, data-agnostic vectorized representations of text suitable for a variety of contexts.
You may want to use them in concert with the pre-distributed Hathi SRP features.
Installation
Python 3
pip3 install git+https://github.com/bmschmidt/pySRP.git
Python 2
pip install git+https://github.com/bmschmidt/pySRP.git
Usage
Examples
See the docs folder for some IPython notebooks demonstrating:
- Taking a subset of the full Hathi collection (100,000 works of fiction) based on identifiers, and exploring the major clusters within fiction.
- Creating a new SRP representation of text files and plotting dimensionality reductions of them by language and time.
- Searching for copies of one set of books in the full HathiTrust collection, and using Hathi metadata to identify duplicates and find errors in local item descriptions.
- Training a classifier based on library metadata using TensorFlow, and then applying that classification to other sorts of text.
Basic Usage
Use the SRP class to build an object to perform transformations.
SRP is a class rather than a plain function so that each instance can build a cache of previously seen words.
import SRP
# initialize with desired number of dimensions
hasher = SRP.SRP(640)
The most important method is 'stable_transform'.
Given a string, it tokenizes the text and then computes the SRP.
hasher.stable_transform(words = "foo bar bar")
If counts are already computed, word and count vectors can be passed separately.
hasher.stable_transform(words = ["foo","bar"],counts = [1,2])
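To illustrate why a string and a precomputed (words, counts) pair can lead to the same vector, here is a minimal standard-library sketch of a hashing-based projection. It is not pySRP's implementation: the names `word_signs` and `sketch_transform`, the whitespace tokenizer, the salted SHA-1 scheme, and the `log(count + 1)` weighting are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


def word_signs(word, dims):
    """Deterministically map a word to `dims` values in {-1, +1} (illustrative)."""
    signs = []
    salt = 0
    while len(signs) < dims:
        # Hash the word plus a counter so we can draw as many bits as needed.
        digest = hashlib.sha1(f"{word}_{salt}".encode("utf-8")).digest()
        for byte in digest:
            for bit in range(8):
                signs.append(1 if (byte >> bit) & 1 else -1)
        salt += 1
    return signs[:dims]


def sketch_transform(words, counts=None, dims=640):
    """Sum per-word sign vectors, weighted by damped counts (assumed weighting)."""
    if counts is None:
        # Naive whitespace tokenization; the real library's tokenizer may differ.
        tokens = words.split() if isinstance(words, str) else list(words)
        tally = Counter(tokens)
        words, counts = list(tally), list(tally.values())
    vector = [0.0] * dims
    for word, count in zip(words, counts):
        weight = math.log(count + 1)
        for i, sign in enumerate(word_signs(word, dims)):
            vector[i] += sign * weight
    return vector


from_text = sketch_transform("foo bar bar", dims=640)
from_counts = sketch_transform(["foo", "bar"], [1, 2], dims=640)
```

Because each word's sign pattern depends only on the word itself, two people hashing different corpora get comparable vectors without sharing a vocabulary, which is the sense in which such representations are interchangeable.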
Read/write tools
SRP files are stored in a binary format to save space. This is the same format used by binary word2vec files.
file = SRP.Vector_file("hathivectors.bin")
for (key, vector) in file:
    # 'key' is a unique identifier for a document in a corpus
    # 'vector' is a `numpy.array` of type `<f4`.
    pass
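For readers unfamiliar with the word2vec binary layout, the sketch below writes and reads a tiny file of that general shape using only the standard library: a text header with the row count and dimensions, then each key followed by a space and little-endian 32-bit floats. The function names are hypothetical, and the exact on-disk details pySRP uses may differ from this sketch.

```python
import io
import struct


def write_word2vec_binary(stream, rows, dims):
    """Write (key, vector) pairs in a word2vec-style binary layout."""
    stream.write(f"{len(rows)} {dims}\n".encode("utf-8"))
    for key, vector in rows:
        stream.write(key.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dims}f", *vector))  # '<f' matches dtype '<f4'
        stream.write(b"\n")


def read_word2vec_binary(stream):
    """Yield (key, vector) pairs one row at a time, without loading the whole file."""
    n_rows, dims = (int(x) for x in stream.readline().split())
    for _ in range(n_rows):
        key = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a key")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that closes the previous row
                key.extend(ch)
        vector = struct.unpack(f"<{dims}f", stream.read(4 * dims))
        yield key.decode("utf-8"), list(vector)


# Round trip through an in-memory buffer with made-up identifiers.
rows = [("doc-a", [1.0, -2.0, 0.5]), ("doc-b", [0.0, 3.0, 4.0])]
buffer = io.BytesIO()
write_word2vec_binary(buffer, rows, dims=3)
buffer.seek(0)
recovered = list(read_word2vec_binary(buffer))
```

Note that this simple reader assumes keys contain no spaces, since a space marks the end of the key.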
There are two other methods. One lets you read an entire matrix in at once. This may require lots of memory. It returns a dictionary with two keys: 'matrix' (a numpy array) and 'names' (the row names).
# Avoid naming the result `all`, which would shadow the Python builtin.
loaded = SRP.Vector_file("hathivectors.bin").to_matrix()
loaded['matrix'][:5]
loaded['names'][:5]
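As one possible use of the 'matrix' and 'names' pair, the sketch below finds the row most similar to a named document by cosine similarity. It uses a small stand-in dictionary with nested lists and invented names so that it runs without any data file; with real output, 'matrix' would be a numpy array and vectorized operations would be the more natural choice.

```python
import math

# Stand-in for the dictionary described above (hypothetical names and values).
example = {
    "names": ["doc-a", "doc-b", "doc-c"],
    "matrix": [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(data, name):
    """Return the name of the row closest to `name`, excluding itself."""
    row_of = {n: i for i, n in enumerate(data["names"])}
    target = data["matrix"][row_of[name]]
    scores = [
        (cosine(target, row), other)
        for other, row in zip(data["names"], data["matrix"])
        if other != name
    ]
    return max(scores)[1]


closest = most_similar(example, "doc-a")
```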
The other lets you treat the file as a dictionary of keys. The first lookup may take a very long time; subsequent lookups will be fast without requiring you to load the vectors into memory. To get a 1-dimensional representation of a book:
vectors = SRP.Vector_file("hathivectors.bin")
vectors['gri.ark:/13960/t3032jj3n']
You can also, thanks to Peter Organisciak, access multiple vectors at once this way by passing a list of identifiers. This returns a matrix with shape (2, 160) for a 160-dimensional representation.
vectors[['gri.ark:/13960/t3032jj3n', 'hvd.1234532123']]
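A slow first lookup followed by fast ones is what you would expect if the reader scans the file once to record where each row starts and then seeks directly to the requested rows. The sketch below shows that pattern with a hypothetical `OffsetIndexedVectors` class over a word2vec-style layout. Whether pySRP works exactly this way is an assumption, not something the text above confirms.

```python
import io
import struct


class OffsetIndexedVectors:
    """Hypothetical reader that seeks to rows instead of loading them all."""

    def __init__(self, stream):
        self.stream = stream
        n_rows, dims = stream.readline().split()
        self.n_rows, self.dims = int(n_rows), int(dims)
        self.data_start = stream.tell()
        self.offsets = None  # built lazily on the first lookup

    def _build_index(self):
        # One pass over the file: read each key, note where its vector starts,
        # and skip over the vector bytes without decoding them.
        offsets = {}
        self.stream.seek(self.data_start)
        for _ in range(self.n_rows):
            key = bytearray()
            while True:
                ch = self.stream.read(1)
                if not ch:
                    raise EOFError("file ended in the middle of a key")
                if ch == b" ":
                    break
                if ch != b"\n":
                    key.extend(ch)
            offsets[key.decode("utf-8")] = self.stream.tell()
            self.stream.seek(4 * self.dims, io.SEEK_CUR)
        self.offsets = offsets

    def __getitem__(self, key):
        if self.offsets is None:
            self._build_index()
        if isinstance(key, list):
            return [self[k] for k in key]  # one row per identifier
        self.stream.seek(self.offsets[key])
        raw = self.stream.read(4 * self.dims)
        return list(struct.unpack(f"<{self.dims}f", raw))


# Tiny in-memory file with made-up identifiers and 3-dimensional vectors.
demo = io.BytesIO()
demo.write(b"2 3\n")
for name, values in [("doc-a", [1.0, 2.0, 3.0]), ("doc-b", [4.0, 5.0, 6.0])]:
    demo.write(name.encode("utf-8") + b" " + struct.pack("<3f", *values) + b"\n")
demo.seek(0)

indexed = OffsetIndexedVectors(demo)
single = indexed["doc-b"]
several = indexed[["doc-a", "doc-b"]]
```

Here `several` is a list of two rows with three values each, analogous to the (2, 160) matrix described above. Only the key-to-offset dictionary stays in memory, not the vectors themselves.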
Writing to SRP files
You can build your own files row by row.
# Note: the dimensions of the file and the hasher should be equal.
output = SRP.Vector_file("new_vectors.bin", dims=640, mode="w")
hasher = SRP.SRP(640)

# Placeholder paths; replace these with your own text files.
filenames = ["a.txt", "b.txt", "c.txt", "d.txt"]
for filename in filenames:
    with open(filename) as f:
        vector = hasher.stable_transform(f.read())
    output.add_row(filename, vector)

# Files must be closed.
output.close()
File details
Details for the file pysrp-1.1.0.tar.gz.
File metadata
- Download URL: pysrp-1.1.0.tar.gz
- Upload date:
- Size: 17.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 113ca04c1ad6ddd6e5bcaeaabc4f0499a6ec458431f667a02973b4870073376b
MD5 | 35ba5f34ea2977335c61de1685cfbac3
BLAKE2b-256 | 5cfcaebfe2f403fa511b42d2441a8a8ca23685bea97750dac36be6602184efb6
File details
Details for the file pysrp-1.1.0-py3-none-any.whl.
File metadata
- Download URL: pysrp-1.1.0-py3-none-any.whl
- Upload date:
- Size: 17.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 317a56a8ad2e58faf6cafddc0ba11e6a0f9bc0f368a50cfbe05a3b7aad699afd
MD5 | e2b4ac47593e921d9d2d6bc0f33de08b
BLAKE2b-256 | c69fbb58e7b05257175964d8fabf022aa54d139d60cba0d98968d5d3a9362218