Word Vectors

A fast light-weight library for loading (and saving) word vectors.

Reading

The default way to read in word vectors is the read function. It sniffs the file to try to determine which format it is in, and raises an exception if the sniffing fails. Don't worry, you can still read your vectors in that case; you just need to call the read function specific to the vector file format you are using.
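To give a feel for what sniffing involves, here is a minimal sketch that separates the two text layouts described below by looking only at the first line. It is not the library's actual sniffing logic: `sniff_first_line` and the labels it returns are hypothetical names, and the heuristic assumes a word2vec-style header is exactly two integers.

```python
def sniff_first_line(first_line: str) -> str:
    """Guess the layout of a text vector file from its first line.

    Hypothetical helper for illustration only; the library's real
    sniffing may use different checks.
    """
    tokens = first_line.split()
    # A word2vec-style file starts with "<vocab size> <vector size>".
    if len(tokens) == 2 and all(t.isdigit() for t in tokens):
        return "word2vec-header"
    # Otherwise assume the line is already "word v1 v2 ... vn".
    return "glove"


print(sniff_first_line("400000 300\n"))
print(sniff_first_line("the 0.1 0.2 0.3\n"))
```

A heuristic like this can be fooled (for example, a one-dimensional vector whose word is itself a number), which is one reason a format-specific reader is a useful fallback.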

File Types

There are a few common word vector file formats used in the NLP community. The supported formats are described here.

GloVe

A simple plain text format. Each line holds a word followed by the elements of its vector, all separated by single spaces. This format is both slow to read and space inefficient.

This can be read with the read_glove function.
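As an illustration of the layout only, the following standard-library sketch parses lines of the form `word v1 v2 ... vn`. The name `parse_glove_lines` is hypothetical and is not part of this library; it also assumes words contain no spaces.

```python
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style lines into a word -> vector mapping."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        # The first token is the word; the remaining tokens are the vector.
        word, *values = line.split(" ")
        vectors[word] = [float(v) for v in values]
    return vectors


sample = ["the 0.1 0.2 0.3\n", "cat -0.5 0.25 1.0\n"]
parsed = parse_glove_lines(sample)
print(parsed["cat"])  # [-0.5, 0.25, 1.0]
```

Every number has to be parsed from its decimal text form, which is where the slowness and extra file size come from.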

Word2Vec

Text

A text format that is the same as the GloVe format except that the first line contains two numbers: the number of elements in the vocabulary and the size of the vectors. These numbers are not very helpful because some of these vector files list the same word on multiple lines, so pre-allocating your vectors based on these numbers doesn't really work. Like GloVe, this format is both slow to read and space inefficient.

This can be read with the read_w2v_text function.
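The sketch below shows why the header count can overstate the real vocabulary. It uses only the standard library; `parse_w2v_text` is a hypothetical name, not this library's reader, and the sample data is made up for illustration.

```python
from typing import Dict, List, Tuple


def parse_w2v_text(lines: List[str]) -> Tuple[int, int, Dict[str, List[float]]]:
    """Parse word2vec text lines; returns (declared vocab, dim, vectors)."""
    header, *rows = lines
    declared_vocab, declared_dim = (int(x) for x in header.split())
    vectors: Dict[str, List[float]] = {}
    for row in rows:
        row = row.rstrip()
        if not row:
            continue
        word, *values = row.split(" ")
        if len(values) != declared_dim:
            raise ValueError(f"expected {declared_dim} values for {word!r}")
        # A repeated word overwrites the earlier row, so the final
        # vocabulary can be smaller than the header claims.
        vectors[word] = [float(v) for v in values]
    return declared_vocab, declared_dim, vectors


sample = ["3 2\n", "the 0.1 0.2\n", "cat 0.3 0.4\n", "the 0.5 0.6\n"]
declared_vocab, dim, vectors = parse_w2v_text(sample)
print(declared_vocab, len(vectors))  # 3 2
```

Growing a mapping as rows arrive sidesteps the mismatch, at the cost of not being able to allocate one array up front.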

Binary

A simple binary format where the first line holds the number of items in the vocab and the size of the vectors. Each entry after that is a word, a space, and then the vector as a binary string. This format is compact but slow to read because you have to read one byte at a time to find the end of each word.

This can be read with the read_w2v function.
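The byte-at-a-time scan can be sketched with the standard library as follows. This is not the library's implementation: `read_until` and `read_w2v_binary_sketch` are hypothetical names, and the sketch assumes little-endian 32-bit floats, UTF-8 words, and an optional newline after each vector.

```python
import io
import struct
from typing import BinaryIO, Dict, Tuple


def read_until(f: BinaryIO, delimiter: bytes) -> bytes:
    """Read one byte at a time until `delimiter` (consumed) or EOF."""
    chunk = bytearray()
    while True:
        b = f.read(1)
        if not b or b == delimiter:
            return bytes(chunk)
        chunk += b


def read_w2v_binary_sketch(f: BinaryIO) -> Dict[str, Tuple[float, ...]]:
    # Header line: "<vocab size> <vector size>\n"
    vocab_size, dim = (int(x) for x in read_until(f, b"\n").split())
    fmt = f"<{dim}f"  # assumption: little-endian float32 values
    width = struct.calcsize(fmt)
    vectors: Dict[str, Tuple[float, ...]] = {}
    for _ in range(vocab_size):
        # The word length is unknown, so scan byte by byte for the space.
        word = read_until(f, b" ").lstrip(b"\n").decode("utf-8")
        # The vector has a known width, so it is read in one call.
        vectors[word] = struct.unpack(fmt, f.read(width))
    return vectors


payload = b"2 3\n"
payload += b"the " + struct.pack("<3f", 0.5, 1.0, -2.0) + b"\n"
payload += b"cat " + struct.pack("<3f", 0.25, 0.0, 4.0) + b"\n"
vectors = read_w2v_binary_sketch(io.BytesIO(payload))
print(vectors["cat"])  # (0.25, 0.0, 4.0)
```

Only the word needs the slow scan; the cost comes from not knowing where each word ends until the space is found.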

Note: The popular fastText pretrained word vectors (the .vec files) use the word2vec text format.

Dense

This is my new format. It is a binary file where the first 12 bytes are the vocab size, the vector size, and the max length of a word, each stored as an unsigned, little-endian int. The words and vectors follow, with each word padded to the max length and then followed by its vector. This format is a little larger than the word2vec binary format but it is faster to read because the location of each item (both the words and the vectors) can be calculated quickly. It also allows for the possibility of multithreaded reading. This format is still smaller than the normal GloVe format.

This can be read with the read_dense function.
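The offset arithmetic can be sketched as below. Several details are assumptions rather than facts about this library: each record is taken to be the padded word immediately followed by its vector, padding is taken to be null bytes, vector elements are taken to be little-endian 32-bit floats, and `pack_dense` and `read_entry` are hypothetical helper names.

```python
import struct
from typing import List, Sequence, Tuple

# Vocab size, vector size, max word length: three unsigned ints, 12 bytes.
HEADER = struct.Struct("<III")


def pack_dense(words: List[str], vectors: Sequence[Sequence[float]]) -> bytes:
    """Build an in-memory blob using the assumed record layout."""
    dim = len(vectors[0])
    encoded = [w.encode("utf-8") for w in words]
    max_len = max(len(e) for e in encoded)
    out = bytearray(HEADER.pack(len(words), dim, max_len))
    for e, vec in zip(encoded, vectors):
        out += e.ljust(max_len, b"\0")  # assumption: null-byte padding
        out += struct.pack(f"<{dim}f", *vec)  # assumption: float32 values
    return bytes(out)


def read_entry(buf: bytes, index: int) -> Tuple[str, Tuple[float, ...]]:
    """Jump straight to record `index` without scanning earlier records."""
    vocab_size, dim, max_len = HEADER.unpack_from(buf, 0)
    if not 0 <= index < vocab_size:
        raise IndexError(index)
    record_size = max_len + 4 * dim
    start = HEADER.size + index * record_size
    word = buf[start:start + max_len].rstrip(b"\0").decode("utf-8")
    vector = struct.unpack_from(f"<{dim}f", buf, start + max_len)
    return word, vector


blob = pack_dense(["the", "cat", "sat"], [[0.5, 1.0], [0.25, -2.0], [4.0, 0.0]])
print(len(blob))  # 45, i.e. 12 + 3 * (3 + 8)
print(read_entry(blob, 1))  # ('cat', (0.25, -2.0))
```

Because every record has the same size, independent workers could each read a disjoint range of indices, which is what makes multithreaded reading possible.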

Writing

Each format has its own writing function that takes the destination file name, the vocab, and the vectors. The available writers are the following:

  • write_glove
  • write_w2v_text
  • write_w2v
  • write_dense
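To show the shape of such a writer, here is a standard-library sketch that writes the GloVe text layout. It is not the library's `write_glove`: the name `write_glove_sketch` is hypothetical, and it assumes the vocab is a list of words aligned with a list of vectors, which may differ from the types the real writers accept.

```python
import os
import tempfile
from typing import List, Sequence


def write_glove_sketch(
    path: str, vocab: List[str], vectors: Sequence[Sequence[float]]
) -> None:
    """Write one `word v1 v2 ... vn` line per vocabulary item."""
    if len(vocab) != len(vectors):
        raise ValueError("vocab and vectors must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for word, vector in zip(vocab, vectors):
            fields = [word, *(repr(float(v)) for v in vector)]
            f.write(" ".join(fields) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.glove")
    write_glove_sketch(path, ["the", "cat"], [[0.1, 0.2], [0.3, 0.4]])
    with open(path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

The argument order mirrors the description above (destination, vocab, vectors) so that the sketch lines up with the real writers.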
