A dictionary whose values are automatically compressed to save memory.

Development Status
- 4 - Beta
License
- OSI Approved :: GNU General Public License v2 (GPLv2)
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

compressed-dictionary

A dictionary whose values are compressed to save memory. It requires Python 3 and no external libraries.

Is this for you?

The CompressedDictionary is useful when you have a large dictionary whose values are, for example, strings of text, long lists of numbers or strings, or dictionaries with many key-value pairs. Using a CompressedDictionary to store int->int relations makes no sense, since the compressed values would occupy more memory than the original ones.
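
To make the trade-off concrete, the sketch below compares the size of a JSON-encoded value before and after compression. It uses the standard-library json and bz2 modules as a stand-in; that the CompressedDictionary stores values in exactly this way is an assumption, and the helper name `sizes` is hypothetical.

```python
import bz2
import json


def sizes(value):
    """Return (raw, compressed) byte counts for a JSON-encoded value.

    Assumption: JSON + bz2 stands in for what the library does internally.
    """
    raw = json.dumps(value).encode("utf-8")
    return len(raw), len(bz2.compress(raw))


# A long, repetitive value shrinks considerably...
big_raw, big_compressed = sizes(["some repeated text"] * 10_000)
# ...while a single integer grows, because of the fixed compression overhead.
small_raw, small_compressed = sizes(42)

print(big_raw, big_compressed)
print(small_raw, small_compressed)
```

The second pair of numbers shows why int->int mappings are a poor fit: the compressed form of a tiny value is larger than the value itself.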

The CompressedDictionary has some constraints:

- Keys must be integers (max key value is 2^32). You could also use strings or larger integers, but some functionalities may not work out-of-the-box.
- Values must be JSON serializable. This means that values can be integers, booleans, strings, floats and any combination of these types grouped in lists or dictionaries. You can test whether a value is JSON serializable with json.dumps(object).
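
The json.dumps check mentioned above can be wrapped in a small helper. This is only a sketch: the function name `is_json_serializable` is hypothetical and is not part of the package.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False otherwise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular reference
        return False
    return True


print(is_json_serializable({"ids": [1, 2, 3], "ok": True}))  # True
print(is_json_serializable({1, 2, 3}))  # False: sets are not JSON types
```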

Install

Install with:

pip install compressed-dictionary

and remove with:

pip uninstall compressed-dictionary

How to use the `CompressedDictionary`

A CompressedDictionary is a Python dictionary with some enhancements under the hood. When a value is assigned to a key, it is automatically serialized and compressed. Conversely, when a value is read through its key, it is automatically decompressed and deserialized.

>>> from compressed_dictionary.compressed_dictionary import CompressedDictionary
>>>
>>> d = CompressedDictionary()
>>> # OR
>>> d = CompressedDictionary.load("/path/to/file")
>>>
>>> d[0] = {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> # use it like a normal dictionary
>>> # remember that keys are integers (for better compatibility with the integer indexing of PyTorch datasets)
>>> d[0]
{'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}
>>>
>>> for k in d.keys():
...     # do something with d[k]
...     print(k)
...
>>> # OR
>>> for k, value in d.items():
...     print(k, value)  # printing millions of entries is not always a good idea...
...
>>>
>>> # delete an entry (a temporary one, so that d[0] stays available below)
>>> d[1] = {'value_1': [5, 6, 7, 8]}
>>> del d[1]
>>>
>>> # get number of key-value pairs
>>> len(d)
1
>>>
>>> # access compressed data directly: the compressed bytes corresponding to the d[0] value
>>> # (the output shown here is illustrative)
>>> d._content[0]
b'3hbwuchbufbou&RFYUVGBKYU6T76\x00\x00'
>>>
>>> # save the dict to disk
>>> d.dump("/path/to/new/dump.cd")
>>>
>>> # split the dict into a set of smaller ones
>>> d.update((i, d[0]) for i in range(5))
>>> res = d.split(parts=2, reset_keys=True, drop_last=False, shuffle=True)
>>> # Notice: splits are returned as a generator
>>> # Notice: reset_keys ensures that each resulting split has keys from 0 to len(split)-1
>>> # Notice: shuffle shuffles the keys (indexes) before splitting
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}), (2, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The documentation for each method can be found in compressed_dictionary/compressed_dictionary.py.
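
To illustrate the serialize-and-compress behaviour described in this section, here is a minimal sketch of a mapping that compresses on assignment and decompresses on access. The class `TinyCompressedDict` is hypothetical and is not the library's implementation; JSON plus bz2 is an assumption made for the example, and only the `_content` attribute name is borrowed from the session above.

```python
import bz2
import json
from collections.abc import MutableMapping


class TinyCompressedDict(MutableMapping):
    """Hypothetical, simplified stand-in; not the library's implementation."""

    def __init__(self):
        self._content = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        # Serialize to JSON, then compress.
        self._content[key] = bz2.compress(json.dumps(value).encode("utf-8"))

    def __getitem__(self, key):
        # Decompress, then deserialize.
        return json.loads(bz2.decompress(self._content[key]).decode("utf-8"))

    def __delitem__(self, key):
        del self._content[key]

    def __iter__(self):
        return iter(self._content)

    def __len__(self):
        return len(self._content)


d = TinyCompressedDict()
d[0] = {"value_1": [1, 2, 3, 4], "value_3": ["hi", "I", "am", "Luca"]}
print(d[0])                 # the original value, rebuilt on access
print(type(d._content[0]))  # the stored form is a bytes object
```

Inheriting from MutableMapping supplies keys(), items(), update() and the other dictionary methods for free once the five methods above are defined.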

Utilities

We provide some utilities to manage compressed dictionaries from the command line.

Merge

Merge two or more dictionaries into a new one:

python -m compressed_dictionary.utils.merge --input-files <input-dict-1> <input-dict-2> <...> --output-file <resulting-dict>

If the dictionaries have keys in common, you can re-create the key index, from 0 to the sum of the lengths of the dicts, with --reset-keys. If you want the resulting dict to use a different compression algorithm, use --compression <xz|bz2|gzip>.
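
As a rough illustration of what resetting keys means, the sketch below merges plain Python dicts. The function `merge` is hypothetical and is not the library's code. How the real tool treats colliding keys without --reset-keys is not stated above, so raising an error in that case is purely an assumption of this sketch.

```python
def merge(dicts, reset_keys=False):
    """Merge plain dicts; hypothetical sketch, not the library's code."""
    merged = {}
    for d in dicts:
        for key, value in d.items():
            if reset_keys:
                key = len(merged)  # next free index: 0, 1, 2, ...
            elif key in merged:
                # Assumption: refuse to overwrite silently.
                raise KeyError(f"duplicate key {key!r}; consider reset_keys=True")
            merged[key] = value
    return merged


a = {0: "x", 1: "y"}
b = {0: "z"}
print(merge([a, b], reset_keys=True))  # {0: 'x', 1: 'y', 2: 'z'}
```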

Split

Split a dictionary into many sub-dictionaries:

python -m compressed_dictionary.utils.split --input-file <input-dict> --output-folder <resulting-dicts-folder> --parts <number-of-parts>

This will create <number-of-parts> dictionaries in <resulting-dicts-folder>. To specify the length of each split instead, use --parts-length <splits-length> in place of --parts. Use --drop-last if you do not want the last, smaller dict when splitting.

If you want to reset the keys in the new dictionaries, use --reset-keys. If you want to shuffle values before splitting, use --shuffle. Finally, if you want to read only a part of the input dictionary, use --limit <number-of-key-value-pairs-to-read>.
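
The interplay of these options can be sketched on a plain Python dict. The generator `split` below is hypothetical: its parameter names mirror the command-line flags, but the logic is an assumption for illustration and may differ from the library's behaviour (for instance, with `parts` it uses ceiling division and can therefore yield fewer than `parts` pieces).

```python
import random


def split(d, parts=None, parts_length=None, drop_last=False,
          reset_keys=False, shuffle=False, limit=None):
    """Yield sub-dicts of a plain dict; hypothetical sketch only."""
    if parts is None and parts_length is None:
        raise ValueError("give either parts or parts_length")
    keys = list(d)[:limit]  # limit=None keeps every key
    if shuffle:
        random.shuffle(keys)
    if parts_length is None:
        # Ceiling division, so that at most `parts` pieces cover all keys.
        parts_length = max(1, -(-len(keys) // parts))
    for start in range(0, len(keys), parts_length):
        chunk = keys[start:start + parts_length]
        if drop_last and len(chunk) < parts_length:
            break  # skip the final, shorter piece
        if reset_keys:
            yield {i: d[k] for i, k in enumerate(chunk)}
        else:
            yield {k: d[k] for k in chunk}


data = {i: f"value-{i}" for i in range(5)}
print([len(p) for p in split(data, parts=2)])  # [3, 2]
print(list(split(data, parts_length=2, drop_last=True, reset_keys=True)))
```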


Release history

This version

1.2.1

Mar 14, 2021

Download files


Source Distribution

compressed_dictionary-1.2.1.tar.gz (11.4 kB view details)

Uploaded Mar 14, 2021 Source

Built Distribution


compressed_dictionary-1.2.1-py3-none-any.whl (17.4 kB view details)

Uploaded Mar 14, 2021 Python 3

File details

Details for the file compressed_dictionary-1.2.1.tar.gz.

File metadata

Download URL: compressed_dictionary-1.2.1.tar.gz
Upload date: Mar 14, 2021
Size: 11.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/51.3.3 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for compressed_dictionary-1.2.1.tar.gz
Algorithm	Hash digest
SHA256	`d9c0d1893c274d34c524c638fc47f7e5be4704f60cdfcb1098dae0ffb1a8b7d6`
MD5	`8d34b1454f0eb52061190e9ce3a05385`
BLAKE2b-256	`4fc8c8d9e9962ade708bbdbbb8de652167fe0195512dfc84148f8df965d8d293`


File details

Details for the file compressed_dictionary-1.2.1-py3-none-any.whl.

File metadata

Download URL: compressed_dictionary-1.2.1-py3-none-any.whl
Upload date: Mar 14, 2021
Size: 17.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/51.3.3 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for compressed_dictionary-1.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`adfe9f1de2cdbd3adb8cb088daafd1ec6cd2b8b980b6a9d049ffc85790e8245c`
MD5	`1d11ef7b20c8c4945e30a93ea38b7e17`
BLAKE2b-256	`aaed22f3f0e5d90dbaeee597e61883b7f37921084df084d89a402c7f2a88b298`

