A dictionary whose values are automatically compressed to save memory.
compressed-dictionary
A dictionary whose values are compressed to save memory. No external library is required. Python 3 is required.
Is this for you?
The CompressedDictionary is useful when you have a large dictionary whose values are, for example, strings of text, long lists of numbers or strings, or dictionaries with many key-value pairs. Using a CompressedDictionary to store int->int relations makes no sense, since the compression overhead would result in a higher memory footprint than a plain dictionary.
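The trade-off can be checked with the standard library alone. The sketch below is not the library's own code: it assumes values are serialized with json and compressed with bz2 (one of the algorithms named in the utilities section), and the helper name `sizes` is hypothetical. It compares serialized byte counts, not the full in-memory size of Python objects.

```python
import bz2
import json


def sizes(value):
    """Return (raw_bytes, compressed_bytes) for a JSON-serializable value."""
    raw = json.dumps(value).encode("utf-8")
    return len(raw), len(bz2.compress(raw))


# A tiny value: the fixed overhead of the compressed format dominates.
small_raw, small_compressed = sizes(7)

# A long, repetitive list: compression pays off.
large_raw, large_compressed = sizes(list(range(1000)) * 10)

print(f"int:  {small_raw} bytes raw, {small_compressed} bytes compressed")
print(f"list: {large_raw} bytes raw, {large_compressed} bytes compressed")
```

The single integer grows when compressed, while the long list shrinks considerably, which is why only large values benefit.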
The CompressedDictionary has some constraints:
keys must be integers (the maximum key value is 2^32). You could also use strings or larger integers, but some functionality may not work out of the box.
values must be JSON serializable. This means that values can be integers, booleans, strings, floats, and any combination of these types grouped in lists or dictionaries. You can test whether a value is JSON serializable with json.dumps(object).
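The json.dumps check mentioned above can be wrapped in a small helper. This is a minimal sketch using only the standard library; the function name `is_json_serializable` is hypothetical and not part of the package.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False otherwise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular references.
        return False
    return True


print(is_json_serializable({"ids": [1, 2, 3], "ok": True, "score": 0.5}))
print(is_json_serializable({1, 2, 3}))      # sets are not JSON types
print(is_json_serializable(b"raw bytes"))   # neither are bytes
```

Keep in mind that a round trip through the standard json module turns tuples into lists and non-string dictionary keys into strings, so a value may pass this check and still come back in a slightly different shape.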
Install
Install with:
pip install compressed-dictionary
and remove with:
pip uninstall compressed-dictionary
How to use the CompressedDictionary
A CompressedDictionary is a Python dictionary with some enhancements under the hood. When assigning a value to a key, the value is automatically serialized and compressed. The reverse happens when a value is read with its key: it is decompressed and deserialized.
>>> from compressed_dictionary.compressed_dictionary import CompressedDictionary
>>>
>>> d = CompressedDictionary()
>>> # OR
>>> d = CompressedDictionary.load("/path/to/file")
>>>
>>> d[0] = {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> # use it like a normal dictionary
>>> # remember that keys are integers (to be better compatible with pytorch dataset indexing with integers)
>>> d[0]
{'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> for k in d.keys():
...     # do something with d[k]
...     print(k)
...
>>> # OR
>>> for k, value in d.items():
...     print(k, value)  # printing millions of entries is not always a good idea...
...
>>>
>>> # delete an entry (a temporary one here, so that d[0] stays available below)
>>> d[1] = "temporary value"
>>> del d[1]
>>>
>>> # get number of key-value pairs
>>> len(d)
1
>>>
>>> # access compressed data directly
>>> d._content[0]
b"3hbwuchbufbou&RFYUVGBKYU6T76\x00\x00" # the compressed byte array corresponding to the d[0] value
>>>
>>> # save the dict to disk
>>> d.dump("/path/to/new/dump.cd")
>>>
>>> # split the dict in a set of smaller ones
>>> d.update((i, d[0]) for i in range(5))
>>> res = d.split(parts=2, reset_keys=True, drop_last=False, shuffle=True)
>>> # Notice: splits are returned as a generator
>>> # Notice: reset_keys will ensure that each resulting split has keys from 0 to len(split)-1
>>> # Notice: shuffle will shuffle keys (indexes) before splitting
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (2, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
The documentation for each method can be found in compressed_dictionary/compressed_dictionary.py.
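To make the serialize-and-compress behavior concrete, here is a toy mapping built on the standard library. It is only an illustration under stated assumptions (json for serialization, bz2 for compression), not the package's actual implementation; the class name `ToyCompressedDict` is hypothetical, and the `_content` attribute simply mirrors the name used in the session above.

```python
import bz2
import json
from collections.abc import MutableMapping


class ToyCompressedDict(MutableMapping):
    """Hypothetical illustration: values are stored as compressed JSON bytes."""

    def __init__(self):
        self._content = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        # Serialize to JSON, then compress the resulting bytes.
        self._content[key] = bz2.compress(json.dumps(value).encode("utf-8"))

    def __getitem__(self, key):
        # Decompress, then deserialize back into Python objects.
        return json.loads(bz2.decompress(self._content[key]).decode("utf-8"))

    def __delitem__(self, key):
        del self._content[key]

    def __iter__(self):
        return iter(self._content)

    def __len__(self):
        return len(self._content)


toy = ToyCompressedDict()
toy[0] = {"value_1": [1, 2, 3, 4], "value_3": ["hi", "I", "am", "Luca"]}
print(toy[0])                 # the original value, rebuilt on access
print(type(toy._content[0]))  # the stored form is a bytes object
```

Because every read decompresses and deserializes the value again, this design trades CPU time on access for a smaller memory footprint.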
Utilities
We provide some utilities to manage compressed dictionaries from the command line.
Merge
Merge two or more dictionaries into a new one:
python -m compressed_dictionary.utils.merge --input-files <input-dict-1> <input-dict-2> <...> --output-file <resulting-dict>
If the dictionaries have keys in common, use --reset-keys to re-create the key index, numbering entries from 0 up to the sum of the lengths of the dicts.
If you want the resulting dict to use a different compression algorithm, use --compression <xz|bz2|gzip>.
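The effect of resetting keys during a merge can be sketched on plain Python dicts. This is an illustration of the idea, not the utility's code: the function `merge_dicts` is hypothetical, and treating colliding keys as an error when keys are not reset is an assumption, since the behavior in that case is not described here.

```python
def merge_dicts(dicts, reset_keys=False):
    """Merge plain dicts; optionally renumber keys from 0 upward."""
    merged = {}
    for d in dicts:
        for key, value in d.items():
            if reset_keys:
                # Ignore the original key and use the next free index.
                merged[len(merged)] = value
            elif key in merged:
                # Assumption: colliding keys are treated as an error here.
                raise KeyError(f"key {key!r} appears in more than one input")
            else:
                merged[key] = value
    return merged


first = {0: "a", 1: "b"}
second = {0: "c", 1: "d", 2: "e"}
print(merge_dicts([first, second], reset_keys=True))
```

With key resetting enabled, the five values end up under consecutive integer keys starting at 0, regardless of the keys they had in the inputs.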
Split
Split a dictionary into many sub-dictionaries:
python -m compressed_dictionary.utils.split --input-file <input-dict> --output-folder <resulting-dicts-folder> --parts <number-of-parts>
This will create <number-of-parts> dictionaries in <resulting-dicts-folder>. If you want to specify the length of the splits instead, use --parts-length <splits-length> in place of --parts. Use --drop-last if you don't want the last, smaller dict when splitting.
If you want to reset the keys in the new dictionaries, use --reset-keys. If you want to shuffle values before splitting, use --shuffle. Finally, if you want to read only part of the input dictionary, use --limit <number-of-key-value-pairs-to-read>.
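How these options interact can be sketched on a plain dict. The function `split_dict` below is hypothetical and only mimics the options described above; in particular, sizing each part as the ceiling of length / parts is an assumption, chosen because it matches the usage example where 5 entries split in 2 parts give 3 and 2 entries.

```python
import math
import random


def split_dict(d, parts, reset_keys=False, drop_last=False, shuffle=False):
    """Yield sub-dicts of a plain dict, one per part."""
    keys = list(d.keys())
    if not keys:
        return
    if shuffle:
        random.shuffle(keys)
    # Assumption: every part but the last holds ceil(len / parts) entries.
    part_length = math.ceil(len(keys) / parts)
    for start in range(0, len(keys), part_length):
        chunk = keys[start:start + part_length]
        if drop_last and len(chunk) < part_length:
            break  # skip the shorter trailing part
        if reset_keys:
            # Renumber keys so each part goes from 0 to len(part) - 1.
            yield {new_key: d[old_key] for new_key, old_key in enumerate(chunk)}
        else:
            yield {key: d[key] for key in chunk}


data = {i: f"value {i}" for i in range(5)}
for part in split_dict(data, parts=2, reset_keys=True):
    print(part)
```

Like the split method shown earlier, the sketch is a generator, so parts are produced one at a time instead of being held in memory together.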
Source Distribution
Built Distribution
Hashes for compressed_dictionary-1.2.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | d9c0d1893c274d34c524c638fc47f7e5be4704f60cdfcb1098dae0ffb1a8b7d6
MD5 | 8d34b1454f0eb52061190e9ce3a05385
BLAKE2b-256 | 4fc8c8d9e9962ade708bbdbbb8de652167fe0195512dfc84148f8df965d8d293
Hashes for compressed_dictionary-1.2.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | adfe9f1de2cdbd3adb8cb088daafd1ec6cd2b8b980b6a9d049ffc85790e8245c
MD5 | 1d11ef7b20c8c4945e30a93ea38b7e17
BLAKE2b-256 | aaed22f3f0e5d90dbaeee597e61883b7f37921084df084d89a402c7f2a88b298