A dictionary whose values are automatically compressed to save memory.
compressed-dictionary
A dictionary whose values are compressed to save memory. It requires Python 3 and no external libraries.
Is this for you?
The CompressedDictionary is useful when you have a large dictionary whose values are, for example, strings of text, long lists of numbers or strings, or dictionaries with many key-value pairs. Using a CompressedDictionary to store int->int relations makes no sense, since it would occupy more memory than a plain dictionary.
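To see why tiny values do not benefit, the sketch below compares the size of a JSON-encoded value before and after compression. It assumes JSON serialization followed by `bz2`, which may not match what the library does internally; `sizes` is a hypothetical helper written only for this illustration.

```python
import bz2
import json


def sizes(value):
    """Return (raw_bytes, compressed_bytes) for a JSON-serializable value."""
    raw = json.dumps(value).encode("utf-8")
    return len(raw), len(bz2.compress(raw))


# A single small integer: the compressor's fixed overhead dominates.
small_raw, small_compressed = sizes(42)

# A long, repetitive list of strings: compression pays off.
large_raw, large_compressed = sizes(["the quick brown fox"] * 1000)

print("small value:", small_raw, "->", small_compressed, "bytes")
print("large value:", large_raw, "->", large_compressed, "bytes")
```

The small value grows after compression, while the large one shrinks considerably. The exact numbers depend on the codec and on the data.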
The CompressedDictionary has some constraints:
- keys must be integers (the maximum key value is 2^32). You could also use strings or larger integers, but some functionalities may not work out of the box.
- values must be JSON serializable. This means that values can be integers, booleans, strings, floats and any combination of these types grouped in lists or dictionaries. You can test whether a value is JSON serializable with json.dumps(object).
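The `json.dumps` check can be wrapped in a small helper. This is a minimal sketch using only the standard library; `is_json_serializable` is a hypothetical name and is not part of the package.

```python
import json


def is_json_serializable(value):
    """Hypothetical helper: True if json.dumps accepts the value."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        # TypeError: unsupported type (e.g. set, bytes, custom objects).
        # ValueError: circular references.
        return False
    return True


ok = is_json_serializable({"a": [1, 2.5, True, "text"]})
not_ok = is_json_serializable({1, 2, 3})  # sets are not JSON serializable
print(ok, not_ok)
```

Keep in mind that the standard `json` module turns tuples into lists and non-string keys of nested dictionaries into strings, so a value that passes this check may not come back with exactly the same types.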
Install
Install with:
pip install compressed-dictionary
and remove with:
pip uninstall compressed-dictionary
How to use the CompressedDictionary
A CompressedDictionary is a Python dictionary with some enhancements under the hood. When a value is assigned to a key, it is automatically serialized and compressed. The reverse happens when a value is read with a key: it is decompressed and deserialized.
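The idea can be illustrated with a few lines of standard-library code. This is a sketch of the general technique, not the library's implementation: the class name `TinyCompressedDict` is hypothetical, and the choice of JSON plus `bz2` is an assumption. Only the `_content` attribute name is borrowed from the session below.

```python
import bz2
import json


class TinyCompressedDict:
    """Illustrative sketch only: compress on write, decompress on read."""

    def __init__(self):
        self._content = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        # Serialize to JSON, then compress the resulting bytes.
        self._content[key] = bz2.compress(json.dumps(value).encode("utf-8"))

    def __getitem__(self, key):
        # Decompress, then deserialize back to Python objects.
        return json.loads(bz2.decompress(self._content[key]).decode("utf-8"))

    def __delitem__(self, key):
        del self._content[key]

    def __len__(self):
        return len(self._content)


sample = {"value_1": [1, 2, 3, 4], "value_3": ["hi", "I", "am", "Luca"]}
tiny = TinyCompressedDict()
tiny[0] = sample
print(type(tiny._content[0]).__name__)  # the stored object is compressed bytes
print(tiny[0] == sample)                # reading restores the original value
```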
>>> from compressed_dictionary import CompressedDictionary
>>>
>>> d = CompressedDictionary()
>>> # OR
>>> d = CompressedDictionary.load("/path/to/file")
>>>
>>> d[0] = {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> # use it like a normal dictionary
>>> # remember that keys are integers (to be better compatible with pytorch dataset indexing with integers)
>>> d[0]
{'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> for k in d.keys():
...     # do something with d[k]
...     print(k)
...
>>> # OR
>>> for k, value in d.items():
...     print(k, value)  # printing millions of entries is not always a good idea...
...
>>>
>>> # delete an entry (a throwaway one, so that d[0] stays available below)
>>> d[1] = {'value_1': [5, 6, 7, 8]}
>>> del d[1]
>>>
>>> # get number of key-value pairs
>>> len(d)
1
>>>
>>> # access compressed data directly
>>> d._content[0]
b"3hbwuchbufbou&RFYUVGBKYU6T76\x00\x00" # the compressed byte array corresponding to the d[0] value
>>>
>>> # save the dict to disk
>>> d.dump("/path/to/new/dump.cd")
>>>
>>> # split the dict in a set of smaller ones
>>> d.update((i, d[0]) for i in range(5))
>>> res = d.split(parts=2, reset_keys=True, drop_last=False, shuffle=True)
>>> # Notice: splits are returned as a generator
>>> # Notice: reset_keys will ensure that each resulting split has keys from 0 to len(split)-1
>>> # Notice: shuffle will shuffle keys (indexes) before splitting
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (2, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
The documentation for each method can be found in compressed_dictionary/compressed_dictionary.py.
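The effect of the split options can be reproduced on a plain dict. The sketch below is not the library's code: `split_dict` and its `seed` parameter are hypothetical, and the chunk size (the number of items divided by `parts`, rounded up) is an assumption inferred from the 3 + 2 split in the session above.

```python
import math
import random


def split_dict(data, parts, reset_keys=False, drop_last=False, shuffle=False, seed=None):
    """Hypothetical helper: lazily yield up to `parts` sub-dictionaries of a plain dict."""
    keys = list(data)
    if not keys:
        return
    if shuffle:
        # Shuffle the keys (indexes) before splitting.
        random.Random(seed).shuffle(keys)
    size = math.ceil(len(keys) / parts)  # assumed chunk size
    for start in range(0, len(keys), size):
        chunk = keys[start:start + size]
        if drop_last and len(chunk) < size:
            break  # skip the last, smaller split
        if reset_keys:
            # Renumber keys from 0 to len(split) - 1.
            yield {i: data[k] for i, k in enumerate(chunk)}
        else:
            yield {k: data[k] for k in chunk}


data = {i: f"value-{i}" for i in range(5)}
splits = list(split_dict(data, parts=2, reset_keys=True))
print([len(s) for s in splits])
print([list(s.keys()) for s in splits])
```

Like the real method, the helper is a generator, so splits are produced one at a time instead of being materialized together.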
Utilities
We provide some utilities to manage compressed dictionaries from the command line.
Merge
Merge two or more dictionaries into a new one:
python -m compressed_dictionary.utils.merge --input-files <input-dict-1> <input-dict-2> <...> --output-file <resulting-dict>
If the dictionaries have keys in common, you can re-create the key index, from 0 to the sum of the lengths of the dicts, by using --reset-keys.
If you want the resulting dict to use a different compression algorithm, use --compression <xz|bz2|gzip>.
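What key resetting achieves can be shown on plain dicts. This is a sketch under stated assumptions, not the utility's code: `merge_dicts` is a hypothetical helper, and raising an error on overlapping keys when keys are not reset is a choice made for this example; the behavior of the real command in that case is not documented here.

```python
from itertools import chain


def merge_dicts(dicts, reset_keys=False):
    """Hypothetical helper: merge plain dicts, optionally renumbering keys from 0."""
    if reset_keys:
        # Concatenate all values in order and assign fresh consecutive keys.
        values = chain.from_iterable(d.values() for d in dicts)
        return dict(enumerate(values))
    merged = {}
    for d in dicts:
        overlap = merged.keys() & d.keys()
        if overlap:
            # Assumption for this sketch: refuse to silently overwrite entries.
            raise KeyError(f"duplicate keys: {sorted(overlap)}")
        merged.update(d)
    return merged


first = {0: "a", 1: "b"}
second = {0: "c", 1: "d"}
merged = merge_dicts([first, second], reset_keys=True)
print(merged)
```

With `reset_keys=True` the two inputs above, which share the keys 0 and 1, end up as four entries with keys 0 to 3.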
Split
Split a dictionary into several sub-dictionaries:
python -m compressed_dictionary.utils.split --input-file <input-dict> --output-folder <resulting-dicts-folder> --parts <number-of-parts>
This will create <number-of-parts> dictionaries in <resulting-dicts-folder>. If you want to specify the length of the splits instead, use --parts-length <splits-length> in place of --parts. Use --drop-last if you don't want the last, smaller dict when splitting.
If you want to reset the keys in the new dictionaries, use --reset-keys. If you want to shuffle values before splitting, use --shuffle. Finally, if you want to read only a part of the input dictionary, use --limit <number-of-key-value-pairs-to-read>.