A dictionary whose values are automatically compressed to save memory.
compressed-dictionary
A dictionary whose values are compressed to save memory. No external libraries are required; only Python 3 is needed.
Is this for you?
The CompressedDictionary is useful when you have a large dictionary whose values are, for example, strings of text, long lists of numbers or strings, or dictionaries with many key-value pairs. Using a CompressedDictionary to store int->int relations makes no sense, since the compression overhead would make it occupy more memory than a plain dictionary.
The CompressedDictionary has some constraints:
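To get a feel for when compression pays off, the following sketch compares the size of a JSON-serialized value before and after compression. It uses only the standard library; `bz2` is chosen here as a stand-in (it is one of the algorithms named for the merge utility) and the `sizes` helper is hypothetical, so this is not necessarily what the library does internally.

```python
import bz2
import json


def sizes(value):
    """Return (raw, compressed) byte lengths of a JSON-serialized value."""
    raw = json.dumps(value).encode("utf-8")
    return len(raw), len(bz2.compress(raw))


# A tiny value: the compression overhead dominates, so the result is larger.
small_raw, small_compressed = sizes(7)

# A long, repetitive list: compression pays off.
large_raw, large_compressed = sizes(list(range(1000)) * 10)

print(f"small: {small_raw} -> {small_compressed} bytes")
print(f"large: {large_raw} -> {large_compressed} bytes")
```

The small value grows once compressed, while the long list shrinks considerably, which is why int->int mappings are a poor fit.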
- keys must be integers (max key value is 2^32). You could also use strings or larger integers, but some functionalities may not work out-of-the-box.
- values must be json serializable. This means that values can be integers, booleans, strings, floats and any combination of these types grouped in lists or dictionaries. You can test whether a value is json serializable with json.dumps(object).
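The json.dumps check mentioned above can be wrapped in a small helper. This is a minimal sketch using only the standard library; `is_json_serializable` is a hypothetical name, not part of the package.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps accepts the value (hypothetical helper)."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


ok = is_json_serializable({"tokens": ["hi", "there"], "ids": [1, 2], "flag": True})
bad = is_json_serializable({"ids": {1, 2, 3}})  # sets are not JSON serializable

# A JSON round trip is not always exact: tuples come back as lists
# and integer dictionary keys come back as strings.
round_trip = json.loads(json.dumps({"pair": (1, 2), "by_id": {1: "a"}}))
```

Keep the round-trip behaviour in mind when choosing value types: anything that passes through JSON is likely to come back with lists instead of tuples and string keys instead of integer keys in nested dictionaries.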
Install
Install with:
pip install compressed-dictionary
and remove with:
pip uninstall compressed-dictionary
How to use the CompressedDictionary
A CompressedDictionary is a Python dictionary with some enhancements under the hood. When a value is assigned to a key, it is automatically serialized and compressed. Likewise, when a value is read by key, it is automatically decompressed and deserialized.
>>> from compressed_dictionary import CompressedDictionary
>>>
>>> d = CompressedDictionary()
>>> # OR
>>> d = CompressedDictionary.load("/path/to/file")
>>>
>>> d[0] = {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> # use it like a normal dictionary
>>> # remember that keys are integers (for better compatibility with PyTorch dataset indexing)
>>> d[0]
{'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}
>>>
>>> for k in d.keys():
...     # do something with d[k]
...     print(k)
...
>>> # OR
>>> for k, value in d.items():
...     print(k, value)  # printing millions of entries is not always a good idea...
...
>>>
>>> # add a second entry, then delete it
>>> d[1] = [1, 2, 3]
>>> del d[1]
>>>
>>> # get number of key-value pairs
>>> len(d)
1
>>>
>>> # access compressed data directly
>>> d._content[0]
b"3hbwuchbufbou&RFYUVGBKYU6T76\x00\x00" # the compressed byte array corresponding to the d[0] value
>>>
>>> # save the dict to disk
>>> d.dump("/path/to/new/dump.cd")
>>>
>>> # split the dict into a set of smaller ones
>>> d.update((i, d[0]) for i in range(5))
>>> res = d.split(parts=2, reset_keys=True, drop_last=False, shuffle=True)
>>> # Note: splits are returned as a generator
>>> # Note: reset_keys ensures that each resulting split has keys from 0 to len(split)-1
>>> # Note: shuffle shuffles the keys (indexes) before splitting
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}), (2, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ['hi', 'I', 'am', 'Luca'], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
The documentation for each method can be found in compressed_dictionary/compressed_dictionary.py.
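To illustrate the compress-on-write, decompress-on-read behaviour described above, here is a toy mapping built only on the standard library. It is a sketch under stated assumptions, not the library's implementation: the class name `ToyCompressedDict` is hypothetical, `bz2` and JSON are assumed as the compression and serialization steps, and the `_content` attribute simply mirrors the name used in the example above.

```python
import bz2
import json
from collections.abc import MutableMapping


class ToyCompressedDict(MutableMapping):
    """Toy mapping that stores each value as compressed JSON bytes.

    Hypothetical illustration only; not the library's implementation.
    """

    def __init__(self):
        self._content = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        # Serialize to JSON, then compress the resulting bytes.
        self._content[key] = bz2.compress(json.dumps(value).encode("utf-8"))

    def __getitem__(self, key):
        # Decompress, then parse the JSON back into Python objects.
        return json.loads(bz2.decompress(self._content[key]).decode("utf-8"))

    def __delitem__(self, key):
        del self._content[key]

    def __iter__(self):
        return iter(self._content)

    def __len__(self):
        return len(self._content)


d = ToyCompressedDict()
d[0] = {"value_1": [1, 2, 3, 4], "value_3": ["hi", "I", "am", "Luca"]}
restored = d[0]  # transparently decompressed and deserialized
```

Subclassing MutableMapping provides keys(), items(), update() and the other dictionary methods for free once the five methods above are defined, which is one simple way to get dictionary-like behaviour around compressed storage.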
Utilities
We provide some utilities to manage compressed dictionaries from the command line.
Merge
Merge two or more dictionaries into a new one:
python -m compressed_dictionary.utils.merge --input-files <input-dict-1> <input-dict-2> <...> --output-file <resulting-dict>
If dictionaries have common keys, you can re-create the key index from 0 to the sum of the lengths of the dicts by using --reset-keys.
If you want the resulting dict to use a different compression algorithm, use --compression <xz|bz2|gzip>.
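The effect of --reset-keys can be sketched on plain Python dictionaries. The `merge_dicts` helper below is hypothetical and is not the code behind the command-line tool; in particular, raising an error on common keys when keys are not reset is an assumption made for this sketch.

```python
def merge_dicts(dicts, reset_keys=False):
    """Merge plain dicts in order (hypothetical helper, not the CLI's code)."""
    merged = {}
    for d in dicts:
        for key, value in d.items():
            if reset_keys:
                # Re-index consecutively: 0 .. total_length - 1.
                merged[len(merged)] = value
            elif key in merged:
                # Assumed behaviour for this sketch: refuse to overwrite.
                raise KeyError(f"duplicate key {key!r}; consider reset_keys=True")
            else:
                merged[key] = value
    return merged


a = {0: "first", 1: "second"}
b = {0: "third", 1: "fourth"}
merged = merge_dicts([a, b], reset_keys=True)
```

With reset_keys=True the two inputs above, which share the keys 0 and 1, end up under the consecutive keys 0 to 3.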
Split
Split a dictionary into several sub-dictionaries:
python -m compressed_dictionary.utils.split --input-file <input-dict> --output-folder <resulting-dicts-folder> --parts <number-of-parts>
This will create <number-of-parts> dictionaries in <resulting-dicts-folder>. If you want to specify the length of the splits, you can use --parts-length <splits-length> instead of --parts. Use --drop-last if you don't want the last, smaller dict when splitting.
If you want to reset the keys in the new dictionaries, use --reset-keys. If you want to shuffle values before splitting, use --shuffle. Finally, if you want to read only a part of the input dictionary, use --limit <number-of-key-value-pairs-to-read>.
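How the split options interact can be sketched on a plain dictionary. The `split_dict` function below is a hypothetical illustration, not the tool's code: the rule that parts have length ceil(n / parts) is inferred from the 5-entry example earlier (which yields splits of 3 and 2 entries), applying the limit before shuffling is an assumption, and the `seed` parameter exists only to make the sketch reproducible.

```python
import math
import random


def split_dict(data, parts=None, parts_length=None, drop_last=False,
               reset_keys=False, shuffle=False, limit=None, seed=None):
    """Yield sub-dictionaries of a plain dict (hypothetical sketch)."""
    keys = list(data)
    if limit is not None:
        keys = keys[:limit]  # read only the first `limit` pairs
    if not keys:
        return
    if shuffle:
        random.Random(seed).shuffle(keys)
    if parts_length is None:
        # Assumed rule: equal-sized parts, with the last one possibly smaller.
        parts_length = math.ceil(len(keys) / parts)
    for start in range(0, len(keys), parts_length):
        chunk = keys[start:start + parts_length]
        if drop_last and len(chunk) < parts_length:
            break  # skip the trailing, smaller part
        if reset_keys:
            yield {i: data[k] for i, k in enumerate(chunk)}
        else:
            yield {k: data[k] for k in chunk}


data = {i: f"value-{i}" for i in range(5)}
splits = list(split_dict(data, parts=2, reset_keys=True))
```

Like the split method shown earlier, the sketch is a generator, so the parts are produced one at a time instead of being held in memory together.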
Download files
File details
Details for the file compressed_dictionary-1.2.1.tar.gz.
File metadata
- Download URL: compressed_dictionary-1.2.1.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/51.3.3 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d9c0d1893c274d34c524c638fc47f7e5be4704f60cdfcb1098dae0ffb1a8b7d6 |
| MD5 | 8d34b1454f0eb52061190e9ce3a05385 |
| BLAKE2b-256 | 4fc8c8d9e9962ade708bbdbbb8de652167fe0195512dfc84148f8df965d8d293 |
File details
Details for the file compressed_dictionary-1.2.1-py3-none-any.whl.
File metadata
- Download URL: compressed_dictionary-1.2.1-py3-none-any.whl
- Upload date:
- Size: 17.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/51.3.3 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | adfe9f1de2cdbd3adb8cb088daafd1ec6cd2b8b980b6a9d049ffc85790e8245c |
| MD5 | 1d11ef7b20c8c4945e30a93ea38b7e17 |
| BLAKE2b-256 | aaed22f3f0e5d90dbaeee597e61883b7f37921084df084d89a402c7f2a88b298 |