A dictionary whose values are automatically compressed to save memory.

compressed-dictionary

A dictionary whose values are compressed to save memory. It has no external dependencies and requires Python 3.

Is this for you?

The CompressedDictionary is useful when you have a large dictionary whose values are, for example, strings of text, long lists of numbers or strings, or dictionaries with many key-value pairs. Using a CompressedDictionary to store int->int relations makes no sense, because the compressed values would occupy more memory than the original integers.
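
A quick way to see why small values do not benefit is to compare the size of a JSON-serialized value before and after compression. The sketch below uses only the standard library. Choosing bz2 is an assumption made for illustration, and the numbers it prints say nothing about the exact overhead of the CompressedDictionary itself.

```python
import bz2
import json


def sizes(value):
    """Return (raw, compressed) byte counts for a JSON-serialized value."""
    raw = json.dumps(value).encode("utf-8")
    return len(raw), len(bz2.compress(raw))


# A long, repetitive text value shrinks a lot once compressed.
long_text = "the quick brown fox jumps over the lazy dog " * 1000
text_raw, text_compressed = sizes(long_text)

# A single small integer grows: the compressed stream carries fixed overhead.
int_raw, int_compressed = sizes(42)

print(f"long text: {text_raw} -> {text_compressed} bytes")
print(f"small int: {int_raw} -> {int_compressed} bytes")
```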

The CompressedDictionary has some constraints:

  • keys must be integers (max key value is 2^32). You could also use strings or larger integers, but some functionality may not work out of the box.
  • values must be JSON serializable. This means that values can be integers, booleans, strings, floats and any combination of these types grouped in lists or dictionaries. You can test whether a value is JSON serializable with json.dumps(object).
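
The json.dumps check mentioned above can be wrapped in a small helper. This is only a sketch: is_json_serializable is a hypothetical name and is not part of the package.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False otherwise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular reference.
        return False
    return True


print(is_json_serializable({"a": [1, 2.5, "x", True]}))
print(is_json_serializable({1, 2, 3}))  # sets are not JSON serializable
```

Keep in mind that a value passing this check is not guaranteed to come back identical after a round trip through the json module: tuples are returned as lists, and non-string keys of nested dictionaries are returned as strings.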

Install

Install with:

pip install compressed-dictionary

and remove with:

pip uninstall compressed-dictionary

How to use the CompressedDictionary

A CompressedDictionary is a Python dictionary with some enhancements under the hood. When a value is assigned to a key, it is automatically serialized and compressed. The reverse happens when a value is read back with its key: it is decompressed and deserialized.

>>> from compressed_dictionary import CompressedDictionary
>>>
>>> d = CompressedDictionary()
>>> # OR
>>> d = CompressedDictionary.load("/path/to/file")
>>>
>>> d[0] = {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> # use it like a normal dictionary
>>> # remember that keys are integers (to be better compatible with pytorch dataset indexing with integers)
>>> d[0]
{'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}
>>>
>>> for k in d.keys():
...     # do something with d[k]
...     print(k)
>>> # OR
>>> for k, value in d.items():
...     print(k, value)  # printing millions of entries is not always a good idea...
>>>
>>> # delete an entry (a throwaway key is used so that d[0] stays available below)
>>> d[1] = "temporary value"
>>> del d[1]
>>>
>>> # get number of key-value pairs
>>> len(d)
1
>>>
>>> # access compressed data directly
>>> d._content[0]
b"3hbwuchbufbou&RFYUVGBKYU6T76\x00\x00" # the compressed byte array corresponding to the d[0] value
>>>
>>> # save the dict to disk
>>> d.dump("/path/to/new/dump.cd")
>>>
>>> # split the dict in a set of smaller ones
>>> d.update((i, d[0]) for i in range(5))
>>> res = d.split(parts=2, reset_keys=True, drop_last=False, shuffle=True) 
>>> # Notice: splits are returned as a generator
>>> # Notice: reset_keys will ensure that each resulting split has keys from 0 to len(split)-1
>>> # Notice: shuffle will shuffle keys (indexes) before splitting
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (2, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
[(0, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]}), (1, {'value_1': [1, 2, 3, 4], 'value_2': [1.0, 1.0, 1.0, 1.0], 'value_3': ["hi", "I", "am", "Luca"], 'value_4': [True, False, True, True]})]
>>>
>>> list(next(res).items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The documentation for each method can be found in compressed_dictionary/compressed_dictionary.py.
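
To make the serialize-and-compress behaviour described at the start of this section more concrete, here is a minimal sketch of the idea. It is not the library's implementation: TinyCompressedDict is a hypothetical class, and the use of JSON with bz2 is an assumption chosen for illustration.

```python
import bz2
import json
from collections.abc import MutableMapping


class TinyCompressedDict(MutableMapping):
    """Hypothetical stand-in: stores each value as compressed JSON bytes."""

    def __init__(self):
        self._content = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        # Serialize to JSON, then compress, on every assignment.
        self._content[key] = bz2.compress(json.dumps(value).encode("utf-8"))

    def __getitem__(self, key):
        # Decompress, then deserialize, on every lookup.
        return json.loads(bz2.decompress(self._content[key]).decode("utf-8"))

    def __delitem__(self, key):
        del self._content[key]

    def __iter__(self):
        return iter(self._content)

    def __len__(self):
        return len(self._content)


d = TinyCompressedDict()
d[0] = {"value_1": [1, 2, 3, 4], "value_3": ["hi", "I", "am", "Luca"]}
print(d[0])                 # the original value, rebuilt on access
print(type(d._content[0]))  # <class 'bytes'>
```

In this sketch every lookup decompresses and parses the value again, so reads are slower than with a plain dict. That cost is the price paid for the lower memory footprint.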

Utilities

We provide some utilities to manage compressed dictionaries from the command line.

Merge

Merge two or more dictionaries into a new one:

python -m compressed_dictionary.utils.merge --input-files <input-dict-1> <input-dict-2> <...> --output-file <resulting-dict>

If the dictionaries have keys in common, use --reset-keys to re-create the key index so that keys run from 0 up to the sum of the lengths of the dicts (exclusive). If you want the resulting dict to use a different compression algorithm, use --compression <xz|bz2|gzip>.
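
The effect of --reset-keys can be pictured with plain dictionaries. The function below is a hypothetical sketch (merge_dicts and its parameters are not part of the package). It assumes that, without resetting, a key present in several inputs is treated as an error, which may differ from what the command-line tool actually does.

```python
def merge_dicts(dicts, reset_keys=False):
    """Merge plain dicts; optionally renumber keys from 0 upwards."""
    merged = {}
    for d in dicts:
        for key, value in d.items():
            # With reset_keys, the next free index replaces the original key.
            new_key = len(merged) if reset_keys else key
            if new_key in merged:
                raise KeyError(f"duplicate key {key!r}; consider reset_keys=True")
            merged[new_key] = value
    return merged


a = {0: "first", 1: "second"}
b = {0: "third", 1: "fourth"}

print(merge_dicts([a, b], reset_keys=True))
# {0: 'first', 1: 'second', 2: 'third', 3: 'fourth'}
```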

Split

Split a dictionary into several sub-dictionaries:

python -m compressed_dictionary.utils.split --input-file <input-dict> --output-folder <resulting-dicts-folder> --parts <number-of-parts>

This will create <number-of-parts> dictionaries in <resulting-dicts-folder>. To specify the length of each split instead, use --parts-length <splits-length> in place of --parts. Use --drop-last if you don't want the last, smaller dict when splitting.

If you want to reset the keys in the new dictionaries, use --reset-keys. If you want to shuffle values before splitting, use --shuffle. Finally, if you want to read only a part of the input dictionary, use --limit <number-of-key-value-pairs-to-read>.
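
The interaction between these options can be sketched with plain dictionaries. The function below is hypothetical (split_dict and its seed parameter are not part of the package). It assumes each part holds ceil(len / parts) entries, which matches the 3 + 2 split of five entries shown in the usage example above.

```python
import math
import random


def split_dict(d, parts, reset_keys=False, drop_last=False, shuffle=False, seed=None):
    """Yield sub-dicts of a plain dict, each with at most ceil(len(d) / parts) items."""
    keys = list(d)
    if not keys:
        return
    if shuffle:
        # Shuffle the keys (not the values) before cutting them into parts.
        random.Random(seed).shuffle(keys)
    part_length = math.ceil(len(keys) / parts)
    for start in range(0, len(keys), part_length):
        chunk = keys[start:start + part_length]
        if drop_last and len(chunk) < part_length:
            return  # skip the final, shorter part
        if reset_keys:
            # Renumber keys so each part runs from 0 to len(part) - 1.
            yield {i: d[k] for i, k in enumerate(chunk)}
        else:
            yield {k: d[k] for k in chunk}


data = {i: f"value {i}" for i in range(5)}
for part in split_dict(data, parts=2, reset_keys=True):
    print(part)
# {0: 'value 0', 1: 'value 1', 2: 'value 2'}
# {0: 'value 3', 1: 'value 4'}
```

Like the split method in the usage example, this sketch returns a generator, so parts are produced one at a time rather than all held in memory at once.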
