Skip to main content

Injectively, deterministically maps arbitrary objects to hashable values

Project description

PyPI Package PyPI Downloads License Python Versions libraries.io sourcerank GitHub stars CI status Code Coverage GitHub last commit Checked with Mypy Code style: black

Injectively, deterministically maps arbitrary objects to hashable, immutable values

Quickstart

If you don’t have pip installed, see the pip install guide.

$ pip install charmonium.freeze

For a related project, charmonium.cache, I needed a function that deterministically, injectively maps objects to hashable objects.

  • “Injectively” means freeze(a) == freeze(b) implies a == b (with the precondition that a and b are of the same type).

  • “Deterministically” means it should return the same value across subsequent process invocations (with the same interpreter major and minor version), unlike Python’s hash function, which is not deterministic between processes.

  • “Hashable” means one can call hash(...) on it. All hashable values are immutable.

Have you ever felt like you wanted to “freeze” a list of arbitrary data into a hashable value? Now you can.

>>> obj = [1, 2, 3, {4, 5, 6}, object()]
>>> hash(obj)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> from charmonium.freeze import freeze
>>> freeze(obj)
9561766455304166758

Configuration

By changing the configuration, we can see the exact data that gets hashed.

We can change the configuration in a few ways:

  • Object-oriented (preferred)

    >>> from charmonium.freeze import Config
    >>> freeze(obj, Config(use_hash=False))
    (1, 2, 3, frozenset({4, 5, 6}), ((('builtins', 'object'),), b'copyreg.__newobj__'))
    
  • Global variable, but in this case, we must also clear the cache when we mutate the config.

    >>> from charmonium.freeze import global_config
    >>> global_config.use_hash = False
    >>> global_config.memo.clear()
    >>> freeze(obj)
    (1, 2, 3, frozenset({4, 5, 6}), ((('builtins', 'object'),), b'copyreg.__newobj__'))
    

use_hash=True will be faster and produce less data, but I will demonstrate it with use_hash=False so you can see what data gets included in the state.

See the source code charmonium/freeze/config.py for other configuration options.

Freezing Functions

freeze on functions returns their bytecode, constants, and closure-vars. The remarkable thing is that this is true across subsequent invocations of the same process. If the user edits the script and changes the function, then it’s freeze will change too. This tells you if it is safe to use the cached value of the function.

(freeze(f) == freeze(g)) implies (for all x, f(x) == g(x))
>>> from pprint import pprint
>>> i = 456
>>> func = lambda x: x + i + 123
>>> pprint(freeze(func))
(('<lambda>',
  None,
  123,
  b'\x97\x00|\x00t\x00\x00\x00\x00\x00\x00\x00\x00\x00z\x00\x00\x00d\x01'
  b'z\x00\x00\x00S\x00'),
 (('i', 456),))

As promised, the frozen value includes the bytecode (b'|x00t...), the constants (123), and the closure variables (456). When we change i, we get a different frozen value, indicating that the func might not be computationally equivalent to what it was before.

>>> i = 789
>>> pprint(freeze(func))
(('<lambda>',
  None,
  123,
  b'\x97\x00|\x00t\x00\x00\x00\x00\x00\x00\x00\x00\x00z\x00\x00\x00d\x01'
  b'z\x00\x00\x00S\x00'),
 (('i', 789),))

freeze works for objects that use function as data.

>>> import functools
>>> pprint(freeze(functools.partial(print, 123)))
(('print',),
 ('print', (123,), (), None),
 (frozenset({'partial',
             (...,
              ('args', (b'member_descriptor', b'args')),
              ('func', (b'member_descriptor', b'func')),
              ('keywords', (b'member_descriptor', b'keywords')))}),
  ('builtins', 'object')))

freeze works for methods.

>>> class Greeter:
...     def __init__(self, greeting):
...         self.greeting = greeting
...     def greet(self, name):
...         print(self.greeting + " " + name)
...
>>> pprint(freeze(Greeter.greet))
(('greet',
  None,
  ' ',
  b'\x97\x00t\x01\x00\x00\x00\x00\x00\x00\x00\x00|\x00j\x02\x00\x00\x00\x00'
  b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00d\x01z\x00\x00\x00'
  b'|\x01z\x00\x00\x00\xab\x01\x00\x00\x00\x00\x00\x00\x01\x00y\x00'),)

Freezing Objects

freeze works on objects by freezing their state and freezing their methods. The state is found by the pickle protocol, which the Python language implements by default for all classes. To get an idea of what this returns, call obj.__reduce_ex__(4). Because we reuse an existing protocol, freeze work correctly on most user-defined types.

>>> s = Greeter("hello")
>>> pprint(s.__reduce_ex__(4))
(<function __newobj__ at 0x...>,
 (<class '__main__.Greeter'>,),
 {'greeting': 'hello'},
 None,
 None)
>>> pprint(freeze(s))
(((frozenset({'Greeter',
              (('__init__',
                (('__init__', None, b'|\x01|\x00_\x00d\x00S\x00'),)),
               ('greet',
                (('greet',
                  None,
                  ' ',
                  b't\x00|\x00j\x01d\x01\x17\x00|\x01\x17\x00\x83\x01'
                  b'\x01\x00d\x00S\x00'),)))}),
   ('builtins', 'object')),),
 (('greeting', 'hello'),),
 b'copyreg.__newobj__')

However, there can still be special cases: pickle may incorporate non-deterministic values. In this case, there are three remedies:

  • If you can tweak the definition of the class, add a method called __getfrozenstate__ which returns a deterministic snapshot of the state. This takes precedence over the Pickle protocol, if it is defined.

    >>> class Greeter:
    ...     def __init__(self, greeting):
    ...         self.greeting = greeting
    ...     def greet(self, name):
    ...         print(self.greeting + " " + name)
    ...     def __getfrozenstate__(self):
    ...         return self.greeting
    ...
    >>> pprint(freeze(Greeter("hello")))
    ((frozenset({'Greeter',
                 (('__getfrozenstate__',
                   (('__getfrozenstate__', None, b'|\x00j\x00S\x00'),)),
                  ('__init__', (('__init__', None, b'|\x01|\x00_\x00d\x00S\x00'),)),
                  ('greet',
                   (('greet',
                     None,
                     ' ',
                     b't\x00|\x00j\x01d\x01\x17\x00|\x01\x17\x00\x83\x01'
                     b'\x01\x00d\x00S\x00'),)))}),
      ('builtins', 'object')),
     'hello')
    
  • Otherwise, you can ignore certain attributes by changing the configuration. See the source code of charmonium/freeze/config.py for more details.

    >>> class Greeter:
    ...     def __init__(self, greeting):
    ...         self.greeting = greeting
    ...     def greet(self, name):
    ...         print(self.greeting + " " + name)
    ...
    >>> config = Config(use_hash=False)
    >>> config.ignore_attributes.add(("__main__", "Greeter", "greeting"))
    >>> pprint(freeze(Greeter("hello"), config))
    (((frozenset({'Greeter',
                  (('__init__',
                    (('__init__', None, b'|\x01|\x00_\x00d\x00S\x00'),)),
                   ('greet',
                    (('greet',
                      None,
                      ' ',
                      b't\x00|\x00j\x01d\x01\x17\x00|\x01\x17\x00\x83\x01'
                      b'\x01\x00d\x00S\x00'),)))}),
       ('builtins', 'object')),),
     (),
     b'copyreg.__newobj__')
    

    Note that 'hello' is not present in the frozen object any more.

  • If you cannot tweak the definition of the class or monkeypatch a __getfrozenstate__ method, you can still register single dispatch handler for that type:

    >>> from typing import Hashable, Optional, Dict, Tuple
    >>> from charmonium.freeze import _freeze_dispatch, _freeze
    >>> @_freeze_dispatch.register(Greeter)
    ... def _(
    ...         obj: Greeter,
    ...         config: Config,
    ...         tabu: Dict[int, Tuple[int, int]],
    ...         level: int,
    ...         index: int,
    ...     ) -> Tuple[Hashable, bool, Optional[int]]:
    ...     # Type annotations are optional.
    ...     # I have included them here for clarity.
    ...
    ...     # `tabu` is for object cycle detection. It is handled for you.
    ...     # `level` is for logging and recursion limits. It is incremented for you.
    ...     # `index` is the "birth order" of the children.
    ...     frozen_greeting = _freeze(obj.greeting, config, tabu, level, 0)
    ...
    ...     return (
    ...         frozen_greeting[0],
    ...         # Remember that _freeze returns a triple;
    ...         # we are only interested in the first element here.
    ...
    ...         False,
    ...         # Whether the obj is immutable
    ...         # If the obj is immutable, it's frozen value need not be recomputed every time.
    ...         # This is handled for you.
    ...
    ...         None,
    ...         # The depth of references contained here or None
    ...         # Currently, this doesn't do anything.
    ...     )
    ...
    >>> freeze(Greeter("Hello"))
    'Hello'
    

Dictionary order

As of Python 3.7, dictionaries “remember” their insertion order. As such,

>>> freeze({"a": 1, "b": 2})
(('a', 1), ('b', 2))
>>> freeze({"b": 2, "a": 1})
(('b', 2), ('a', 1))

This behavior is controllable by Config.ignore_dict_order, which emits a frozenset of pairs.

>>> config = Config(ignore_dict_order=True)
>>> freeze({"b": 2, "a": 1}, config) == freeze({"a": 1, "b": 2}, config)
True

Summarize diff

This enables a pretty neat utility to compare two arbitrary Python objects.

>>> from charmonium.freeze import summarize_diffs
>>> obj0 = [0, 1, 2, {3, 4}, {"a": 5, "b": 6, "c": 7}, 8]
>>> obj1 = [0, 8, 2, {3, 5}, {"a": 5, "b": 7, "d": 8}]
>>> print(summarize_diffs(obj0, obj1))
let obj0_sub = obj0
let obj1_sub = obj1
obj0_sub.__len__() == 6
obj1_sub.__len__() == 5
obj0_sub[1] == 1
obj1_sub[1] == 8
obj0_sub[3].has() == 4
obj1_sub[3].has() == no such element
obj0_sub[3].has() == no such element
obj1_sub[3].has() == 5
obj0_sub[4].keys().has() == c
obj1_sub[4].keys().has() == no such element
obj0_sub[4].keys().has() == no such element
obj1_sub[4].keys().has() == d
obj0_sub[4]['b'] == 6
obj1_sub[4]['b'] == 7

And if you don’t like my printing style, you can get a programatic access to this information.

>>> from charmonium.freeze import iterate_diffs
>>> for o1, o2 in iterate_diffs(obj0, obj1):
...    print(o1, o2, sep="\n")
ObjectLocation(labels=('obj0', '.__len__()'), objects=(..., 6))
ObjectLocation(labels=('obj1', '.__len__()'), objects=(..., 5))
ObjectLocation(labels=('obj0', '[1]'), objects=(..., 1))
ObjectLocation(labels=('obj1', '[1]'), objects=(..., 8))
ObjectLocation(labels=('obj0', '[3]', '.has()'), objects=(...), 4))
ObjectLocation(labels=('obj1', '[3]', '.has()'), objects=(..., 'no such element'))
ObjectLocation(labels=('obj0', '[3]', '.has()'), objects=(...), 'no such element'))
ObjectLocation(labels=('obj1', '[3]', '.has()'), objects=(..., 5))
ObjectLocation(labels=('obj0', '[4]', '.keys()', '.has()'), objects=(..., 'c'))
ObjectLocation(labels=('obj1', '[4]', '.keys()', '.has()'), objects=(..., 'no such element'))
ObjectLocation(labels=('obj0', '[4]', '.keys()', '.has()'), objects=(..., 'no such element'))
ObjectLocation(labels=('obj1', '[4]', '.keys()', '.has()'), objects=(..., 'd'))
ObjectLocation(labels=('obj0', '[4]', "['b']"), objects=(..., 6))
ObjectLocation(labels=('obj1', '[4]', "['b']"), objects=(..., 7))

Debugging

Use the following lines to see how freeze decomposes an object into primitive values.

import logging, os
logger = logging.getLogger("charmonium.freeze")
logger.setLevel(logging.DEBUG)
fh = logging.FileHandler("freeze.log")
fh.setLevel(logging.DEBUG)
fh.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(fh)
logger.debug("Program %d", os.getpid())

i = 0
def square_plus_i(x):
    # Value of global variable will be included in the function's frozen state.
    return x**2 + i

from charmonium.freeze import freeze
freeze(square_plus_i)

This produces a log such as in freeze.log:

freeze begin <function square_plus_i at 0x7f9228bff550>
 function <function square_plus_i at 0x7f9228bff550>
  tuple (('code', <code object square_plus_i at 0x7f9228c6cf50, file "/tmp/ipython_edit_303agyiz/ipython_edit_rez33yf_.py", line 2>), 'closure globals', {'i': 0})
   tuple ('code', <code object square_plus_i at 0x7f9228c6cf50, file "/tmp/ipython_edit_303agyiz/ipython_edit_rez33yf_.py", line 2>)
    'code'
    code <code object square_plus_i at 0x7f9228c6cf50, file "/tmp/ipython_edit_303agyiz/ipython_edit_rez33yf_.py", line 2>
     tuple (None, 2)
      None
      2
     b'|\x00d\x01\x13\x00t\x00\x17\x00S\x00'
   'closure globals'
   dict {'i': 0}
    'i'
    0
freeze end

I do this to find the differences between subsequent runs:

$ python code.py
$ mv freeze.log freeze.0.log

$ python code.py
$ mv freeze.log freeze.1.log

$ sed -i -e 's/at 0x[0-9a-f]*/ptr/g' -e 's/memo hit for [0-9]*/memo hit/g' -e 's/tabu hit for [0-9]*/tabu hit/g' freeze.*.log
# This removes pointer values that appear in the `repr(...)`.

$ meld freeze.0.log freeze.1.log
# Alternatively, use `icdiff` or `diff -u1`.

If freeze(obj) is taking a long time, try adding freeze(obj, Config(recursion_limit=20)). This causes an exception if freeze recurses more than a certain number of times. If you hit this exception, consider adding ignored class, functions, attributes, or objects in Config.

Developing

See CONTRIBUTING.md for instructions on setting up a development environment.

TODO

  • ☐ Correctness

    • ☑ Test hashing sets with different orders. Assert tests fail.

    • ☑ Test hashing dicts with different orders. Assert tests fail.

    • ☑ Don’t include properties in hash.

    • ☑ Test that freeze of an object includes freeze of its instance methods.

    • ☑ Test functions with minor changes.

    • ☑ Test set/dict with diff hash.

    • ☑ Test obj with slots.

    • ☑ Test hash for objects and classes more carefully.

    • ☑ Improve test coverage.

    • ☑ Investigate when modules are assumed constant.

    • ☐ Detect if a module/package has a version. If present, use that. Else, use each attribute.

    • ☐ Support closures which include import x and from x import y

  • ☑ API

    • ☑ Use user-customizable multidispatch.

    • ☑ Bring hash into separate package.

    • ☑ Make it easier to register a freeze method for a type.

    • ☑ Encapsulate global config into object.

    • ☑ Make freeze object-oriented with a module-level instance, like random.random and random.Random. - This makes it easier for different callers to have their own configuration options.

    • ☑ Add an option which returns a single 128-bit int instead of a structured object after a certain depth. This is what charmonium.determ_hash does. Use this configuration in charmonium.cache.

    • ☐ Move “get call graph” into its own package.

    • ☐ Document configuration options.

    • ☑ Document summarize_diff and iterate_diffs.

    • ☑ Config object should cascade with with config.set(a=b)

    • ☐ Use the class name when deciding whether to ignore a method (currently we just use the methods __module__ and __name__).

    • ☐ Support ~__getfrozenstate_for_type__~

    • ☐ Have a helper that can annotate methods, classes, and functions with ~__getfrozenstate__~.

    • ☐ Make it possible to ignore classes and functions by their package, rather than just module.

    • ☐ Have an API for ignoring modules in requirements.txt or pyproject.toml, and just tracking them by version.

  • ☑ Make freeze handle more types:

    • ☑ Module: freeze by name.

    • ☑ Objects: include the source-code of methods.

    • ☑ C extensions. freeze by name, like module

    • ☑ Methods

    • ☑ fastpath for numpy arrays

    • tqdm

    • numpy.int64(1234)

    • ☑ Pandas dataframe

    • ☑ Catch Pickle TypeError

    • ☑ Catch Pickle ImportError

  • ☐ Performance

    • ☑ Memoize the hash of immutable data: - If function contains no locals or globals except other immutables, it is immutable. - If a collection is immutable and contains only immutables, it is immutable.

    • ☑ Make performance benchmarks.

    • ☐ Consider making a fast-path for dataclasses.

    • ☐ Consider a config option that states “all classes/functions from X.Y are stateless.”
      • For example, most of the standard library is stateless (e.g., re)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

charmonium_freeze-0.8.4.tar.gz (26.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

charmonium_freeze-0.8.4-py3-none-any.whl (26.0 kB view details)

Uploaded Python 3

File details

Details for the file charmonium_freeze-0.8.4.tar.gz.

File metadata

  • Download URL: charmonium_freeze-0.8.4.tar.gz
  • Upload date:
  • Size: 26.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.12.10 Linux/6.12.28

File hashes

Hashes for charmonium_freeze-0.8.4.tar.gz
Algorithm Hash digest
SHA256 ad082f43a55c74e47ef4ee6b551ca95b50a73b03f47a2e41c34cd84914742579
MD5 6d7e5b887bc500e71c69f2d35cc58d1b
BLAKE2b-256 c95bcfda6ae665e73e10b5c874ce9d6df957644b439edb865302af8a684629c2

See more details on using hashes here.

File details

Details for the file charmonium_freeze-0.8.4-py3-none-any.whl.

File metadata

  • Download URL: charmonium_freeze-0.8.4-py3-none-any.whl
  • Upload date:
  • Size: 26.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.12.10 Linux/6.12.28

File hashes

Hashes for charmonium_freeze-0.8.4-py3-none-any.whl
Algorithm Hash digest
SHA256 81c7d2e55816f4d3ce2cff2a30918de92face1eaf8458c555d6932a166743491
MD5 f115bce7032026b61ec467bfb1b8836f
BLAKE2b-256 0c157f8ea07d4b8347849496fc57bccdc5c1454e1dfd2003472e1190fb10913f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page