Hash Array Snapshot
This project provides a solution to the following two problems:
- Dataset revisions are seldom tracked effectively, yet they can be just as important to track as source code revisions.
- Version control systems such as Git do not handle large files (e.g. > 1GB) efficiently and struggle to track revisions of large datasets.
Hash array snapshots solve these two problems because they are easily tracked by Git and provide a record of the data files associated with a specific source code revision.
Suppose we have a dataset composed of images stored in folders.
-dataset
    -square
        square0.png
        square1.png
        square2.png
    -triangle
        triangle0.png
        triangle1.png
        triangle2.png
$ has snap -d dataset
This creates snapshot.has with the following contents:
34dc214a2aea8d7c254a9d6dc351e0d3c0088ad998eed6053b78877785fcdff1:triangle/triangle0.png
566f5fa0703f5c2877c38fb3aae0fabbc5f9cdb25499b4f03ca75a6eb3827961:square/square0.png
67240c2cee6e9c77df1192890b1cf4deb265a5a6afdb4a5ecc03e93cc5889cef:triangle/triangle2.png
dfb6352f5d42793b58ac74f2cacf5f1f82bdb1470a30941224a0f1e34766aeb4:square/square2.png
e361db7913f495dafee06657ea67043a49c06fa1a3c57d3ed5b1a9048455de8f:square/square1.png
f7994454bf5a880c5741b3af8e0ababf77f8c450fe47ed8b5c6f7b9d38c9115f:triangle/triangle1.png
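To make the snapshot format concrete, here is a minimal sketch of how such a listing could be produced. It is not the actual implementation of has: the function names are hypothetical, and the choice of SHA-256 is an assumption based on the 64-character hexadecimal values shown above. Entries are sorted by hash, which matches the ordering in the example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_lines(dataset_dir):
    """Return sorted 'hash:relative/path' lines for every file under dataset_dir."""
    root = Path(dataset_dir)
    lines = []
    for path in root.rglob("*"):
        if path.is_file():
            # Paths are stored relative to the dataset root, with forward slashes.
            rel = path.relative_to(root).as_posix()
            lines.append(f"{sha256_of_file(path)}:{rel}")
    return sorted(lines)
```

Because each line holds only a hash and a path, the resulting file stays small enough for Git to track, however large the data files are.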
Sometime later, additional circle data is added to our dataset and the overall naming convention is changed.
-dataset
    -square
        square_a.png
        square_b.png
        square_c.png
    -triangle
        triangle_a.png
        triangle_b.png
        triangle_c.png
    -circle
        circle_a.png
        circle_b.png
        circle_c.png
We can use has check to confirm that the dataset differs from what we recorded in our snapshot.
$ has check -d dataset
181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b:circle/circle_c.png
61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6:circle/circle_a.png
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855:circle/circle_b.png
Even though the triangle and square images have been renamed, they are not reported as new, because files are tracked by hash value rather than by filename.
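The hash-based comparison described here can be sketched as follows. This is an illustration under assumptions, not the code has actually runs: both function names are hypothetical, and the inputs are taken to be lists of hash:path lines in the snapshot format.

```python
def parse_snapshot(lines):
    """Map each hash to its recorded path, given 'hash:path' lines."""
    entries = {}
    for line in lines:
        line = line.strip()
        if line:
            # Split on the first colon only, so paths may contain colons.
            file_hash, _, rel_path = line.partition(":")
            entries[file_hash] = rel_path
    return entries


def new_entries(snapshot_lines, current_lines):
    """Return the current 'hash:path' lines whose hash is absent from the snapshot."""
    known = set(parse_snapshot(snapshot_lines))
    result = []
    for line in current_lines:
        line = line.strip()
        if line and line.partition(":")[0] not in known:
            result.append(line)
    return result
```

Since only the hash part of each line is compared, a renamed file keeps its hash and is filtered out, while a file with previously unseen content is reported regardless of its name.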
We can use our snapshot file snapshot.has to recover the older version of our dataset into an empty directory tmp.
$ has recover -b dataset -d tmp
-tmp
    -square
        square0.png
        square1.png
        square2.png
    -triangle
        triangle0.png
        triangle1.png
        triangle2.png
Folder tmp is now identical to our previous version of folder dataset.
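One way such a recovery could work is sketched below: index the files in the base directory by hash, then copy each file named in the snapshot to its recorded path in the destination. This is a sketch under assumptions rather than the tool's real logic; the function names, the SHA-256 choice, and the behaviour of returning missing paths are all hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def file_hash(path):
    """SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recover(snapshot_lines, base_dir, dest_dir):
    """Copy files from base_dir into dest_dir under their snapshot paths.

    Returns the snapshot paths whose hash was not found in base_dir.
    """
    base, dest = Path(base_dir), Path(dest_dir)
    # Index the base directory by content hash, ignoring current filenames.
    by_hash = {file_hash(p): p for p in base.rglob("*") if p.is_file()}
    missing = []
    for line in snapshot_lines:
        recorded_hash, _, rel_path = line.strip().partition(":")
        if not recorded_hash:
            continue  # skip blank lines
        source = by_hash.get(recorded_hash)
        if source is None:
            missing.append(rel_path)
            continue
        target = dest / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
    return missing
```

In this sketch, files present in the base directory but absent from the snapshot (the circle images in the example) are never copied, and renamed files are restored under their recorded names.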
We can check the contents of tmp to ensure that all files have been copied successfully.
$ has check -d tmp
The easiest way to install has is with pip install has.
Alternatively, check out the latest release version of has (e.g. git checkout v0.0.x) and run sudo ./install.sh. Open a terminal and type has --help to verify the installation. To uninstall has, run sudo ./uninstall.sh.