Convenient long-term storage of python data
Project description
ppersist — Convenient long-term storage of python data
Introduction
There are many ways[^1] to save data from python for future use, but none are quite as convenient as Matlab and Octave’s “save”:
save filename.mat variableA variableB ...
which saves the named variables directly to a file. The corresponding
load filename.mat
reloads the saved variables and dumps them into the global namespace.
This packages aims to offer the closest pythonic equivalent. With it, you can write
ppersist.save("filename.pkl", variableA, variableB, ...)
to save some variables, and
data = ppersist.load("filename.pkl")
to reload them into a namedtuple.
[^1]: See the Alternatives section at the end of this document.
Installation
As easy as:
pip install ppersist
Saving data
The most straightforward way to save your data is
ppersist.save("filename.pkl", variableA, variableB, ...)
Note that this uses the inspect module to discover the names of the
variables to be saved. That means that variableA, variableB, etc.,
must all be simple variables. Variable names must start with a letter
and may only contain letters, numbers, and underscore.
Alternatively, if your data are already packaged in a dictionary, you can write
ppersist.savedict("filename.pkl", dct)
which results in a ppersist-loadable file in which the variables are the items of the dictionary.
Loading data
The most straightforward way to load previously saved data is
dat = ppersist.load("filename.pkl")
This returns a named tuple in which the members are named according to the variables in the file.
As an alternative,
dct = ppersist.loaddict("filename.pkl")
returns the variables as items in a dictionary.
Both forms may be used without restriction with files created by
ppersist.save or ppersist.savedict.
Finally, the form
ppersist.mload("filename.pkl")
comes closest to the simplicity of Octave and Matlab by creating variables directly in the caller’s namespace. This is decidedly unpythonic and confuses IDEs, so is not generally recommended except for throw-away code typed into a terminal.
Considerations of security and reliability
Naturally, you want to be sure that you will be able to reload your data and that loading files your receive from others are safe to load.
First of all, be reminded that ppersist is Free Software. As such, it is “distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.” (Please read the Full license terms.)
That said, the author shares your concerns and has made efforts to make ppersist safe and reliable. Read on for details.
Security
By this I mean: How certain can you be that ppersist will not run arbitrary code contained in a file it loads?
Internally, ppersist uses pickle to save and load its files. Any
quick glance online will tell you that pickle is not inherently
safe. To mitigate this, ppersist.load and friends use a whitelist to
strictly limit what can be loaded. Objects with types not on the
whitelist are not loaded and their defining modules are not
imported. This should provide some reassurance that files you receive
from others are safe to load.
All of that said, if human lives depend on the integrity of your server, you should absolutely do your own due diligence.
Reliability
By this I mean: How likely is it that you will be able to still access
your data in the distant future? Here, the use of pickle is actually
an advantage, as that module is part of core Python and unlikely to
ever go away.
Before actually writing anything to disk, ppersist.save and friends
check whether variables can be safely reloaded. If not, they raise an
exception. This should provide some reassurance that you will be able
to reload your files in the future.
In case you do find yourself unable to load a definitely trusted file
due to an inconsistency between ppersist’s save-time and load-time
checking, you can disable the whitelist by passing trusted=True to
ppersist.load. (If that happens, it is a bug, and the author wants
to hear from you.)
In the worst case, if you find yourself without access to ppersist entirely, you can always reload your data with
dct = pickle.load("filename.pkl")
even though that provides no security at all.
Alternatives
To help you decide whether ppersist is for you, here are three alternatives I considered before writing ppersist.
JSON
Widely supported, human-readable, JSON is a great file format for small quantities of data. It naturally supports basic Python data types and structures. However, for larger datasets, its main strength—storage as text—becomes a liability as it imposes a significant cost both in processing time and storage space. Also, there is no agreed-upon standard for represention numpy arrays in JSON.
CSV
Also widely supported and human-readable, CSV is good for handling straight-up two-dimensional arrays of data, but it does not naturally handle other Python structures. Also, inefficient for larger datasets.
NPY
Numpy can store its main data format (np.ndarray) in .NPY files,
simply using np.save, and reload it with np.load. However, this
format cannot hold multiple np.ndarrays in one file. Neither can it
store other Python data structures (even when wrapped inside an
np.ndarray.)
HDF5
A very capacious file format, HDF5 can in principle store all the
data formats that ppersist supports and is excellent for long-term
storage of very large datasets, as files can be written
incrementally. However, the mapping between Python datatypes and HDF5
datatypes is not quite one-to-one. The pandas library can use HDF5
for storing its DataFrames on disk, if only for a subset of the
datatypes and structures it support.
I considered using HDF5 as a backend for ppersist. (Using it directly
was not attractive to me as the syntax for storing and retrieving data
is more elaborate than I wanted.) What kept me from choosing it, in
the end, was the complexity of supporting all the basic Python
datatypes, especially when considering their hierarchical inclusion in
pandas DataFrames and the like.
Parquet
Developed by Apache, Parquet provides “high performance
compression and encoding schemes to handle complex data in
bulk”. Similar
considerations to HDF5 apply: the fastparquet library does not (yet)
support all Python datatypes, so constructing an easy-to-ues API like
ppersist’s on top of it would be complex.
Development
Development of ppersist is on github.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ppersist-0.10.1-py3-none-any.whl.
File metadata
- Download URL: ppersist-0.10.1-py3-none-any.whl
- Upload date:
- Size: 22.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
63c401764ddb01419ef01c2b1767b74707f0718e88f334fe7d640bed1bf78875
|
|
| MD5 |
d8428d85f975be22488df78e2ef1a8bc
|
|
| BLAKE2b-256 |
6fd876f3d5a49e65374204770910baa824c1d847d40d02b062b70f9a263a7043
|