Deterministic, atomic, save-only-if-modified file writing for Python data pipelines.
📝 stablewrite
stablewrite is for scripts that generate files repeatedly but should only touch the output when the underlying data actually changed.
If you use Snakemake, Make, Docker volumes, CI caches, notebooks, or report pipelines, you have probably seen this: a script re-runs, writes the same data again, updates the file modification time, and suddenly half the downstream workflow rebuilds for no real reason.
stablewrite fixes that by writing into an isolated temporary directory first, normalizing volatile metadata, comparing the result with the existing destination, and publishing only when the finalized output is meaningfully different.
from stable_write import save_if_changed

with save_if_changed("output/report.csv") as saver:
    saver.path.write_text("id,value\n1,100\n", encoding="utf-8")

print(saver)  # saved or skipped, with hashes and reason available on the object
If the generated bytes match the existing file, the destination is left untouched. Its mtime stays exactly as it was.
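The stage, compare, and publish cycle described above can be approximated with the standard library alone. The following is a minimal sketch, not stablewrite's actual implementation: the helper names file_digest and publish_if_changed are hypothetical, and finalizers, profiles, and companions are left out.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, algo: str = "blake2b") -> str:
    """Hash a file in fixed-size chunks so large files are not loaded at once."""
    h = hashlib.new(algo)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def publish_if_changed(staged: Path, destination: Path) -> bool:
    """Replace destination with staged only when their digests differ.

    Returns True when the destination was replaced, False when it was left alone.
    """
    if destination.exists() and file_digest(staged) == file_digest(destination):
        return False  # identical bytes: leave the destination and its mtime untouched
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy next to the destination first so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, prefix=".tmp-")
    os.close(fd)
    try:
        shutil.copyfile(staged, tmp_name)
        os.replace(tmp_name, destination)
    except BaseException:
        # Clean up the partial temp file; the destination was never touched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return True
```

Copying to a temp file beside the destination before calling os.replace means readers see either the old file or the complete new one, never a half-written file.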
✨ Features
- Save only if changed: unchanged outputs are discarded, preserving destination mtime and avoiding unnecessary downstream rebuilds.
- Atomic publish step: files are staged away from the destination, then copied to a destination-side temp file and published with os.replace.
- Deterministic ZIP/OOXML profiles: built-in profiles for .zip, .xlsx, .docx, and .pptx normalize ZIP metadata and strip volatile OOXML core properties.
- Companion file support: publish bundles such as ESRI Shapefiles (.shp, .dbf, .shx, .prj, .cpg) together with the main file.
- Strict explicit companions: if you request companions=["foo.csv"], that file must be created, otherwise the save fails without publishing anything.
- Semantic comparison hook: use is_equal= for formats where byte stability is unrealistic but structural equality is easy to check.
- Zero core dependencies: the built-in profiles use only the Python standard library. Writer libraries such as pandas, openpyxl, and GeoPandas are only needed by your own code.
- Large ZIP friendly: ZIP entries are streamed during normalization, so embedded media in .pptx or .docx files do not need to be loaded fully into memory.
📦 Installation
Install the core package:
pip install stablewrite
The core library has no runtime dependencies. Install the writer libraries you use in your own pipeline:
pip install pandas openpyxl # if you generate Excel files
pip install geopandas # if you write shapefiles or GeoPackages
The built-in xlsx, docx, and pptx profiles do not require openpyxl; they patch the OOXML ZIP structure directly with the standard library.
🚀 Quickstart
Basic Usage
Write to saver.path, not directly to the final destination. After the with block exits, stablewrite decides whether to publish.
from stable_write import save_if_changed

with save_if_changed("output/report.csv") as saver:
    saver.path.write_text("id,value\n1,100\n", encoding="utf-8")

if saver.saved:
    print(f"Updated {saver.destination} ({saver.new_hash})")
else:
    print(f"Skipped: {saver.reason}")
The Excel Timestamp Problem
pandas.DataFrame.to_excel() writes an OOXML workbook. The workbook can include dynamic metadata such as dcterms:modified, so two identical DataFrames saved one second apart can produce different file hashes.
Use the xlsx profile, or the convenience wrapper, to normalize the workbook before comparison:
import pandas as pd
from stable_write import save_xlsx_if_changed

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

with save_xlsx_if_changed("results/data.xlsx") as saver:
    df.to_excel(saver.path, index=False)

if saver.saved:
    print("Excel report changed")
Under the hood the xlsx profile patches docProps/core.xml and then rewrites the ZIP container with deterministic entry ordering, timestamps, and extra fields.
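To make the container rewrite concrete, here is a rough standard-library sketch of ZIP normalization. It is an illustration under stated assumptions, not the library's normalize_zip_metadata: the function name normalize_zip and the fixed date are hypothetical, and the docProps/core.xml patching step is not shown.

```python
import shutil
import zipfile
from pathlib import Path

# Earliest timestamp the ZIP format can represent; any fixed value would do.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def normalize_zip(src: Path, dst: Path) -> None:
    """Rewrite src into dst with sorted entries, fixed timestamps, no extra fields."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in sorted(zin.infolist(), key=lambda i: i.filename):
            # A fresh ZipInfo starts with an empty extra field and no comment.
            clean = zipfile.ZipInfo(info.filename, date_time=FIXED_DATE)
            clean.compress_type = info.compress_type
            clean.external_attr = info.external_attr
            # Stream each member so large embedded media never sits fully in memory.
            # (Members above 2 GiB would additionally need force_zip64=True.)
            with zin.open(info) as member, zout.open(clean, "w") as out:
                shutil.copyfileobj(member, out)
```

Two archives with the same members but different entry order or timestamps should come out byte-identical after this rewrite, which is what makes a plain hash comparison meaningful.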
Companion Files
Some formats are bundles, not single files. ESRI Shapefiles are the classic example: writing spatial.shp usually also creates spatial.shx, spatial.dbf, spatial.prj, and spatial.cpg.
Use companions="auto" when the writer decides which companion files exist:
import geopandas as gpd
from stable_write import save_if_changed

gdf = gpd.read_file("raw_data.geojson")

with save_if_changed("processed/spatial.shp", companions="auto") as saver:
    gdf.to_file(saver.path)

if "spatial.dbf" in saver.changed_companions:
    print("Attribute table changed")
If any companion changes, the save is treated as changed and the bundle is published. Each file is replaced atomically on its own; the bundle as a whole is not transactional across multiple files.
Use an explicit list when every companion is required:
with save_if_changed(
    "processed/spatial.shp",
    companions=["spatial.shx", "spatial.dbf", "spatial.prj"],
) as saver:
    gdf.to_file(saver.path)
If one of those listed files is missing from the temporary directory, stablewrite raises FileNotFoundError and leaves the destination untouched. That makes explicit companions a contract, while companions="auto" remains the optional/discovery mode.
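The difference between the two modes can be sketched as two small checks on the staging directory. Both helper names are hypothetical, and the discovery rule used here (sibling files sharing the main file's stem) is an assumption about what "auto" might look like, not a description of stablewrite's internals.

```python
from pathlib import Path


def discover_companions(staged_main: Path) -> list[str]:
    """Return sibling files that share the main file's stem (the "auto" idea)."""
    return sorted(
        p.name
        for p in staged_main.parent.iterdir()
        if p.is_file() and p.stem == staged_main.stem and p != staged_main
    )


def require_companions(staged_main: Path, companions: list[str]) -> None:
    """Raise FileNotFoundError if any explicitly listed companion was not written."""
    missing = [name for name in companions if not (staged_main.parent / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing companion files: {', '.join(missing)}")
```

Running the strict check before anything is copied is what lets a missing file abort the save with the destination still untouched.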
Custom Semantic Comparison
Some formats are not realistically byte-stable. SQLite-based formats such as GeoPackage (.gpkg) may include internal metadata, page ordering, or timestamps that make byte hashes noisy.
For those cases, provide is_equal=. The callable receives the newly generated temp file and the existing destination and returns whether they are equivalent.
from pathlib import Path
import geopandas as gpd
from stable_write import save_if_changed

def gpkg_is_equal(new: Path, existing: Path) -> bool:
    """Compare GeoPackages by data content, not raw bytes."""
    new_data = gpd.read_file(new)
    old_data = gpd.read_file(existing)
    return new_data.equals(old_data)

with save_if_changed("data/roads.gpkg", is_equal=gpkg_is_equal) as saver:
    gdf.to_file(saver.path, driver="GPKG")
old_hash and new_hash are still computed and stored. is_equal only replaces the equality decision for the main file. Companion files are still compared by hash.
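When GeoPandas is not available, a comparator for SQLite-based files can also be written with the standard library. The sketch below is a hypothetical helper, sqlite_tables_equal, that compares the rows of caller-selected tables and ignores everything else in the file. Which tables matter is an assumption the caller has to supply; bookkeeping tables that carry timestamps would simply be left off the list.

```python
import sqlite3
from pathlib import Path


def sqlite_tables_equal(new: Path, existing: Path, tables: list[str]) -> bool:
    """Compare the rows of selected tables, ignoring page layout and row order."""

    def snapshot(db_path: Path) -> dict[str, list[str]]:
        con = sqlite3.connect(db_path)
        try:
            rows = {}
            for table in tables:
                # Table names cannot be bound as parameters, so quote them as identifiers.
                quoted = '"' + table.replace('"', '""') + '"'
                cur = con.execute(f"SELECT * FROM {quoted}")
                # Sort the textual form of each row so insertion order does not matter.
                rows[table] = sorted(repr(row) for row in cur)
            return rows
        finally:
            con.close()

    return snapshot(new) == snapshot(existing)
```

Because is_equal is called with the new file and the existing file, the table list could be bound in advance, for example with functools.partial(sqlite_tables_equal, tables=["roads"]).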
⚙️ API Overview
save_if_changed(...)
save_if_changed(
    path,
    *,
    profile=None,
    finalizers=None,
    save_strategy="overwrite",
    algo="blake2b",
    safe_copy=False,
    companions="auto",
    is_equal=None,
)
| Argument | Purpose |
|---|---|
| path | Final destination path. |
| profile | Named profile: "zip", "xlsx", "docx", "pptx", or any registered custom profile. |
| finalizers | Ordered list of custom (Path) -> None functions run before hashing. Overrides profile. |
| save_strategy | What to do when content changed: "overwrite", "raise", or "skip". |
| algo | Hash algorithm used for byte comparison. Defaults to "blake2b". |
| safe_copy | Use shutil.copyfile instead of shutil.copy2 for the publish copy. |
| companions | "auto", None, [], or an explicit list of companion filenames. |
| is_equal | Optional semantic comparator for the main file. |
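The three save_strategy values listed above amount to a small decision rule. The function below is a hypothetical sketch of that rule, not code from the library. It assumes the strategy is only consulted once a change has been detected, and that an unknown strategy name is rejected with ValueError.

```python
def should_publish(changed: bool, save_strategy: str = "overwrite") -> bool:
    """Return True when the new output should replace the destination."""
    if save_strategy not in {"overwrite", "raise", "skip"}:
        # Assumption: unknown strategy names are rejected up front.
        raise ValueError(f"unknown save_strategy: {save_strategy!r}")
    if not changed:
        return False  # unchanged output is never published, whatever the strategy
    if save_strategy == "raise":
        raise FileExistsError("content changed and save_strategy='raise'")
    # "overwrite" publishes; "skip" records the change without publishing.
    return save_strategy == "overwrite"
```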
Registry
Profiles are stored in a global registry. The following functions manage it:
| Function | Purpose |
|---|---|
| register_profile(name, finalizers, is_equal, force) | Register a named profile for use with profile=. |
| get_profile(name) → Profile | Retrieve a registered profile; raises ValueError if absent. |
| list_profiles() → list[str] | Return a sorted list of all registered profile names. |
All three are importable directly from stable_write.
Built-In Profiles
| Profile | Finalizers | Use case |
|---|---|---|
| zip | normalize_zip_metadata | Generic ZIP archives with volatile entry metadata. |
| xlsx | strip_ooxml_metadata, normalize_zip_metadata | Generated Excel workbooks, including pandas/openpyxl output. |
| docx | strip_ooxml_metadata, normalize_zip_metadata | Generated Word documents. |
| pptx | strip_ooxml_metadata, normalize_zip_metadata | Generated PowerPoint files, including files with large embedded media. |
Result Object
Inside the context manager you receive a Saver. After the context exits, it exposes:
| Attribute | Meaning |
|---|---|
| saver.path | Temporary path you should write to inside the with block. |
| saver.destination | Final destination path. |
| saver.saved | True if the destination was replaced. |
| saver.changed | True if the new output differed from the existing output. |
| saver.reason | Human-readable decision reason. |
| saver.old_hash | Hash of the existing destination, or None when missing. |
| saver.new_hash | Hash of the finalized temp file. |
| saver.changed_companions | Companion filenames whose bytes changed or appeared. |
🧭 Save Strategies
Use save_strategy to control what happens when content changed:
- "overwrite" (default): publish the new output.
- "raise": raise FileExistsError and leave the destination untouched.
- "skip": do not publish, but populate changed, reason, and hashes on the saver.
"raise" is useful for strict notebook evaluation or audit workflows where a rerun must never mutate canonical outputs silently.
🔌 Custom Profiles
You can package reusable finalizer chains as named profiles. A registered
profile can be selected with profile= anywhere you call save_if_changed,
including in third-party libraries built on top of stablewrite.
from pathlib import Path
from stable_write import register_profile, save_if_changed
from stable_write.finalizers import normalize_zip_metadata

def strip_my_app_header(path: Path) -> None:
    """Remove the generated-on comment from app-specific text exports."""
    lines = path.read_text(encoding="utf-8").splitlines()
    cleaned = [line for line in lines if not line.startswith("# Generated on")]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")

# Finalizers run in the order listed; each receives the staged file path.
register_profile("my_zip", finalizers=[strip_my_app_header, normalize_zip_metadata])

with save_if_changed("output/bundle.zip", profile="my_zip") as saver:
    build_bundle(saver.path)
You can also attach a default is_equal comparator to a profile. When
save_if_changed resolves the profile, is_equal is used automatically unless
the caller provides their own.
register_profile("gpkg", is_equal=gpkg_is_equal)
Use force=True to replace an existing registration (for example, when
testing or when upgrading a profile at startup).
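The registry behaviour described in this section (named profiles, a force flag for replacement, and a caller-supplied is_equal taking precedence over the profile default) can be modelled in a few lines. ToyRegistry and its Profile dataclass are hypothetical stand-ins written for illustration. In particular, raising ValueError on a duplicate registration without force is an assumption, since the text above only documents ValueError for a missing profile.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

Finalizer = Callable[[Path], None]
Comparator = Callable[[Path, Path], bool]


@dataclass(frozen=True)
class Profile:
    finalizers: tuple[Finalizer, ...] = ()
    is_equal: Optional[Comparator] = None


class ToyRegistry:
    """Illustrative stand-in for a global profile registry."""

    def __init__(self) -> None:
        self._profiles: dict[str, Profile] = {}

    def register(self, name, finalizers=(), is_equal=None, force=False) -> None:
        # Assumption: re-registering a name without force=True is an error.
        if name in self._profiles and not force:
            raise ValueError(f"profile {name!r} is already registered")
        self._profiles[name] = Profile(tuple(finalizers), is_equal)

    def get(self, name: str) -> Profile:
        if name not in self._profiles:
            raise ValueError(f"unknown profile {name!r}")
        return self._profiles[name]

    def names(self) -> list[str]:
        return sorted(self._profiles)

    def resolve_is_equal(self, name: str, caller_is_equal=None):
        """The caller's comparator wins; otherwise use the profile's default."""
        if caller_is_equal is not None:
            return caller_is_equal
        return self.get(name).is_equal
```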
🧹 Custom Finalizers
Finalizers are small functions that mutate the staged temporary file before hashing. They are the right tool when you want the file on disk to be canonical.
Common uses:
- Remove generated headers such as # Generated on 2026-05-28 from text exports.
- Re-serialize JSON/YAML with sorted keys and stable indentation.
- Strip image metadata from generated plots.
- Remove absolute local paths from generated reports.
Example: canonical JSON output.
import json
from pathlib import Path
from stable_write import save_if_changed

def canonical_json(path: Path) -> None:
    data = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(
        json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )

with save_if_changed("config.json", finalizers=[canonical_json]) as saver:
    some_library.write_json(saver.path)
If the finalizer raises, nothing is published. The existing destination stays untouched.
🤔 Finalizers vs. is_equal
Both features help with formats that produce noisy bytes. They solve different problems.
Use a finalizer when you want to fix the generated file before it lands on disk:
- the stored file should have stable formatting;
- downstream tools rely on byte-level stability;
- Git diffs should be clean;
- hashes should represent the normalized artifact.
Use is_equal when you only need a smarter comparison:
- the file format is hard to rewrite safely;
- semantic equality is easy to compute in Python;
- you want to ignore fields during comparison without altering newly saved files;
- you need tolerance-based comparison, such as approximate floats.
Example: compare JSON semantically while ignoring a volatile nested key.
import json
from pathlib import Path
from stable_write import save_if_changed

def json_equal_ignoring_timestamp(new_path: Path, existing_path: Path) -> bool:
    new_data = json.loads(new_path.read_text(encoding="utf-8"))
    old_data = json.loads(existing_path.read_text(encoding="utf-8"))
    new_data.get("metadata", {}).pop("generated_at", None)
    old_data.get("metadata", {}).pop("generated_at", None)
    return new_data == old_data

with save_if_changed("config.json", is_equal=json_equal_ignoring_timestamp) as saver:
    some_library.write_json(saver.path)
If is_equal returns False, the raw generated temp file is published. If you also want to clean the file before publication, use a finalizer as well.
| Scenario | Prefer finalizer | Prefer is_equal |
|---|---|---|
| Stable JSON key order on disk | Yes | Maybe not necessary |
| Ignore a nested timestamp only for comparison | Possible, but changes stored file | Yes |
| Clean Git diffs | Yes | No |
| Approximate float comparison | No | Yes |
| Non-Python downstream byte cache | Yes | No |
| Expensive or risky binary rewrite | No | Yes |
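For the approximate-float row in the table, a comparator might look like the following sketch. The name csv_equal_within_tolerance and the default tolerance are illustrative choices, not part of stablewrite. Since math.isclose is called with a relative tolerance only, values very close to zero would need an abs_tol as well.

```python
import csv
import math
from pathlib import Path


def csv_equal_within_tolerance(new: Path, existing: Path, rel_tol: float = 1e-9) -> bool:
    """Treat two CSV files as equal when numeric cells differ only by rounding noise."""

    def read(path: Path) -> list[list[str]]:
        with path.open(newline="", encoding="utf-8") as fh:
            return list(csv.reader(fh))

    def cells_match(a: str, b: str) -> bool:
        if a == b:
            return True
        try:
            return math.isclose(float(a), float(b), rel_tol=rel_tol)
        except ValueError:
            return False  # at least one cell is not numeric and the text differs

    new_rows, old_rows = read(new), read(existing)
    if len(new_rows) != len(old_rows):
        return False
    return all(
        len(r1) == len(r2) and all(cells_match(a, b) for a, b in zip(r1, r2))
        for r1, r2 in zip(new_rows, old_rows)
    )
```

Used as is_equal, a comparator like this leaves the existing file in place when a rerun only shifts the last digits of a float, which a finalizer could not do without rewriting the stored numbers.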
🧱 Guarantees and Boundaries
stablewrite is intentionally conservative:
- Finalizers run before hashing, so profiles can make noisy output deterministic.
- Finalizer failures leave the destination untouched.
- The final publish uses destination-side temporary files and os.replace.
- For companion bundles, each file is replaced atomically, but the bundle is not a transaction.
- is_equal affects only the main file; companions are still tracked by hash.
- Explicit companion lists are strict. Use companions="auto" when companion files are optional.
🧪 Why This Matters
A plain write updates mtime even when the content is identical:
Path("report.csv").write_text(render_report())
That is enough to wake up downstream jobs in Make, Snakemake, Docker layer caches, or CI artifacts.
stablewrite makes the write conditional on the finalized artifact:
with save_if_changed("report.csv") as saver:
    saver.path.write_text(render_report(), encoding="utf-8")
Same data means no replacement, no new mtime, and no accidental rebuild.
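The mtime effect is easy to reproduce with the standard library. The helper below is a hypothetical demonstration, not part of stablewrite: it backdates a file, rewrites the same content either unconditionally or only when the content differs, and reports the modification time before and after.

```python
import os
import tempfile
from pathlib import Path


def mtime_after_rewrite(content: str, only_if_changed: bool) -> tuple[int, int]:
    """Write content twice and return (mtime_before, mtime_after) in nanoseconds."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "report.csv"
        target.write_text(content, encoding="utf-8")
        # Backdate the file so a rewrite is visible regardless of clock resolution.
        os.utime(target, ns=(1_000_000_000, 1_000_000_000))
        before = target.stat().st_mtime_ns
        if not only_if_changed or target.read_text(encoding="utf-8") != content:
            target.write_text(content, encoding="utf-8")
        return before, target.stat().st_mtime_ns
```

With only_if_changed=False the two timestamps differ even though the bytes are the same; with only_if_changed=True they stay equal, which is the property a timestamp-based build tool relies on.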
Project details
Release 0.1.1 was published from ews-ffarella/stablewrite (tag refs/tags/v0.1.1, owner https://github.com/ews-ffarella) using Trusted Publishing.

| File | Type | Size | SHA256 |
|---|---|---|---|
| stable_write-0.1.1.tar.gz | Source distribution | 15.8 kB | fe53cdd15db6c9fc236dc8caac10928bd5db17007abf0c382914fb51b0949018 |
| stable_write-0.1.1-py3-none-any.whl | Built distribution (Python 3) | 17.6 kB | 8c42091a431959beb3027a5b37978a46c7243da536621e1f77358a985bac0bc1 |