Deterministic, atomic, save-only-if-modified file writing for Python data pipelines.

📝 stablewrite

stablewrite is for scripts that generate files repeatedly but should only touch the output when the underlying data actually changed.

If you use Snakemake, Make, Docker volumes, CI caches, notebooks, or report pipelines, you have probably seen this: a script re-runs, writes the same data again, updates the file modification time, and suddenly half the downstream workflow rebuilds for no real reason.

stablewrite fixes that by writing into an isolated temporary directory first, normalizing volatile metadata, comparing the result with the existing destination, and publishing only when the finalized output is meaningfully different.

from stable_write import save_if_changed

with save_if_changed("output/report.csv") as saver:
    saver.path.write_text("id,value\n1,100\n", encoding="utf-8")

print(saver)  # saved or skipped, with hashes and reason available on the object

If the generated bytes match the existing file, the destination is left untouched. Its mtime stays exactly as it was.
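The core idea can be sketched with the standard library alone. The helper below, write_if_changed, is a hypothetical name used for illustration and is not part of stablewrite. It only shows the "skip when the bytes match" decision and leaves out staging, finalizers, and companions.

```python
from pathlib import Path


def write_if_changed(destination: Path, data: bytes) -> bool:
    """Write data only when it differs from the existing file.

    Returns True if the file was written, False if it was left untouched.
    Hypothetical helper for illustration; not part of stablewrite.
    """
    if destination.exists() and destination.read_bytes() == data:
        return False  # identical content: do not touch the file or its mtime
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(data)
    return True
```

Calling it twice with the same bytes writes once and then returns False, so tools that watch mtime see no change.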

✨ Features

  • Save only if changed: unchanged outputs are discarded, preserving destination mtime and avoiding unnecessary downstream rebuilds.
  • Atomic publish step: files are staged away from the destination, then copied to a destination-side temp file and published with os.replace.
  • Deterministic ZIP/OOXML profiles: built-in profiles for .zip, .xlsx, .docx, and .pptx normalize ZIP metadata and strip volatile OOXML core properties.
  • Companion file support: publish bundles such as ESRI Shapefiles (.shp, .dbf, .shx, .prj, .cpg) together with the main file.
  • Strict explicit companions: if you request companions=["foo.csv"], that file must be created, otherwise the save fails without publishing anything.
  • Semantic comparison hook: use is_equal= for formats where byte stability is unrealistic but structural equality is easy to check.
  • Zero core dependencies: the built-in profiles use only the Python standard library. Writer libraries such as pandas, openpyxl, and GeoPandas are only needed by your own code.
  • Large ZIP friendly: ZIP entries are streamed during normalization, so embedded media in .pptx or .docx files do not need to be loaded fully into memory.

📦 Installation

Install the core package:

pip install stablewrite

The core library has no runtime dependencies. Install the writer libraries you use in your own pipeline:

pip install pandas openpyxl      # if you generate Excel files
pip install geopandas            # if you write shapefiles or GeoPackages

The built-in xlsx, docx, and pptx profiles do not require openpyxl; they patch the OOXML ZIP structure directly with the standard library.

🚀 Quickstart

Basic Usage

Write to saver.path, not directly to the final destination. After the with block exits, stablewrite decides whether to publish.

from stable_write import save_if_changed

with save_if_changed("output/report.csv") as saver:
    saver.path.write_text("id,value\n1,100\n", encoding="utf-8")

if saver.saved:
    print(f"Updated {saver.destination} ({saver.new_hash})")
else:
    print(f"Skipped: {saver.reason}")

The Excel Timestamp Problem

pandas.DataFrame.to_excel() writes an OOXML workbook. The workbook can include dynamic metadata such as dcterms:modified, so two identical DataFrames saved one second apart can produce different file hashes.
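The container-level half of this problem is easy to reproduce with the standard library. The sketch below builds the same one-entry ZIP twice with entry timestamps two seconds apart (ZIP stores times at two-second resolution). The entry payload is a placeholder, not real workbook content.

```python
import hashlib
import io
import zipfile


def zip_bytes(content: bytes, date_time: tuple[int, int, int, int, int, int]) -> bytes:
    """Build a one-entry ZIP in memory with the given entry timestamp."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        info = zipfile.ZipInfo("docProps/core.xml", date_time=date_time)
        archive.writestr(info, content)
    return buffer.getvalue()


payload = b"<coreProperties/>"  # placeholder payload
first = zip_bytes(payload, (2024, 1, 1, 12, 0, 0))
second = zip_bytes(payload, (2024, 1, 1, 12, 0, 2))  # two seconds later

# Same payload, different container bytes, therefore different hashes.
hashes_differ = hashlib.sha256(first).digest() != hashlib.sha256(second).digest()
```

Any byte-level comparison of the two archives reports a change even though the payload is identical.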

Use the xlsx profile, or the convenience wrapper, to normalize the workbook before comparison:

import pandas as pd
from stable_write import save_xlsx_if_changed

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

with save_xlsx_if_changed("results/data.xlsx") as saver:
    df.to_excel(saver.path, index=False)

if saver.saved:
    print("Excel report changed")

Under the hood the xlsx profile patches docProps/core.xml and then rewrites the ZIP container with deterministic entry ordering, timestamps, and extra fields.
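As a rough illustration of what container normalization involves, here is a minimal standard-library sketch. It is not stablewrite's normalize_zip_metadata: the function name and the fixed timestamp are assumptions, and unlike the library it reads each entry fully into memory.

```python
import zipfile
from pathlib import Path

FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)  # earliest timestamp a ZIP entry can store


def normalize_zip_sketch(path: Path) -> None:
    """Rewrite a ZIP in place with sorted entries and one fixed timestamp.

    Illustrative only; extra fields and comments are dropped because each
    entry is recreated from a fresh ZipInfo.
    """
    tmp = path.with_name(path.name + ".tmp")
    with zipfile.ZipFile(path) as src, zipfile.ZipFile(tmp, "w") as dst:
        for name in sorted(src.namelist()):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            dst.writestr(info, src.read(name))
    tmp.replace(path)  # swap the normalized archive into place
```

Two archives with the same entries but different ordering or timestamps become byte-identical after this pass, provided they are normalized with the same Python and zlib build.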

Companion Files

Some formats are bundles, not single files. ESRI Shapefiles are the classic example: writing spatial.shp usually also creates spatial.shx, spatial.dbf, spatial.prj, and spatial.cpg.

Use companions="auto" when the writer decides which companion files exist:

import geopandas as gpd
from stable_write import save_if_changed

gdf = gpd.read_file("raw_data.geojson")

with save_if_changed("processed/spatial.shp", companions="auto") as saver:
    gdf.to_file(saver.path)

if "spatial.dbf" in saver.changed_companions:
    print("Attribute table changed")

If any companion changes, the save is treated as changed and the bundle is published. Each file is replaced atomically on its own; the bundle as a whole is not transactional across multiple files.

Use an explicit list when every companion is required:

with save_if_changed(
    "processed/spatial.shp",
    companions=["spatial.shx", "spatial.dbf", "spatial.prj"],
) as saver:
    gdf.to_file(saver.path)

If one of those listed files is missing from the temporary directory, stablewrite raises FileNotFoundError and leaves the destination untouched. That makes explicit companions a contract, while companions="auto" remains the optional/discovery mode.
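The contract can be pictured as a check that runs before anything is published. The sketch below is an assumption about the shape of such a check, with hypothetical names; it is not stablewrite's internal code.

```python
from pathlib import Path


def check_required_companions(staging_dir: Path, companions: list[str]) -> None:
    """Raise FileNotFoundError if any listed companion is missing from staging.

    Hypothetical helper: staging_dir stands in for the temporary directory
    that holds the freshly written bundle.
    """
    missing = [name for name in companions if not (staging_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"missing required companion file(s): {', '.join(missing)}"
        )
```

Because the check happens on the staged files, a failure leaves nothing half-published at the destination.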

Custom Semantic Comparison

Some formats are not realistically byte-stable. SQLite-based formats such as GeoPackage (.gpkg) may include internal metadata, page ordering, or timestamps that make byte hashes noisy.

For those cases, provide is_equal=. The callable receives the newly generated temp file and the existing destination and returns whether they are equivalent.

from pathlib import Path

import geopandas as gpd
from stable_write import save_if_changed


def gpkg_is_equal(new: Path, existing: Path) -> bool:
    """Compare GeoPackages by data content, not raw bytes."""
    new_data = gpd.read_file(new)
    old_data = gpd.read_file(existing)
    return new_data.equals(old_data)


gdf = gpd.read_file("raw_data.geojson")  # same input as in the shapefile example

with save_if_changed("data/roads.gpkg", is_equal=gpkg_is_equal) as saver:
    gdf.to_file(saver.path, driver="GPKG")

old_hash and new_hash are still computed and stored. is_equal only replaces the equality decision for the main file. Companion files are still compared by hash.
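A comparator does not need a heavy dependency. As a standard-library sketch of the same (new, existing) -> bool shape, the function below treats two CSV files as equal when they contain the same header and the same rows in any order. Whether row order matters is a domain decision, so treat this as an assumption to adapt.

```python
import csv
from pathlib import Path


def csv_rows_equal(new: Path, existing: Path) -> bool:
    """Return True when both CSV files hold the same header and rows,
    regardless of row order or line-ending style."""

    def load(path: Path) -> tuple[list[str], list[list[str]]]:
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        if not rows:
            return [], []
        return rows[0], sorted(rows[1:])  # header stays first, data rows sorted

    return load(new) == load(existing)
```

It could be passed as is_equal=csv_rows_equal for a CSV destination.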

⚙️ API Overview

save_if_changed(...)

save_if_changed(
    path,
    *,
    profile=None,
    finalizers=None,
    save_strategy="overwrite",
    algo="blake2b",
    safe_copy=False,
    companions="auto",
    is_equal=None,
)
| Argument | Purpose |
| --- | --- |
| path | Final destination path. |
| profile | Named profile: "zip", "xlsx", "docx", "pptx", or any registered custom profile. |
| finalizers | Ordered list of custom (Path) -> None functions run before hashing. Overrides profile. |
| save_strategy | What to do when content changed: "overwrite", "raise", or "skip". |
| algo | Hash algorithm used for byte comparison. Defaults to "blake2b". |
| safe_copy | Use shutil.copyfile instead of shutil.copy2 for the publish copy. |
| companions | "auto", None, [], or an explicit list of companion filenames. |
| is_equal | Optional semantic comparator for the main file. |
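For orientation, byte comparison by hash can be sketched with hashlib. This is an illustration of chunked hashing with a named algorithm, not the library's implementation; hash_file and chunk_size are hypothetical names.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, algo: str = "blake2b", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.new(algo)  # raises ValueError for an unknown algorithm name
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Reading in chunks keeps memory use flat for large outputs, and two files are byte-identical for practical purposes when their digests match.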

Registry

Profiles are stored in a global registry. The following functions manage it:

| Function | Purpose |
| --- | --- |
| register_profile(name, finalizers, is_equal, force) | Register a named profile for use with profile=. |
| get_profile(name) → Profile | Retrieve a registered profile; raises ValueError if absent. |
| list_profiles() → list[str] | Return a sorted list of all registered profile names. |

All three are importable directly from stable_write.

Built-In Profiles

| Profile | Finalizers | Use case |
| --- | --- | --- |
| zip | normalize_zip_metadata | Generic ZIP archives with volatile entry metadata. |
| xlsx | strip_ooxml_metadata, normalize_zip_metadata | Generated Excel workbooks, including pandas/openpyxl output. |
| docx | strip_ooxml_metadata, normalize_zip_metadata | Generated Word documents. |
| pptx | strip_ooxml_metadata, normalize_zip_metadata | Generated PowerPoint files, including files with large embedded media. |

Result Object

Inside the context manager you receive a Saver. After the context exits, it exposes:

| Attribute | Meaning |
| --- | --- |
| saver.path | Temporary path you should write to inside the with block. |
| saver.destination | Final destination path. |
| saver.saved | True if the destination was replaced. |
| saver.changed | True if the new output differed from the existing output. |
| saver.reason | Human-readable decision reason. |
| saver.old_hash | Hash of the existing destination, or None when missing. |
| saver.new_hash | Hash of the finalized temp file. |
| saver.changed_companions | Companion filenames whose bytes changed or appeared. |

🧭 Save Strategies

Use save_strategy to control what happens when content changed:

  • "overwrite" (default): publish the new output.
  • "raise": raise FileExistsError and leave the destination untouched.
  • "skip": do not publish, but populate changed, reason, and hashes on the saver.
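The three strategies amount to a small decision rule once the comparison result is known. The function below is a sketch of that rule under stated assumptions: the name should_publish is hypothetical, rejecting unknown strategies with ValueError is a guess, and the case of a missing destination is not modeled.

```python
def should_publish(changed: bool, save_strategy: str = "overwrite") -> bool:
    """Decide whether to publish, given whether the content changed.

    Sketch only; mirrors the documented strategies, not the library's code.
    """
    if save_strategy not in {"overwrite", "raise", "skip"}:
        raise ValueError(f"unknown save_strategy: {save_strategy!r}")  # assumption
    if not changed:
        return False  # unchanged output is never published
    if save_strategy == "raise":
        raise FileExistsError("content changed and save_strategy is 'raise'")
    return save_strategy == "overwrite"  # "skip" records the change but does not publish
```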

"raise" is useful for strict notebook evaluation or audit workflows where a rerun must never mutate canonical outputs silently.

🔌 Custom Profiles

You can package reusable finalizer chains as named profiles. A registered profile can be selected with profile= anywhere you call save_if_changed, including in third-party libraries built on top of stablewrite.

from pathlib import Path

from stable_write import register_profile, save_if_changed


def strip_my_app_header(path: Path) -> None:
    """Remove the generated-on comment from app-specific text exports."""
    lines = path.read_text(encoding="utf-8").splitlines()
    cleaned = [line for line in lines if not line.startswith("# Generated on")]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")


# This finalizer reads the staged file as UTF-8 text, so register it for text
# outputs only. For ZIP outputs, chain ZIP-aware finalizers such as
# normalize_zip_metadata from stable_write.finalizers instead.
register_profile("my_export", finalizers=[strip_my_app_header])

with save_if_changed("output/export.txt", profile="my_export") as saver:
    build_export(saver.path)  # placeholder for your own writer

You can also attach a default is_equal comparator to a profile. When save_if_changed resolves the profile, is_equal is used automatically unless the caller provides their own.

register_profile("gpkg", is_equal=gpkg_is_equal)

Use force=True to replace an existing registration (for example, when testing or when upgrading a profile at startup).

🧹 Custom Finalizers

Finalizers are small functions that mutate the staged temporary file before hashing. They are the right tool when you want the file on disk to be canonical.

Common uses:

  • Remove generated headers such as # Generated on 2026-05-28 from text exports.
  • Re-serialize JSON/YAML with sorted keys and stable indentation.
  • Strip image metadata from generated plots.
  • Remove absolute local paths from generated reports.

Example: canonical JSON output.

import json
from pathlib import Path

from stable_write import save_if_changed


def canonical_json(path: Path) -> None:
    data = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(
        json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


with save_if_changed("config.json", finalizers=[canonical_json]) as saver:
    some_library.write_json(saver.path)

If the finalizer raises, nothing is published. The existing destination stays untouched.
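Another of the common uses listed above, removing absolute local paths, fits the same (Path) -> None shape. The factory below is a hypothetical sketch; make_path_scrubber and the "<ROOT>" placeholder are illustrative names, and it works on raw bytes so line endings are left alone.

```python
from pathlib import Path
from typing import Callable


def make_path_scrubber(root: str, placeholder: str = "<ROOT>") -> Callable[[Path], None]:
    """Build a finalizer that replaces an absolute root path with a placeholder."""
    needle = root.encode("utf-8")
    replacement = placeholder.encode("utf-8")

    def scrub(path: Path) -> None:
        # Operate on bytes so newline style and encoding are preserved.
        path.write_bytes(path.read_bytes().replace(needle, replacement))

    return scrub
```

The returned function could then be passed in the finalizers= list, for example alongside a JSON canonicalizer.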

🤔 Finalizers vs. is_equal

Both features help with formats that produce noisy bytes. They solve different problems.

Use a finalizer when you want to fix the generated file before it lands on disk:

  • the stored file should have stable formatting;
  • downstream tools rely on byte-level stability;
  • Git diffs should be clean;
  • hashes should represent the normalized artifact.

Use is_equal when you only need a smarter comparison:

  • the file format is hard to rewrite safely;
  • semantic equality is easy to compute in Python;
  • you want to ignore fields during comparison without altering newly saved files;
  • you need tolerance-based comparison, such as approximate floats.

Example: compare JSON semantically while ignoring a volatile nested key.

import json
from pathlib import Path

from stable_write import save_if_changed


def json_equal_ignoring_timestamp(new_path: Path, existing_path: Path) -> bool:
    new_data = json.loads(new_path.read_text(encoding="utf-8"))
    old_data = json.loads(existing_path.read_text(encoding="utf-8"))

    new_data.get("metadata", {}).pop("generated_at", None)
    old_data.get("metadata", {}).pop("generated_at", None)

    return new_data == old_data


with save_if_changed("config.json", is_equal=json_equal_ignoring_timestamp) as saver:
    some_library.write_json(saver.path)

If is_equal returns False, the raw generated temp file is published. If you also want to clean the file before publication, use a finalizer as well.

| Scenario | Prefer finalizer | Prefer is_equal |
| --- | --- | --- |
| Stable JSON key order on disk | Yes | Maybe not necessary |
| Ignore a nested timestamp only for comparison | Possible, but changes stored file | Yes |
| Clean Git diffs | Yes | No |
| Approximate float comparison | No | Yes |
| Non-Python downstream byte cache | Yes | No |
| Expensive or risky binary rewrite | No | Yes |
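The approximate float case can be sketched as an is_equal comparator for JSON output. The function names and default tolerances below are assumptions chosen for illustration; pick tolerances that match your data.

```python
import json
import math
from pathlib import Path


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def approx_equal(a, b, rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Recursively compare JSON-like values, using a tolerance for numbers."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            approx_equal(a[key], b[key], rel_tol, abs_tol) for key in a
        )
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            approx_equal(x, y, rel_tol, abs_tol) for x, y in zip(a, b)
        )
    if _is_number(a) and _is_number(b):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return type(a) is type(b) and a == b


def json_approx_equal(new: Path, existing: Path) -> bool:
    """Comparator with the (new, existing) -> bool shape used by is_equal."""
    new_data = json.loads(new.read_text(encoding="utf-8"))
    old_data = json.loads(existing.read_text(encoding="utf-8"))
    return approx_equal(new_data, old_data)
```

With this comparator, a rerun that changes only the last few floating-point digits is treated as unchanged, while the stored file keeps whatever bytes were last published.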

🧱 Guarantees and Boundaries

stablewrite is intentionally conservative:

  • Finalizers run before hashing, so profiles can make noisy output deterministic.
  • Finalizer failures leave the destination untouched.
  • The final publish uses destination-side temporary files and os.replace.
  • For companion bundles, each file is replaced atomically, but the bundle is not a transaction.
  • is_equal affects only the main file; companions are still tracked by hash.
  • Explicit companion lists are strict. Use companions="auto" when companion files are optional.
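The publish step described in the list above can be sketched with the standard library. This is a minimal illustration of "copy next to the destination, then os.replace", not stablewrite's code; publish_atomically is a hypothetical name and error handling is kept short.

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish_atomically(staged: Path, destination: Path) -> None:
    """Copy staged to a temp file beside destination, then swap it in."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so that os.replace
    # stays on one filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=destination.parent, prefix=destination.name + ".", suffix=".tmp"
    )
    os.close(fd)
    try:
        shutil.copyfile(staged, tmp_name)
        os.replace(tmp_name, destination)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave a stray temp file behind
        raise
```

Readers of the destination see either the old file or the complete new one, never a partial write. Note that mkstemp creates the temp file with owner-only permissions, which the published file inherits in this sketch.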

🧪 Why This Matters

A plain write updates mtime even when the content is identical:

Path("report.csv").write_text(render_report(), encoding="utf-8")

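That is enough to wake up downstream jobs in timestamp-driven tools such as Make and Snakemake, and in CI steps keyed on modification time. Caches keyed on content, such as Docker layer caches, are invalidated instead when volatile metadata changes the bytes of an otherwise identical file.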

stablewrite makes the write conditional on the finalized artifact:

with save_if_changed("report.csv") as saver:
    saver.path.write_text(render_report(), encoding="utf-8")

Same data means no replacement, no new mtime, and no accidental rebuild.
