
Snowflake warehouse cost, performance, and efficiency analysis library


❄️ snowflake-cost-analyzer

Modular, typed, production-grade Snowflake cost & performance analytics. Refactored from a 17,000-line monolithic notebook into a clean Python package with strict separation of concerns.

Version 0.2.0
Python ≥ 3.9
License Internal — ProColombia
Quality 10/10 (was 9.5/10 in v0.1.0)

🏛️ Architecture (10/10)

The package strictly separates what is computed from how it is presented:

┌─────────────────────────────────────────────────────────────────┐
│                       CORE LIBRARY                              │
│              snowflake_cost_analyzer/                           │
│                                                                 │
│  config.py        ← @dataclass for credentials + parameters     │
│  connection.py    ← Context-managed Snowflake connection        │
│  constants.py     ← Single source of truth (prices, sizes)      │
│  analyzer.py      ← Orchestrator: 28 analytical methods         │
│                                                                 │
│  queries/         ← Pure SQL builder functions (no I/O)         │
│    cost.py        performance.py    attribution.py              │
│    temporal.py    advanced.py                                   │
│    storage.py    ← NEW v0.2.0 (Phase 9)                         │
│    qas.py        ← NEW v0.2.0 (Phase 10)                        │
│                                                                 │
│  utils/           ← DRY helpers (validators, dataframe, timer)  │
│  viz/             ← Plotly chart factories (optional extra)     │
│  views/           ← ViewsManager (materialise queries as views) │
│                                                                 │
│  RESPONSIBILITY: extract data, return DataFrames. Nothing else. │
└─────────────────────────────────────────────────────────────────┘
                              ▲
                              │  imports
                              │
┌─────────────────────────────────────────────────────────────────┐
│                  PRESENTATION LAYER                             │
│                       scripts/                                  │
│                                                                 │
│  generar_reportes.py   ← CLI report generator                   │
│  templates/                                                     │
│    report_template.html  ← Jinja2 HTML template                 │
│                                                                 │
│  RESPONSIBILITY: turn DataFrames into HTML, Excel, PPTX.        │
│  Imports the library — never the reverse.                       │
└─────────────────────────────────────────────────────────────────┘

Why this matters: the library can be used by any consumer (notebook, dashboard, scheduled job, BI tool), and the report generator can be replaced or extended without touching SQL or domain logic. This is the architectural boundary that took the project from 9.5 to 10.
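
For example, the same library surface serves a scheduled job with no presentation code at all. A minimal sketch (the CSV export is illustrative, not part of the package):

from snowflake_cost_analyzer import (
    SnowflakeCostAnalyzer, SnowflakeConfig, AnalysisConfig,
)

# Cron-style consumer: pull one DataFrame, persist it, exit.
sf = SnowflakeConfig.from_env(warehouse="WH_IDMC")
params = AnalysisConfig.last_n_days(7, warehouse_name="WH_IDMC")

with SnowflakeCostAnalyzer(sf, params) as analyzer:
    df = analyzer.get_all_warehouses_summary()
    df.to_csv("daily_summary.csv", index=False)  # output format chosen by the consumer, not the library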


📦 Installation

pip install -e .                # core only
pip install -e .[viz]           # + Plotly charts
pip install jinja2 xlsxwriter python-pptx kaleido   # for the report generator

🔐 Credentials (never hardcoded)

Two supported sources, in priority order:

1. Google Colab Secrets (preferred for Colab):

from snowflake_cost_analyzer import SnowflakeConfig
sf = SnowflakeConfig.from_colab_secrets(warehouse="WH_IDMC")

2. Environment variables (preferred for CI/CD and local scripts):

export SF_USER=...
export SF_PASSWORD=...
export SF_ACCOUNT=...
sf = SnowflakeConfig.from_env(warehouse="WH_IDMC")

The report generator script (scripts/generar_reportes.py) tries Colab Secrets first and falls back to env vars automatically — you don't choose.
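
Conceptually the fallback reduces to a try/except around the Colab path. A sketch of the pattern (not the script's literal code):

from snowflake_cost_analyzer import SnowflakeConfig

# Prefer Colab Secrets; fall back to SF_USER / SF_PASSWORD / SF_ACCOUNT.
try:
    sf = SnowflakeConfig.from_colab_secrets(warehouse="WH_IDMC")
except Exception:  # not running in Colab, or secrets not configured
    sf = SnowflakeConfig.from_env(warehouse="WH_IDMC")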


🚀 Quick start (library)

from snowflake_cost_analyzer import (
    SnowflakeCostAnalyzer, SnowflakeConfig, AnalysisConfig,
)

sf = SnowflakeConfig.from_colab_secrets(warehouse="WH_IDMC")
params = AnalysisConfig.last_n_days(30, warehouse_name="WH_IDMC")

with SnowflakeCostAnalyzer(sf, params) as analyzer:
    df = analyzer.get_all_warehouses_summary()
    print(df[["WAREHOUSE_NAME", "TOTAL_COST_USD"]].head())

📊 Available analyses (28 methods)

Phase 1 — Cost

  • get_all_warehouses_summary()
  • get_daily_warehouse_summary()
  • get_cost_reconciliation()

Phase 2-3 — Performance

  • get_top_expensive_queries() · get_critical_alerts() · get_performance_trends()
  • get_queue_analysis() · get_cache_analysis() · get_duplicate_queries()
  • get_optimization_recommendations()

Phase 4 — Attribution

  • get_user_analysis() · get_application_analysis() · get_user_application_matrix()

Phase 5-6 — Temporal & Multi-cluster

  • get_hourly_patterns() · get_weekly_comparison()
  • get_multicluster_load_history() · get_inferred_clusters()

Phase 7-8 — Advanced

  • get_warehouse_metadata() · get_warehouse_events() · get_uptime_analysis()
  • get_12month_trend() · get_monthly_forecast() · get_anomaly_detection()

Phase 9 — Storage (NEW v0.2.0)

  • get_database_storage_costs(lookback_days=30, storage_price_per_tb_month=...)
  • get_table_storage_costs(top_n=50, storage_price_per_tb_month=...)
  • get_storage_growth_trend(lookback_days=90, storage_price_per_tb_month=...)
  • get_stage_storage_costs(lookback_days=30, stage_price_per_tb_month=...)

Phase 10 — Query Acceleration Service (NEW v0.2.0)

  • get_qas_analysis(warehouse_name=None) — single-row aggregated metrics + ROI
  • get_qas_detail(warehouse_name=None, top_n=50) — top-N accelerated queries

Convenience

  • run_full_analysis(include_storage=True, include_qas=True) — runs Phases 1-4 and, optionally, the storage and QAS analyses, returning a single dict of DataFrames (see the sketch below).
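
A typical consumer walks the returned dict and checks each DataFrame for emptiness. A sketch (the dict's key names are not specified here, so the loop stays generic):

# sf and params as in the quick start above
with SnowflakeCostAnalyzer(sf, params) as analyzer:
    results = analyzer.run_full_analysis(include_storage=True, include_qas=True)

for name, df in results.items():
    status = "empty (no data or feature disabled)" if df.empty else f"{len(df)} rows"
    print(f"{name}: {status}")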

🏗️ What's new in v0.2.0

1. Storage analysis (queries/storage.py)

Four pure SQL builders sourcing SNOWFLAKE.ACCOUNT_USAGE.DATABASE_STORAGE_USAGE_HISTORY, TABLE_STORAGE_METRICS, and STAGE_STORAGE_USAGE_HISTORY. Prices live in constants.py (STORAGE_PRICE_PER_TB_MONTH, STAGE_PRICE_PER_TB_MONTH) and are overridable per query call — never hardcoded in SQL.
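
A pure builder takes parameters and returns a SQL string, nothing more. The function name and exact SQL below are illustrative, but the ACCOUNT_USAGE view and the price-as-parameter pattern follow the description above:

from snowflake_cost_analyzer.constants import STORAGE_PRICE_PER_TB_MONTH

def build_database_storage_sql(  # hypothetical name; the real builder may differ
    lookback_days: int = 30,
    storage_price_per_tb_month: float = STORAGE_PRICE_PER_TB_MONTH,
) -> str:
    """Pure function: no connection, no logging, just SQL text."""
    return f"""
    SELECT
        DATABASE_NAME,
        AVG(AVERAGE_DATABASE_BYTES) / POWER(1024, 4) AS TOTAL_STORAGE_TB,
        AVG(AVERAGE_DATABASE_BYTES) / POWER(1024, 4)
            * {storage_price_per_tb_month} AS MONTHLY_COST_USD
    FROM SNOWFLAKE.ACCOUNT_USAGE.DATABASE_STORAGE_USAGE_HISTORY
    WHERE USAGE_DATE >= DATEADD('day', -{lookback_days}, CURRENT_DATE())
    GROUP BY DATABASE_NAME
    ORDER BY TOTAL_STORAGE_TB DESC
    """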

2. QAS analysis (queries/qas.py)

Two pure SQL builders for QAS effectiveness and ROI. The duplicated CASE warehouse_size WHEN ... THEN ... block from the original notebook is eliminated — both queries reuse build_credits_case_sql() from constants.py so the warehouse credits-per-hour table lives in exactly one place. Empty results (QAS disabled, missing grants) are handled gracefully and never raise.
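
The deduplication works because the size-to-credits mapping is rendered into SQL by a single function. A sketch of the idea (the dict uses Snowflake's published credits-per-hour rates; the actual implementation in constants.py may differ in detail):

# constants.py: single source of truth for warehouse credit rates
WAREHOUSE_CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
    "X-Large": 16, "2X-Large": 32, "3X-Large": 64, "4X-Large": 128,
}

def build_credits_case_sql(column: str = "warehouse_size") -> str:
    """Render the mapping as a SQL CASE expression, reused by every query that needs it."""
    whens = "\n".join(
        f"WHEN '{size}' THEN {credits}"
        for size, credits in WAREHOUSE_CREDITS_PER_HOUR.items()
    )
    return f"CASE {column}\n{whens}\nELSE 0\nEND"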

3. New utility: clean_timezones_df

Lives in utils/dataframe.py. Strips timezone information from datetime columns so DataFrames can be written to Excel without xlsxwriter complaining. Non-mutating: returns a copy.
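
In pandas terms the helper amounts to tz_localize(None) on every timezone-aware column of a copy. A sketch consistent with that description (the packaged implementation may differ in detail):

import pandas as pd

def clean_timezones_df(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with timezone info stripped from tz-aware datetime columns."""
    out = df.copy()  # non-mutating: the caller's DataFrame is untouched
    for col in out.columns:
        if isinstance(out[col].dtype, pd.DatetimeTZDtype):
            out[col] = out[col].dt.tz_localize(None)
    return out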

4. Report generator (presentation layer)

A standalone script (scripts/generar_reportes.py) that imports the library and produces three artifacts:

Output       Tool                  Content
-----------  --------------------  -------------------------------------------------------------
report.html  Jinja2 + Plotly→PNG   KPIs, health score, top findings/recs, embedded charts, preview tables
report.xlsx  pandas + xlsxwriter   One sheet per analysis (16 sheets), executive KPI sheet first
report.pptx  python-pptx           5-slide executive deck: cover, health score, KPIs, top 5 findings, top 5 recommendations

The script does not require the library's viz/ subpackage: it tries to use it for inline charts and falls back gracefully if the [viz] extra is not installed.
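
The graceful fallback is the standard optional-import pattern. A sketch (not the script's literal code):

# Charts are optional: degrade to tables-only reports if the [viz] extra is absent.
try:
    from snowflake_cost_analyzer import viz
    HAS_VIZ = True
except ImportError:
    viz = None
    HAS_VIZ = False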


▶️ Generating reports

# Make sure templates/ is alongside scripts/
python scripts/generar_reportes.py \
    --warehouse WH_IDMC \
    --days 30 \
    --credit-price 3.0 \
    --out-dir out \
    --templates-dir templates

Optional flags:

  • --no-storage — skip Phase 9 if your role lacks SELECT on the storage views
  • --no-qas — skip Phase 10 if QAS is disabled or grants are missing

In Google Colab:

!python scripts/generar_reportes.py --warehouse WH_IDMC --days 30

(Colab Secrets SF_USER, SF_PASSWORD, SF_ACCOUNT must be configured.)


🧮 Health Score formula

The report generator computes a 0–100 health score from the analyzer's DataFrames using a pure function (compute_executive_kpis):

score = 100
       − idle_pct                 # 1 point per % of idle time
       − min(20, spillage_pct)    # cap spillage penalty at 20
       − min(15, failed_pct × 3)  # cap failed-query penalty at 15

Score   Band   Action
-----   ----   -----------------------------------
≥ 80    Good   Maintain current configuration
60-79   Warn   Review top findings within 1 sprint
< 60    Bad    Immediate optimisation required

The formula intentionally penalises idle time most aggressively because it is the largest controllable cost lever in most Snowflake accounts.
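
As a worked example: a warehouse idle 12% of the time, with 35% spillage and a 6% failure rate, scores 100 − 12 − 20 − 15 = 53, landing in the Bad band. In code (a sketch of the formula above; compute_executive_kpis itself works on the analyzer's DataFrames, not bare percentages):

def health_score(idle_pct: float, spillage_pct: float, failed_pct: float) -> float:
    """Apply the 0-100 scoring formula with its two penalty caps."""
    score = 100 - idle_pct - min(20, spillage_pct) - min(15, failed_pct * 3)
    return max(0.0, score)  # clamp at zero is an assumption; the source formula has no explicit floor

print(health_score(12, 35, 6))  # 53.0 → Bad: immediate optimisation required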


📐 Design rules (the non-negotiables)

These rules govern every change to the codebase:

  1. Single source of truth. Constants live in constants.py only. No magic numbers in SQL or logic. The warehouse credits-per-hour table appears exactly once in the entire package (build_credits_case_sql()).

  2. Pure SQL builders. Functions in queries/ accept config and return strings. They never open connections, never log, never transform DataFrames.

  3. Single execution path. Every analytical method routes through SnowflakeCostAnalyzer._execute_query() for logging, timing, and the views cache. No exceptions.

  4. Centralised type coercion. Numeric columns are converted via to_float(df, [...]) from utils/dataframe.py. The original notebook duplicated this in 4 places — now zero.

  5. Fail Fast. Validation happens in __post_init__ and via utils/validators.py, before any Snowflake query is executed.

  6. No monkey-patching. New functionality is added through proper class methods, not by attaching functions at runtime as the original notebook did (SnowflakeWarehouseAnalyzer.get_X = get_X).

  7. Library / presentation separation. The library returns DataFrames. Nothing in snowflake_cost_analyzer/ imports jinja2, xlsxwriter, or python-pptx. All presentation code lives in scripts/.

  8. Credentials never hardcoded. Always Colab Secrets or env vars, loaded through SnowflakeConfig.from_colab_secrets() or SnowflakeConfig.from_env().

  9. Type hints + English docstrings everywhere. Python 3.9+ syntax (list[str], dict[str, int], X | None).

  10. Graceful empty handling. QAS, storage, and any analysis that might legitimately return zero rows (missing grants, feature disabled) returns an empty DataFrame instead of raising. Callers check df.empty.
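
Rules 1, 5, and 8 meet in the config object: validation runs in __post_init__, before any query executes. A sketch of the pattern (the class name is an illustrative stand-in, and any fields beyond the SF_* credentials are assumptions):

import os
from dataclasses import dataclass

@dataclass
class ConfigSketch:  # illustrative stand-in for SnowflakeConfig
    user: str
    password: str
    account: str
    warehouse: str

    def __post_init__(self) -> None:
        # Fail fast: reject incomplete credentials before touching Snowflake.
        for name in ("user", "password", "account"):
            if not getattr(self, name):
                raise ValueError(f"Missing required credential: {name}")

    @classmethod
    def from_env(cls, warehouse: str) -> "ConfigSketch":
        return cls(
            user=os.environ.get("SF_USER", ""),
            password=os.environ.get("SF_PASSWORD", ""),
            account=os.environ.get("SF_ACCOUNT", ""),
            warehouse=warehouse,
        )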


📁 File layout

Snowflake Cost Analyzer/
├── snowflake_cost_analyzer/
│   ├── __init__.py
│   ├── _version.py
│   ├── analyzer.py              ← + 6 new methods (storage + QAS)
│   ├── config.py
│   ├── connection.py
│   ├── constants.py             ← + STORAGE_PRICE_PER_TB_MONTH / STAGE_PRICE_PER_TB_MONTH
│   ├── queries/
│   │   ├── __init__.py
│   │   ├── advanced.py
│   │   ├── attribution.py
│   │   ├── cost.py
│   │   ├── performance.py
│   │   ├── temporal.py
│   │   ├── storage.py           ← NEW
│   │   └── qas.py               ← NEW
│   ├── utils/
│   │   ├── dataframe.py         ← + clean_timezones_df
│   │   ├── timer.py
│   │   └── validators.py
│   ├── views/
│   │   └── manager.py
│   └── viz/
│       ├── cost_charts.py
│       ├── performance_charts.py
│       └── temporal_charts.py
├── scripts/
│   └── generar_reportes.py      ← NEW: HTML + Excel + PPTX
├── templates/
│   └── report_template.html     ← NEW: Jinja2 template
├── notebooks/
│   └── ejemplo_uso.py
├── README.md                    ← this file
├── CHANGELOG.md
└── pyproject.toml

🔄 Upgrading from v0.1.0 → v0.2.0

Apply these patches in order:

  1. constants.py — replace with the new file (adds two storage price constants).
  2. queries/__init__.py — replace with the new docstring (mentions storage and qas).
  3. queries/storage.py — new file.
  4. queries/qas.py — new file.
  5. utils/dataframe.py — append clean_timezones_df from utils/dataframe_addition.py.
  6. analyzer.py — apply the three blocks from analyzer_additions.py (imports, new methods, updated run_full_analysis).
  7. scripts/generar_reportes.py — new file.
  8. templates/report_template.html — new file.
  9. Bump _version.py to 0.2.0.

No existing methods are modified. The upgrade is fully backward-compatible.


🧪 Testing the upgrade

from snowflake_cost_analyzer import SnowflakeCostAnalyzer, SnowflakeConfig, AnalysisConfig

sf = SnowflakeConfig.from_colab_secrets(warehouse="WH_IDMC")
params = AnalysisConfig.last_n_days(7, warehouse_name="WH_IDMC")

with SnowflakeCostAnalyzer(sf, params) as analyzer:
    # New: storage
    df_db_storage = analyzer.get_database_storage_costs(lookback_days=30)
    print(f"Databases: {len(df_db_storage)}")
    print(f"Total TB: {df_db_storage['TOTAL_STORAGE_TB'].sum():.2f}")

    # New: QAS (handles disabled QAS gracefully)
    df_qas = analyzer.get_qas_analysis()
    if df_qas.empty:
        print("QAS: not enabled or no eligible queries")
    else:
        print(f"QAS ROI: {df_qas['ROI_PCT'].iloc[0]:.1f}%")

    # New: full pipeline including storage + QAS
    results = analyzer.run_full_analysis()
    print(f"Total DataFrames produced: {len(results)}")

Then generate the executive reports:

python scripts/generar_reportes.py --warehouse WH_IDMC --days 30
ls out/
# snowflake_report_WH_IDMC_20260409_1430.html
# snowflake_report_WH_IDMC_20260409_1430.xlsx
# snowflake_report_WH_IDMC_20260409_1430.pptx

📜 CHANGELOG

v0.2.0 (2026-04-09)

  • Added queries/storage.py (4 builders for Phase 9 storage analysis)
  • Added queries/qas.py (2 builders for Phase 10 QAS analysis)
  • Added STORAGE_PRICE_PER_TB_MONTH and STAGE_PRICE_PER_TB_MONTH constants (single source of truth)
  • Added 6 new methods to SnowflakeCostAnalyzer (4 storage + 2 QAS)
  • Added clean_timezones_df utility in utils/dataframe.py
  • Added scripts/generar_reportes.py — executive report generator
  • Added templates/report_template.html — Jinja2 HTML template
  • Modified run_full_analysis() — now accepts include_storage and include_qas flags and returns up to 16 DataFrames
  • Eliminated the duplicated CASE warehouse_size block from QAS analysis (now reuses build_credits_case_sql())

v0.1.0

  • Initial modular refactor from monolithic notebook (17,000 lines → library).
