Data quality diagnostics for tabular datasets — surfaces hidden problems in plain English

datascope

Data created upstream (by manufacturing teams entering UPCs, inventory staff assigning product codes, offshore developers choosing column types) silently breaks systems downstream. A product code contains letters where EDI expects numbers. Fifteen "N/A" strings sit buried in 500 otherwise numeric rows; pandas silently converts them to missing values, so 3% of the rows drop out of every calculation.

datascope finds these problems, explains what's wrong in plain English, and tells you what to fix. It reads each cell's actual type (not what pandas infers), detects hidden quality issues, classifies their severity by downstream impact, and generates a professional diagnostic report.


What It Finds

Detection                   Example                                                      Severity
Mixed types                 485 numbers + 15 strings in a "numeric" column               Critical
Sentinel values             "N/A", "TBD", "pending" hiding in numeric data               Critical
Leading-zero inconsistency  "00123" alongside "456": keys that won't match               Warning
Mixed date formats          "01/15/2026" and "2026-01-15" in the same column             Warning
Suspected duplicate IDs     98% unique in an ID column; the other 2% will fan out joins  Warning
Near-constant columns       1 distinct value across 10,000 rows                          Info

Each finding is expressed as assumption vs. reality: what the data appears to be vs. what it actually contains. Every finding includes a downstream impact explanation, a fix recommendation, and a prevention rule.
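
The assumption-vs-reality shape can be pictured as a small record. The sketch below is illustrative only: the `Finding` class, its field names, and the `check_leading_zeros` helper are hypothetical and are not taken from datascope's source. It shows how one detection from the table (leading-zero inconsistency) might be turned into a finding with all five parts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical container for one finding; field names are illustrative."""
    column: str
    assumption: str   # what the data appears to be
    reality: str      # what it actually contains
    impact: str       # what breaks downstream
    fix: str          # what to do now
    prevention: str   # the rule that keeps it from recurring


def check_leading_zeros(column: str, values: list[str]) -> Optional[Finding]:
    """Flag all-digit keys of differing widths where some are zero-padded."""
    digits = [v for v in values if v.isdigit()]
    padded = [v for v in digits if len(v) > 1 and v.startswith("0")]
    widths = {len(v) for v in digits}
    if not padded or len(widths) == 1:
        return None  # no padded keys, or one fixed width: treated as consistent
    return Finding(
        column=column,
        assumption="every key is written the same way",
        reality=f"{len(padded)} of {len(digits)} numeric keys carry leading zeros",
        impact="keys such as '00123' and '123' will not match in joins or lookups",
        fix="store the column as text and pad every key to one agreed width",
        prevention="validate key width and type at data entry",
    )


finding = check_leading_zeros("sku", ["00123", "456", "789"])
print(finding.reality)
```

The width check is a deliberate simplification: a fixed-width code list such as "00123" and "12345" is not flagged, while "00123" next to "456" is.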


Installation

pip install datascope-dq

Or install from source:

git clone https://github.com/MsShawnP/datascope.git
cd datascope
pip install -e .

Usage

# Analyze an Excel file
datascope data.xlsx

# Analyze a CSV
datascope sales_export.csv

# Specify a sheet and output directory
datascope data.xlsx --sheet Revenue --output-dir ./client_reports

The tool produces a PDF diagnostic report and prints a summary to stdout:

datascope: Analyzing sample_mixed_types.xlsx...
  200 rows x 6 columns

Found 4 findings:
  2 Critical  ########
  1 Warning   ####
  1 Info      ####

Top critical findings:
  * revenue_mixed: 15 non-numeric values hiding in an otherwise numeric column
  * status: Sentinel values 'N/A' and 'TBD' in numeric data

Report saved: reports/sample_mixed_types_diagnostic.pdf

The Report

The PDF report is structured for non-technical readers — no jargon, no composite scores, no unexplained metrics.

Executive Summary — overall health assessment, finding counts by severity, top critical issues highlighted.

Findings by Severity — each finding presented as a card:

  • Assumption: what the data appears to be
  • Reality: what it actually contains
  • Impact: what breaks downstream
  • Recommended Fix: what to do now
  • Prevention Rule: what right looks like going forward

Field Inventory — summary table of all columns with their detected issue types and severity.

Findings are color-coded (red/amber/blue) and grouped by severity so readers know what to fix first.


How It Works

Most tools let pandas (or the SQL driver, or Excel) decide column types. A column with 485 numbers and 15 "N/A" strings loads as float64: pandas treats "N/A" as a missing-value marker by default, so the strings become NaN, the type problem disappears from view, and every downstream calculation is quietly wrong.
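
To make the arithmetic concrete, here is a standard-library imitation of that lenient coercion. It does not call pandas; the `coerce` helper and the sample values are assumptions chosen to mirror the 485 + 15 example.

```python
import math


def coerce(value):
    """Imitate lenient numeric coercion: anything unparseable becomes NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Illustrative column: 485 real numbers and 15 placeholder strings.
column = [100.0] * 485 + ["N/A"] * 15

coerced = [coerce(v) for v in column]
kept = [v for v in coerced if not math.isnan(v)]

rows_total = len(column)
rows_used = len(kept)
share_lost = (rows_total - rows_used) / rows_total

print(f"{rows_used} of {rows_total} rows used; {share_lost:.0%} silently excluded")
```

Nothing in the coerced list records that 15 cells were once strings, which is why the problem has to be caught before coercion happens.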

datascope reads each cell's actual Python type via openpyxl (for Excel) or raw-string inference (for CSV). This cell-level type detection is always on — there's no flag to enable it because skipping it defeats the purpose.
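
For the CSV case, raw-string inference can be sketched as a chain of attempted parses. The order below follows the None → int → float → bool → datetime → str chain noted under Project Structure; the function names and the two date formats are assumptions for illustration, not datascope's actual code.

```python
import csv
import io
from collections import Counter
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed formats, for illustration only


def infer_cell(raw: str):
    """Return the narrowest Python value a raw CSV string can represent."""
    text = raw.strip()
    if text == "":
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    return text  # nothing narrower fits, so it stays a string


def column_type_counts(csv_text: str) -> dict[str, Counter]:
    """Count inferred Python type names per column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    counts = {name: Counter() for name in header}
    for row in reader:
        for name, raw in zip(header, row):
            counts[name][type(infer_cell(raw)).__name__] += 1
    return counts


sample = "sku,revenue,shipped\n00123,19.99,2026-01-15\n456,N/A,01/15/2026\n789,42,\n"
for name, counter in column_type_counts(sample).items():
    print(name, dict(counter))
```

A column whose counter holds more than one type name is a mixed-type candidate. Note that `int("00123")` yields 123, which is exactly how leading zeros vanish once a key is cast to a number.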

The analysis pipeline:

  1. Load — read with cell-level type preservation (no silent coercion)
  2. Detect — five analyzers scan for type inconsistencies, sentinels, format issues, and cardinality anomalies
  3. Classify — severity assigned by downstream impact (critical = silent data loss, warning = likely misinterpretation, info = worth noting)
  4. Compose — plain-English narrative generated for each finding
  5. Report — professional PDF rendered with reportlab
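
The first four steps can be sketched end to end in a few lines. Everything here is hypothetical: the function names, the dictionary shape of a finding, and the sentinel list (limited to the examples from "What It Finds") are assumptions, and the PDF step is left out.

```python
SENTINELS = {"n/a", "tbd", "pending"}  # examples from "What It Finds"; the real list is unknown
SEVERITY_BY_KIND = {"sentinel": "Critical", "near_constant": "Info"}
RANK = {"Critical": 0, "Warning": 1, "Info": 2}


def detect_sentinels(column, cells):
    """Detect placeholder strings sitting among numeric cells."""
    hits = [c for c in cells if isinstance(c, str) and c.strip().lower() in SENTINELS]
    numbers = [c for c in cells if isinstance(c, (int, float)) and not isinstance(c, bool)]
    if hits and numbers:
        return [{"kind": "sentinel", "column": column,
                 "count": len(hits), "values": sorted(set(hits))}]
    return []


def detect_constant(column, cells):
    """Detect a column whose cells all hold the same value."""
    if len(cells) > 1 and len(set(cells)) == 1:
        return [{"kind": "near_constant", "column": column,
                 "count": len(cells), "values": [cells[0]]}]
    return []


def classify(finding):
    """Attach a severity based on the kind of finding."""
    return {**finding, "severity": SEVERITY_BY_KIND[finding["kind"]]}


def compose(finding):
    """Attach a plain-English sentence describing the finding."""
    if finding["kind"] == "sentinel":
        text = (f"{finding['column']}: {finding['count']} placeholder value(s) "
                f"{finding['values']} in numeric data")
    else:
        text = f"{finding['column']}: one distinct value across {finding['count']} rows"
    return {**finding, "text": text}


def run(table):
    """table maps column name -> cells with their Python types preserved (the Load step)."""
    found = [f for column, cells in table.items()
             for analyzer in (detect_sentinels, detect_constant)
             for f in analyzer(column, cells)]
    return sorted((compose(classify(f)) for f in found), key=lambda f: RANK[f["severity"]])


report = run({"region": ["EU"] * 4, "revenue": [10.5, "N/A", 12.0, "TBD"]})
for item in report:
    print(item["severity"], "-", item["text"])
```

Keeping each analyzer as a function that returns a list lets new detectors be added without touching the classify, compose, or sort stages.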

Severity Model

Level     Meaning                                                Examples
Critical  Silent data loss or incorrect calculations will occur  Mixed types in numeric columns; sentinel values pandas drops without warning
Warning   Key mismatches or misinterpretation likely             Leading-zero stripping; ambiguous date formats; duplicate IDs
Info      Worth noting, no direct downstream breakage            Near-constant columns; unusual cardinality
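
As an illustration of why duplicate IDs are rated Warning, the sketch below joins two tiny tables with a hand-written inner join. The data and the `inner_join` helper are invented for the example; the point is only that one duplicated key multiplies matching rows and inflates totals.

```python
def inner_join(left, right, key):
    """Join two lists of dicts on `key`, emitting one row per matching pair."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Illustrative data: "A1" appears twice in the product list.
products = [
    {"sku": "A1", "name": "Widget"},
    {"sku": "A1", "name": "Widget (duplicate entry)"},
    {"sku": "B2", "name": "Gadget"},
]
orders = [{"sku": "A1", "qty": 10}, {"sku": "B2", "qty": 5}]

joined = inner_join(orders, products, "sku")
total_before = sum(o["qty"] for o in orders)
total_after = sum(j["qty"] for j in joined)

print(f"{len(orders)} orders became {len(joined)} joined rows")
print(f"quantity total went from {total_before} to {total_after}")
```

The join itself raises no error, so the inflated total only shows up if someone compares it against the source.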

Project Structure

datascope/
├── loaders/          # Excel and CSV with cell-level type tracking
│   ├── excel.py      # openpyxl-based, preserves per-cell Python types
│   ├── csv_loader.py # Raw string inference (None → int → float → bool → datetime → str)
│   └── base.py       # Extension-based dispatch
├── analyzers/        # Five detectors, each returns list[Finding]
│   ├── type_consistency.py
│   ├── sentinel.py
│   ├── format_check.py
│   └── cardinality.py
├── findings/         # Severity classifier + NL template engine
│   ├── severity.py   # Impact-based classification rules
│   ├── templates.py  # Plain-English templates per finding sub-type
│   ├── composer.py   # Template dispatch
│   └── pipeline.py   # classify → compose → sort
├── reports/
│   └── pdf.py        # Professional PDF with reportlab
└── cli.py            # argparse CLI, pipeline orchestration
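
Extension-based dispatch, as described for base.py, usually comes down to a lookup from file suffix to loader function. The sketch below is an assumption about the general shape, not the project's code: `load_excel`, `load_csv`, `LOADERS`, and `pick_loader` are hypothetical names, and the loader bodies are placeholders.

```python
from pathlib import Path
from typing import Callable


def load_excel(path: Path) -> str:
    # Placeholder body: a real loader would read cells with openpyxl.
    return f"excel loader would read {path.name}"


def load_csv(path: Path) -> str:
    # Placeholder body: a real loader would apply raw-string inference.
    return f"csv loader would read {path.name}"


# Only the two extensions shown in the Usage section are mapped here.
LOADERS: dict[str, Callable[[Path], str]] = {
    ".xlsx": load_excel,
    ".csv": load_csv,
}


def pick_loader(filename: str) -> Callable[[Path], str]:
    """Choose a loader from the file extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return LOADERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None


loader = pick_loader("sales_export.csv")
print(loader(Path("sales_export.csv")))
```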

Requirements

  • Python 3.10+
  • pandas >= 2.0
  • openpyxl >= 3.1
  • reportlab >= 4.0
  • defusedxml >= 0.7

License

MIT
