Data quality diagnostics for tabular datasets — surfaces hidden problems in plain English

datascope

Data created upstream (by manufacturing teams entering UPCs, inventory staff assigning product codes, offshore developers choosing column types) silently breaks systems downstream. A product code with letters where EDI expects numbers. Fifteen "N/A" strings buried in a 500-row numeric column that pandas silently turns into NaN, so 3% of the rows drop out of every calculation.

datascope finds these problems, explains what's wrong in plain English, and tells you what to fix. It reads each cell's actual type (not what pandas infers), detects hidden quality issues, classifies their severity by downstream impact, and generates a professional diagnostic report.

What It Finds

Detection	Example	Severity
Mixed types	485 numbers + 15 strings in a "numeric" column	Critical
Sentinel values	"N/A", "TBD", "pending" hiding in numeric data	Critical
Leading-zero inconsistency	"00123" alongside "456" — keys that won't match	Warning
Mixed date formats	"01/15/2026" and "2026-01-15" in the same column	Warning
Suspected duplicate IDs	98% unique in an ID column — the other 2% will fan out joins	Warning
Near-constant columns	1 distinct value across 10,000 rows	Info

Each finding is expressed as assumption vs. reality: what the data appears to be vs. what it actually contains. Every finding includes a downstream impact explanation, a fix recommendation, and a prevention rule.
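
Two of the checks in the table are easy to illustrate. The sketch below is a minimal approximation and not datascope's actual implementation; the function names `leading_zero_mismatch` and `duplicate_ids` are hypothetical and the sample values are made up.

```python
from collections import Counter


def leading_zero_mismatch(values):
    """Return True when some digit-only keys carry leading zeros and others do not."""
    digit_keys = [v for v in values if isinstance(v, str) and v.isdigit()]
    padded = [v for v in digit_keys if len(v) > 1 and v.startswith("0")]
    return bool(padded) and len(padded) < len(digit_keys)


def duplicate_ids(values):
    """Return the IDs that occur more than once, with their counts."""
    return {v: n for v, n in Counter(values).items() if n > 1}


keys = ["00123", "456", "00789"]   # illustrative keys
ids = ["A1", "A2", "A2", "A3"]     # illustrative IDs

print(leading_zero_mismatch(keys))  # True
print(duplicate_ids(ids))           # {'A2': 2}
```

A key stored as "00123" on one side and "123" on the other never matches in a join, and each duplicated ID multiplies the rows it joins to.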

Installation

pip install datascope-dq

Or install from source:

git clone https://github.com/MsShawnP/datascope.git
cd datascope
pip install -e .

Usage

# Analyze an Excel file
datascope data.xlsx

# Analyze a CSV
datascope sales_export.csv

# Specify a sheet and output directory
datascope data.xlsx --sheet Revenue --output-dir ./client_reports

The tool produces a PDF diagnostic report and prints a summary to stdout:

datascope: Analyzing sample_mixed_types.xlsx...
  200 rows x 6 columns

Found 4 findings:
  2 Critical  ########
  1 Warning   ####
  1 Info      ####

Top critical findings:
  * revenue_mixed: 15 non-numeric values hiding in an otherwise numeric column
  * status: Sentinel values 'N/A' and 'TBD' in numeric data

Report saved: reports/sample_mixed_types_diagnostic.pdf
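
To analyze several files in one go, the CLI can be driven from a short script. This is a sketch that uses only the flags shown above (`--sheet` and `--output-dir`); the helper name `build_command` is hypothetical, and the sketch only prints the commands so it can be inspected before anything runs.

```python
import shlex
from pathlib import Path


def build_command(path, sheet=None, output_dir=None):
    """Build the argv list for one datascope run, using only the documented flags."""
    cmd = ["datascope", str(path)]
    if sheet is not None:
        cmd += ["--sheet", sheet]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    return cmd


for name in ["data.xlsx", "sales_export.csv"]:
    cmd = build_command(Path(name), output_dir="./client_reports")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```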

The Report

The PDF report is structured for non-technical readers — no jargon, no composite scores, no unexplained metrics.

Executive Summary — overall health assessment, finding counts by severity, top critical issues highlighted.

Findings by Severity — each finding presented as a card:

Assumption: what the data appears to be
Reality: what it actually contains
Impact: what breaks downstream
Recommended Fix: what to do now
Prevention Rule: what right looks like going forward

Field Inventory — summary table of all columns with their detected issue types and severity.

Findings are color-coded (red/amber/blue) and grouped by severity so readers know what to fix first.
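
The five-part card described above maps naturally onto a small record type. The sketch below is hypothetical: `FindingCard` and its field names mirror the card sections and are not taken from datascope's code, and the example wording is illustrative.

```python
from dataclasses import dataclass


@dataclass
class FindingCard:
    # Hypothetical record: one field per card section listed in this README.
    assumption: str
    reality: str
    impact: str
    recommended_fix: str
    prevention_rule: str

    def render(self):
        """Return the card as labelled plain-text lines."""
        return "\n".join([
            f"Assumption: {self.assumption}",
            f"Reality: {self.reality}",
            f"Impact: {self.impact}",
            f"Recommended Fix: {self.recommended_fix}",
            f"Prevention Rule: {self.prevention_rule}",
        ])


card = FindingCard(
    assumption="revenue_mixed is a numeric column",
    reality="15 of its values are non-numeric strings",
    impact="Totals and averages silently skip those rows",
    recommended_fix="Replace the strings with real numbers or empty cells",
    prevention_rule="Reject non-numeric input in this column at entry time",
)
print(card.render())
```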

How It Works

Most tools let pandas (or the SQL driver, or Excel) decide column types. A column with 485 numbers and 15 strings such as "N/A" can end up as float64: the strings become NaN, the type problem disappears, and every downstream calculation is quietly wrong.
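
The arithmetic behind that example can be reproduced with the standard library. This sketch only mimics lossy numeric coercion; it does not call pandas, the helper `coerce` is hypothetical, and the cell values are made up.

```python
import math


def coerce(value):
    """Mimic lossy numeric coercion: anything that is not a number becomes NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


raw = [100.0] * 485 + ["N/A"] * 15           # illustrative column: 485 numbers, 15 strings
column = [coerce(v) for v in raw]
kept = [v for v in column if not math.isnan(v)]

print(len(raw), len(kept))                   # 500 485
print(round(1 - len(kept) / len(raw), 3))    # 0.03, the share of rows silently excluded
```

Nothing fails and no warning is raised, yet any count, sum, or average computed afterwards is based on 485 rows instead of 500.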

datascope reads each cell's actual Python type via openpyxl (for Excel) or raw-string inference (for CSV). This cell-level type detection is always on — there's no flag to enable it because skipping it defeats the purpose.
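
For CSV input, raw-string inference can be sketched as a chain of parse attempts per cell. The order below (None, int, float, bool, datetime, str) follows the one noted for `csv_loader.py` in the project structure; everything else, including the function name `infer_type` and the accepted bool and date spellings, is an assumption made for illustration.

```python
from collections import Counter
from datetime import datetime


def infer_type(raw):
    """Classify one raw CSV cell, trying None, int, float, bool, datetime, then str."""
    text = raw.strip()
    if text == "":
        return "None"
    for name, caster in (("int", int), ("float", float)):
        try:
            caster(text)
            return name
        except ValueError:
            pass
    if text.lower() in ("true", "false"):
        return "bool"
    try:
        # Assumption: only ISO 8601 dates are recognized in this sketch.
        datetime.fromisoformat(text)
        return "datetime"
    except ValueError:
        return "str"


cells = ["100", "250.5", "N/A", "", "2026-01-15", "TBD"]   # illustrative column
print(Counter(infer_type(c) for c in cells))
```

A column whose tally contains more than one type is exactly the mixed-type case that column-level inference hides.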

The analysis pipeline:

Load — read with cell-level type preservation (no silent coercion)
Detect — five analyzers scan for type inconsistencies, sentinels, format issues, and cardinality anomalies
Classify — severity assigned by downstream impact (critical = silent data loss, warning = likely misinterpretation, info = worth noting)
Compose — plain-English narrative generated for each finding
Report — professional PDF rendered with reportlab
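
The stages above can be sketched end to end on an in-memory table. This is a simplified illustration, not datascope's code: loading is replaced by a plain dict, the report by `print`, only two toy detectors are included, and every name (`detect_sentinels`, `detect_constant`, `compose`, `run_pipeline`) is hypothetical.

```python
SENTINELS = {"n/a", "tbd", "pending"}   # sentinel examples named in this README
SEVERITY = {"sentinel": "Critical", "constant": "Info"}
ORDER = {"Critical": 0, "Warning": 1, "Info": 2}


def detect_sentinels(name, cells):
    """Detect: sentinel strings sitting in a mostly numeric column."""
    numeric = [c for c in cells if isinstance(c, (int, float)) and not isinstance(c, bool)]
    hits = [c for c in cells if isinstance(c, str) and c.strip().lower() in SENTINELS]
    if hits and len(numeric) > len(cells) / 2:
        return [{"column": name, "kind": "sentinel", "count": len(hits)}]
    return []


def detect_constant(name, cells):
    """Detect: a column holding a single distinct value."""
    if cells and len(set(cells)) == 1:
        return [{"column": name, "kind": "constant", "count": len(cells)}]
    return []


def compose(finding):
    """Compose: turn a finding into one plain-English sentence."""
    if finding["kind"] == "sentinel":
        return f"{finding['column']}: {finding['count']} placeholder value(s) hiding in numeric data"
    return f"{finding['column']}: the same value repeats across all {finding['count']} rows"


def run_pipeline(columns):
    findings = []
    for name, cells in columns.items():
        for detector in (detect_sentinels, detect_constant):
            findings.extend(detector(name, cells))
    for finding in findings:
        finding["severity"] = SEVERITY[finding["kind"]]   # Classify
        finding["text"] = compose(finding)                # Compose
    return sorted(findings, key=lambda f: ORDER[f["severity"]])


columns = {"region": ["EU", "EU", "EU", "EU"], "revenue": [10, 20, "N/A", 30]}
for finding in run_pipeline(columns):
    print(finding["severity"], "-", finding["text"])
```

Keeping detection, classification, and wording in separate steps means a new detector only has to return raw findings; severity and phrasing are decided in one place.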

Severity Model

Level	Meaning	Examples
Critical	Silent data loss or incorrect calculations will occur	Mixed types in numeric columns; sentinel values pandas drops without warning
Warning	Key mismatches or misinterpretation likely	Leading-zero stripping; ambiguous date formats; duplicate IDs
Info	Worth noting, no direct downstream breakage	Near-constant columns; unusual cardinality
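
One way to encode the table above is a fixed mapping from finding kind to severity level. The sketch below is an assumption about how such rules could look, not the contents of `severity.py`; the kind names and the helper `summarize` are hypothetical.

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    CRITICAL = 0   # silent data loss or incorrect calculations
    WARNING = 1    # key mismatches or misinterpretation likely
    INFO = 2       # worth noting, no direct downstream breakage


# Hypothetical kind names, one per example in the severity table.
RULES = {
    "mixed_types": Severity.CRITICAL,
    "sentinel": Severity.CRITICAL,
    "leading_zero": Severity.WARNING,
    "date_format": Severity.WARNING,
    "duplicate_id": Severity.WARNING,
    "near_constant": Severity.INFO,
    "cardinality": Severity.INFO,
}


def summarize(kinds):
    """Count findings per severity, most severe first, skipping empty levels."""
    counts = Counter(RULES[k] for k in kinds)
    return [(level.name.title(), counts[level]) for level in Severity if counts[level]]


print(summarize(["mixed_types", "sentinel", "leading_zero", "near_constant"]))
# [('Critical', 2), ('Warning', 1), ('Info', 1)]
```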

Project Structure

datascope/
├── loaders/          # Excel and CSV with cell-level type tracking
│   ├── excel.py      # openpyxl-based, preserves per-cell Python types
│   ├── csv_loader.py # Raw string inference (None → int → float → bool → datetime → str)
│   └── base.py       # Extension-based dispatch
├── analyzers/        # Five detectors, each returns list[Finding]
│   ├── type_consistency.py
│   ├── sentinel.py
│   ├── format_check.py
│   └── cardinality.py
├── findings/         # Severity classifier + NL template engine
│   ├── severity.py   # Impact-based classification rules
│   ├── templates.py  # Plain-English templates per finding sub-type
│   ├── composer.py   # Template dispatch
│   └── pipeline.py   # classify → compose → sort
├── reports/
│   └── pdf.py        # Professional PDF with reportlab
└── cli.py            # argparse CLI, pipeline orchestration

Requirements

Python 3.10+
pandas >= 2.0
openpyxl >= 3.1
reportlab >= 4.0
defusedxml >= 0.7

License

MIT

Project details

Maintainers

msshawnp

Release history

2.2.0 (May 15, 2026)
2.1.0 (May 15, 2026, this version)

Download files

Source Distribution

datascope_dq-2.1.0.tar.gz (50.9 kB)

Uploaded May 15, 2026 Source

Built Distribution

datascope_dq-2.1.0-py3-none-any.whl (35.7 kB)

Uploaded May 15, 2026 Python 3

File details

Details for the file datascope_dq-2.1.0.tar.gz.

File metadata

Download URL: datascope_dq-2.1.0.tar.gz
Upload date: May 15, 2026
Size: 50.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datascope_dq-2.1.0.tar.gz
Algorithm	Hash digest
SHA256	`92da2a782479a0542b675c5958f359cbab506c2e4ff1696a8c72ef23f7cd8451`
MD5	`46ef728bb2d75f58b226c9fc292065e8`
BLAKE2b-256	`5259aa33b9e88f71294d592b8c3305acb0d38dfe9d633442400bd0633bdf112c`

Provenance

The following attestation bundles were made for datascope_dq-2.1.0.tar.gz:

Publisher: publish.yml on MsShawnP/datascope

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datascope_dq-2.1.0.tar.gz
- Subject digest: 92da2a782479a0542b675c5958f359cbab506c2e4ff1696a8c72ef23f7cd8451
- Sigstore transparency entry: 1549557774
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: MsShawnP/datascope@47a18369120759ffaed2a14e9d48be210034e330
- Branch / Tag: refs/tags/v2.1.0
- Owner: https://github.com/MsShawnP
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@47a18369120759ffaed2a14e9d48be210034e330
- Trigger Event: push

File details

Details for the file datascope_dq-2.1.0-py3-none-any.whl.

File metadata

Download URL: datascope_dq-2.1.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 35.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datascope_dq-2.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ebd485153480b38e92f868036e1d07234f65d468fcfc2a09448458a744d9ff23`
MD5	`b46ac1b86c50e280e4d893b627bf0d56`
BLAKE2b-256	`f9348d15379f8f7f1c95ccdc29506fe4d2ec423d84f41fe4b529bb3f7f813a1b`

Provenance

The following attestation bundles were made for datascope_dq-2.1.0-py3-none-any.whl:

Publisher: publish.yml on MsShawnP/datascope

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datascope_dq-2.1.0-py3-none-any.whl
- Subject digest: ebd485153480b38e92f868036e1d07234f65d468fcfc2a09448458a744d9ff23
- Sigstore transparency entry: 1549557784
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: MsShawnP/datascope@47a18369120759ffaed2a14e9d48be210034e330
- Branch / Tag: refs/tags/v2.1.0
- Owner: https://github.com/MsShawnP
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@47a18369120759ffaed2a14e9d48be210034e330
- Trigger Event: push
