dsqus Python package

Project description

🧩 Dataset Quality Scoring Engine — System Framework (Markdown)

1. Overview

The Dataset Quality Scoring Engine (DQS) evaluates the quality of any dataset using automated, model-agnostic metrics. The system processes user-uploaded datasets, computes embeddings, analyzes statistical and semantic properties, and outputs a standardized quality score (0–100) along with detailed submetrics.

2. High-Level Workflow

User Upload → Preprocessing → Embedding → Metric Computation → Scoring → Report Generation → Cleanup

3. Input Specifications

The system accepts:

- jsonl
- json
- txt
- csv
- folder of text/code files
- PDFs (extracted into text)

4. Preprocessing Pipeline

- Validate file format
- Convert to normalized internal format (list[str or dict])
- Clean text:
  - remove control chars
  - normalize whitespace
  - optional: strip HTML/markup
- Segment long documents into meaningful chunks
- Remove empty or invalid samples

Output: clean, structured dataset
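The text-cleaning and filtering steps above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the names `clean_text` and `preprocess` are hypothetical, and HTML stripping and chunking are left out.

```python
import re
import unicodedata

# Hypothetical helpers for illustration; not part of the dsqus API.
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove control characters and collapse runs of whitespace."""
    # Map whitespace-like characters (tab, newline, ...) to plain spaces first,
    # so words separated only by them are not glued together.
    text = "".join(" " if ch.isspace() else ch for ch in text)
    # Drop the remaining characters in Unicode category "Cc" (control).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Collapse repeated spaces and trim both ends.
    return _WHITESPACE.sub(" ", text).strip()


def preprocess(samples: list[str]) -> list[str]:
    """Clean every sample and drop those that end up empty."""
    cleaned = (clean_text(s) for s in samples)
    return [s for s in cleaned if s]
```

For example, `preprocess(["  a\tb ", "", "ok"])` keeps only the two non-empty samples, each with normalized whitespace.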

5. Embedding Generation

Two embedding flows:

5.1 Local Embeddings (Per Upload)

Used for:

- redundancy
- coherence
- diversity
- factual contradictions
- clustering/domain analysis

These embeddings exist only for the request and are deleted afterward.

5.2 Global Reference Embeddings (Static)

Used only for novelty detection.

Pre-built FAISS/Vector DB containing ~1M representative samples:

- Wikipedia
- Common Crawl samples
- C4 slices
- StackOverflow
- Books corpus
- Public domain corpora

This index is static and is never modified by user uploads.
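To make the novelty lookup concrete, here is a rough sketch that uses a brute-force search as a stand-in for the FAISS/vector DB index. The function names and the choice of cosine distance are assumptions made for illustration; a real index would replace the linear scan.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def novelty(sample: list[float], reference: list[list[float]]) -> float:
    """Distance from `sample` to its nearest neighbour in the reference set.

    Hypothetical helper: a larger distance means the sample is less like
    anything in the reference corpus, i.e. more novel.
    """
    if not reference:
        raise ValueError("reference set must not be empty")
    return min(cosine_distance(sample, ref) for ref in reference)
```

The reference list is only read, never written, which mirrors the static nature of the global index.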

6. Metric Computation

DQS computes 10 core quality metrics:

6.1 Redundancy Score

- compute embedding similarity within dataset
- clustering density = redundancy
- score = inverse redundancy

6.2 Malware / Toxicity Score

- run samples through pre-trained toxicity classifier
- aggregate severity

6.3 Diversity Score

- linguistic diversity (entropy, vocab richness)
- semantic diversity (embedding variance)

6.4 Readability Score

- Flesch–Kincaid
- sentence complexity
- coherency heuristics

6.5 Semantic Coherence

- embedding flow consistency
- perplexity using a small reference LLM

6.6 Novelty Score

- compare against global reference corpus
- nearest neighbor distance = novelty measure

6.7 Structure Quality

Applicable to:

- JSON
- code
- SQL
- XML
- YAML

Checks:

- syntax validity
- AST parsing success

6.8 Factual Conflict Score

- sample random pairs
- pass to NLI contradiction model
- aggregate contradictions

6.9 Domain Balance Score

- cluster dataset embeddings
- measure cluster distribution via entropy

6.10 Length Distribution Score

- detect outliers
- analyze token distribution
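Two of the metrics above (6.1 and 6.9) are simple enough to sketch in plain Python. Treat this as one possible reading rather than the package's method: the function names, the 0.95 similarity threshold, the pairwise near-duplicate count used in place of clustering density, and the decision to score a single-cluster dataset as 0 are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def redundancy_score(embeddings, threshold=0.95):
    """0-100, higher means less redundant (score = inverse redundancy).

    Redundancy is approximated as the fraction of sample pairs whose
    similarity reaches `threshold` (an assumed value). The pairwise loop
    is quadratic, so this is only suitable for small datasets.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 100.0
    near_duplicates = sum(1 for a, b in pairs if cosine(a, b) >= threshold)
    return 100.0 * (1.0 - near_duplicates / len(pairs))


def domain_balance_score(cluster_labels):
    """0-100 from the normalized entropy of the cluster distribution."""
    counts = Counter(cluster_labels)
    if len(counts) < 2:
        return 0.0  # assumption: a single cluster counts as fully unbalanced
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by the maximum possible entropy, log(k), to land in [0, 1].
    return 100.0 * entropy / math.log(len(counts))
```

Evenly sized clusters reach the maximum entropy and therefore the top domain balance score, while near-identical embeddings pull the redundancy score down.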

7. Composite Score Calculation

All metrics are normalized to the range 0–100.

Weighted aggregation formula:

overall_score = 0.15 * redundancy + 0.10 * toxicity + 0.10 * diversity + 0.10 * readability + 0.10 * coherence + 0.10 * novelty + 0.10 * structure + 0.10 * factual_conflict + 0.075 * domain_balance + 0.075 * length_distribution
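The aggregation can be written directly from the formula. The weights below are copied from it and sum to 1.0, so ten sub-scores in 0–100 give an overall score in 0–100. The `WEIGHTS` table, the function signature, and the input validation are illustrative assumptions.

```python
# Weights taken from the formula in this section; the names mirror its terms.
WEIGHTS = {
    "redundancy": 0.15,
    "toxicity": 0.10,
    "diversity": 0.10,
    "readability": 0.10,
    "coherence": 0.10,
    "novelty": 0.10,
    "structure": 0.10,
    "factual_conflict": 0.10,
    "domain_balance": 0.075,
    "length_distribution": 0.075,
}


def overall_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of the ten sub-scores, each expected in [0, 100]."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= sub_scores[name] <= 100.0:
            raise ValueError(f"{name} must be within 0-100")
    return sum(weight * sub_scores[name] for name, weight in WEIGHTS.items())
```

Because redundancy carries the largest weight, a dataset full of near-duplicates loses more points than one that is weak on any other single metric.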

8. Report Generation

Output includes:

8.1 JSON Report

Contains:

- overall_score
- all sub-scores
- dataset metadata
- top detected issues
- summary of duplicates
- domain distribution histogram

8.2 Human-Readable Text Report

- simple explanations
- listed issues
- recommendations
- optional PDF
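As a sketch of what the JSON report from 8.1 might look like, the helper below assembles the listed fields and serializes them. Only `overall_score` is a name taken from this document; every other key name and the function `build_report` are hypothetical.

```python
import json


def build_report(overall, sub_scores, metadata, issues,
                 duplicate_summary=None, domain_histogram=None):
    """Serialize a report with the fields listed in 8.1 (key names assumed)."""
    report = {
        "overall_score": round(overall, 2),
        "sub_scores": {name: round(value, 2) for name, value in sub_scores.items()},
        "dataset_metadata": metadata,
        "top_issues": issues,
        "duplicate_summary": duplicate_summary or {},
        "domain_distribution": domain_histogram or {},
    }
    # sort_keys keeps the output stable, which makes reports easy to diff.
    return json.dumps(report, indent=2, sort_keys=True)
```

The human-readable text report could then be rendered from the same dictionary, so both outputs stay consistent.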

9. System Architecture

Components

- API Layer
  - file upload
  - async processing
  - report delivery
- Compute Engine
  - embeddings
  - scoring logic
  - batching
  - concurrency optimized
- Reference Store
  - FAISS/Qdrant
  - global novelty index
  - static
- Models Folder
  - toxicity classifier
  - contradiction/NLI model
  - small LLM for perplexity

10. Execution Flow Diagram

[Upload]
↓
[Preprocess]
↓
[Generate Local Embeddings]
↓
[Compute All Self-Contained Metrics]
↓
[Compare with Global Reference Embeddings]
↓
[Aggregate Scores]
↓
[Generate JSON + Text Report]
↓
[Return to User]
↓
[Delete all temp embeddings + data]
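One way to guarantee the final cleanup step is to keep all per-request files in a temporary directory that is removed when processing ends, even on errors. The sketch below assumes this approach. `process_request` is a hypothetical name, and the line counting is only a placeholder for the real pipeline.

```python
import tempfile
from pathlib import Path


def process_request(raw_bytes: bytes) -> dict:
    """Run a placeholder pipeline inside a directory that is always deleted."""
    with tempfile.TemporaryDirectory(prefix="dqs_") as workdir:
        upload_path = Path(workdir) / "upload.txt"
        upload_path.write_bytes(raw_bytes)

        # Placeholder for preprocess -> embed -> score: count non-empty lines.
        lines = upload_path.read_text(encoding="utf-8").splitlines()
        samples = [line for line in lines if line.strip()]

        # The path is included only so the caller can verify the cleanup.
        report = {"num_samples": len(samples), "workdir": workdir}
    # Leaving the `with` block removes the directory and everything in it.
    return report
```

Only the returned report outlives the request, which matches the flow above: the uploaded data and any intermediate files are gone once the function returns.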

11. Privacy Model

- No dataset stored after processing
- No embeddings stored
- Only the report is saved (optional)
- Global reference embeddings NEVER contain user data
- Fully GDPR-safe

12. MVP Boundary (Important)

Not included in v1:

- dataset cleaning
- dataset repair
- dataset marketplace
- collaborative annotation
- data augmentation
- agentic workflows

v1 stays focused on: analysis → scoring → reporting.

Project details

Release history

- 0.0.10: Apr 8, 2026
- 0.0.9: Apr 7, 2026
- 0.0.8: Apr 7, 2026 (this version)
- 0.0.7: Apr 5, 2026
- 0.dev4 (pre-release): Apr 5, 2026

Download files

Source Distribution

dsqus-0.0.8.tar.gz (4.4 kB)

Uploaded Apr 7, 2026 Source

Built Distribution

dsqus-0.0.8-py3-none-any.whl (4.7 kB)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file dsqus-0.0.8.tar.gz.

File metadata

Download URL: dsqus-0.0.8.tar.gz
Upload date: Apr 7, 2026
Size: 4.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for dsqus-0.0.8.tar.gz
Algorithm	Hash digest
SHA256	`a0535a3183fa98c5730d30291405d08631469e50c124297ad623a158e5060716`
MD5	`f669825a81453f5e690cfe1961497665`
BLAKE2b-256	`e0bbdcffe926e436ec4fd0e5b45f89685a58570cb4322d5927b7c68ba799c848`

File details

Details for the file dsqus-0.0.8-py3-none-any.whl.

File metadata

Download URL: dsqus-0.0.8-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 4.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for dsqus-0.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`371afb8194a6d420683761e320e5364de01d99f3345145e4b56bae6ca2231d9b`
MD5	`4d5d060438ed1c6ef37e6a4607ec482d`
BLAKE2b-256	`bdd496c0f07464b0cc22044b16561bdb79197c860f3c74b7eb408b5926b72550`
