dsqusss Python package

Project description

🧩 Dataset Quality Scoring Engine — System Framework (Markdown)

1. Overview

The Dataset Quality Scoring Engine (DQS) evaluates the quality of any dataset using automated, model-agnostic metrics. The system processes user-uploaded datasets, computes embeddings, analyzes statistical and semantic properties, and outputs a standardized quality score (0–100) along with detailed submetrics.

2. High-Level Workflow

User Upload → Preprocessing → Embedding → Metric Computation → Scoring → Report Generation → Cleanup

3. Input Specifications

The system accepts:

  • jsonl
  • json
  • txt
  • csv
  • folder of text/code files
  • PDFs (extracted into text)

4. Preprocessing Pipeline

  • Validate file format
  • Convert to normalized internal format (list[str or dict])
  • Clean text:
      • remove control characters
      • normalize whitespace
      • optional: strip HTML/markup
  • Segment long documents into meaningful chunks
  • Remove empty or invalid samples

Output: clean, structured dataset
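The cleaning and segmentation steps above can be sketched as follows. This is a minimal illustration, not the package's actual code: the function names (`clean_text`, `chunk_text`, `preprocess`) and the `max_words` parameter are hypothetical, and fixed-size word windows stand in for whatever "meaningful chunks" means in the real pipeline.

```python
import re
import unicodedata


def clean_text(text: str, strip_html: bool = False) -> str:
    """Remove control characters and normalize whitespace."""
    if strip_html:
        # Crude tag stripper for illustration; a real HTML parser is more robust.
        text = re.sub(r"<[^>]+>", " ", text)
    # Drop Unicode control characters (category "Cc") but keep whitespace,
    # so that newlines and tabs still separate words.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words words (assumed strategy)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def preprocess(samples: list[str], max_words: int = 200) -> list[str]:
    """Clean, segment, and filter raw samples into a flat list of chunks."""
    chunks: list[str] = []
    for sample in samples:
        cleaned = clean_text(sample, strip_html=True)
        if cleaned:  # skip empty or whitespace-only samples
            chunks.extend(chunk_text(cleaned, max_words))
    return chunks
```

Splitting on sentence or paragraph boundaries would likely produce more coherent chunks than fixed word windows; the word-based version is used here only because it needs no extra dependencies.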

5. Embedding Generation

Two embedding flows:

5.1 Local Embeddings (Per Upload)

Used for:

  • redundancy
  • coherence
  • diversity
  • factual contradictions
  • clustering/domain analysis

These embeddings exist only for the request and are deleted afterward.

5.2 Global Reference Embeddings (Static)

Used only for novelty detection.

Pre-built FAISS/Vector DB containing ~1M representative samples:

  • Wikipedia
  • Common Crawl samples
  • C4 slices
  • StackOverflow
  • Books corpus
  • Public domain corpora

This index is static and is never modified by user uploads.
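To make the novelty comparison concrete, here is a rough sketch of a nearest-neighbor lookup against a reference set. It assumes cosine distance and a brute-force scan in pure Python as a stand-in for the FAISS/vector DB index; the function names and the 0–100 scaling are illustrative assumptions, not the package's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(
    sample_vecs: list[list[float]],
    reference_vecs: list[list[float]],
) -> float:
    """Mean nearest-neighbor cosine distance to the reference set, scaled to 0-100."""
    if not sample_vecs or not reference_vecs:
        raise ValueError("both sample_vecs and reference_vecs must be non-empty")
    distances = []
    for vec in sample_vecs:
        # Brute-force scan; a vector index would replace this loop at scale.
        best_sim = max(cosine_similarity(vec, ref) for ref in reference_vecs)
        # Clamp to [0, 1] so negative similarities count as maximally novel.
        distances.append(1.0 - max(0.0, min(1.0, best_sim)))
    return 100.0 * sum(distances) / len(distances)
```

Because the reference vectors are only read and never written, a lookup of this shape is compatible with the requirement that the global index stays unmodified by uploads.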

6. Metric Computation

DQS computes 10 core quality metrics:

6.1 Redundancy Score

  • compute embedding similarity within dataset
  • clustering density = redundancy
  • score = inverse redundancy

6.2 Malware / Toxicity Score

  • run samples through pre-trained toxicity classifier
  • aggregate severity

6.3 Diversity Score

  • linguistic diversity (entropy, vocab richness)
  • semantic diversity (embedding variance)

6.4 Readability Score

  • Flesch–Kincaid
  • sentence complexity
  • coherency heuristics

6.5 Semantic Coherence

  • embedding flow consistency
  • perplexity using a small reference LLM

6.6 Novelty Score

  • compare against global reference corpus
  • nearest neighbor distance = novelty measure

6.7 Structure Quality

Applicable to:

  • JSON
  • code
  • SQL
  • XML
  • YAML

Checks:

  • syntax validity
  • AST parsing success

6.8 Factual Conflict Score

  • sample random pairs
  • pass to NLI contradiction model
  • aggregate contradictions

6.9 Domain Balance Score

  • cluster dataset embeddings
  • measure cluster distribution via entropy

6.10 Length Distribution Score

  • detect outliers
  • analyze token distribution
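Two of the metrics above are simple enough to sketch with the standard library alone: domain balance as normalized entropy over cluster labels (6.9), and structure quality as a parse success rate (6.7). The function names, the choice to score a single cluster as 0, and the restriction to JSON and Python source are assumptions made for illustration.

```python
import ast
import json
import math
from collections import Counter


def domain_balance_score(cluster_labels: list[int]) -> float:
    """Normalized Shannon entropy of cluster sizes, scaled to 0-100."""
    counts = Counter(cluster_labels)
    if len(counts) < 2:
        # Assumption: a single cluster (or no data) is treated as unbalanced.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by the maximum possible entropy for this number of clusters.
    return 100.0 * entropy / math.log(len(counts))


def structure_quality_score(samples: list[str], kind: str) -> float:
    """Percentage of samples that parse; kind is "json" or "python"."""
    parsers = {"json": json.loads, "python": ast.parse}
    parse = parsers[kind]
    if not samples:
        return 0.0
    valid = 0
    for sample in samples:
        try:
            parse(sample)
            valid += 1
        except (ValueError, SyntaxError):
            # json.JSONDecodeError is a ValueError; ast.parse raises SyntaxError.
            pass
    return 100.0 * valid / len(samples)
```

Evenly sized clusters score close to 100, while a dataset dominated by one cluster scores close to 0. SQL, XML, and YAML checks would need their own parsers and are left out here.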

7. Composite Score Calculation

All metrics normalized 0–100.

Weighted aggregation formula:

overall_score =
    0.15  * redundancy
  + 0.10  * toxicity
  + 0.10  * diversity
  + 0.10  * readability
  + 0.10  * coherence
  + 0.10  * novelty
  + 0.10  * structure
  + 0.10  * factual_conflict
  + 0.075 * domain_balance
  + 0.075 * length_distribution
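The weighted aggregation can be written directly from the formula. The weights below are copied from it and sum to 1.0; the `WEIGHTS` name, the `overall_score` function signature, and the input validation are assumptions added for this sketch.

```python
# Weights copied from the aggregation formula; they sum to 1.0.
WEIGHTS = {
    "redundancy": 0.15,
    "toxicity": 0.10,
    "diversity": 0.10,
    "readability": 0.10,
    "coherence": 0.10,
    "novelty": 0.10,
    "structure": 0.10,
    "factual_conflict": 0.10,
    "domain_balance": 0.075,
    "length_distribution": 0.075,
}


def overall_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of the ten sub-scores, each expected in [0, 100]."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= sub_scores[name] <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {sub_scores[name]}")
    return sum(weight * sub_scores[name] for name, weight in WEIGHTS.items())
```

Because the weights sum to 1.0, the result stays on the same 0–100 scale as the inputs: if every sub-score is 80, the overall score is 80 (up to floating-point rounding).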

8. Report Generation

Output includes:

8.1 JSON Report

Contains:

  • overall_score
  • all sub-scores
  • dataset metadata
  • top detected issues
  • summary of duplicates
  • domain distribution histogram

8.2 Human-Readable Text Report

  • simple explanations
  • listed issues
  • recommendations
  • optional PDF
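As an illustration of the JSON report in 8.1, the sketch below assembles a subset of the listed fields into a serialized document. Only `overall_score` is a name taken from the text; the other keys, the function name, and the `max_issues` cutoff are hypothetical, and the duplicate summary and domain histogram are omitted for brevity.

```python
import json


def build_json_report(
    overall: float,
    sub_scores: dict[str, float],
    metadata: dict,
    issues: list[str],
    max_issues: int = 5,
) -> str:
    """Serialize scores, metadata, and the top issues as a JSON string."""
    report = {
        "overall_score": round(overall, 2),
        "sub_scores": {name: round(value, 2) for name, value in sub_scores.items()},
        "metadata": metadata,
        # Assumes issues arrive already sorted by severity.
        "top_issues": issues[:max_issues],
    }
    return json.dumps(report, indent=2, sort_keys=True)
```

Sorting keys makes the output stable across runs, which helps when reports are compared or stored.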

9. System Architecture

Components

API Layer

  • file upload
  • async processing
  • report delivery

Compute Engine

  • embeddings
  • scoring logic
  • batching
  • concurrency optimized

Reference Store

  • FAISS/Qdrant
  • global novelty index
  • static

Models Folder

  • toxicity classifier
  • contradiction/NLI model
  • small LLM for perplexity

10. Execution Flow Diagram

[Upload]
   ↓
[Preprocess]
   ↓
[Generate Local Embeddings]
   ↓
[Compute All Self-Contained Metrics]
   ↓
[Compare with Global Reference Embeddings]
   ↓
[Aggregate Scores]
   ↓
[Generate JSON + Text Report]
   ↓
[Return to User]
   ↓
[Delete all temp embeddings + data]

11. Privacy Model

  • No dataset stored after processing
  • No embeddings stored
  • Only the report is saved (optional)
  • Global reference embeddings NEVER contain user data
  • Fully GDPR-safe
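One way to enforce the "nothing stored after processing" rule is to keep every per-request artifact inside a temporary directory that is removed when the request finishes. The sketch below assumes a hypothetical `analyze` callable that takes the path of the uploaded file and returns the report; it shows the cleanup pattern only and makes no claim about how the package itself handles storage.

```python
import tempfile
from pathlib import Path
from typing import Callable


def score_upload(raw_bytes: bytes, analyze: Callable[[Path], dict]) -> dict:
    """Run analyze on a temporary copy of the upload, then delete the copy."""
    with tempfile.TemporaryDirectory(prefix="dqs_") as workdir:
        upload_path = Path(workdir) / "upload.bin"
        upload_path.write_bytes(raw_bytes)
        # Only the returned report leaves this block. The directory and
        # everything written into it are removed on exit, even if analyze raises.
        return analyze(upload_path)
```

Local embeddings written under the same directory would be deleted together with the upload, so no separate cleanup step is needed for them.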

12. MVP Boundary (Important)

Not included in v1:

  • dataset cleaning
  • dataset repair
  • dataset marketplace
  • collaborative annotation
  • data augmentation
  • agentic workflows

The v1 scope stays focused on: analysis → scoring → reporting.

Download files

Source Distribution

dsqus-0.0.8.tar.gz (4.4 kB)

Uploaded Source

Built Distribution

dsqus-0.0.8-py3-none-any.whl (4.7 kB)

Uploaded Python 3

File details

Details for the file dsqus-0.0.8.tar.gz.

File metadata

  • Download URL: dsqus-0.0.8.tar.gz
  • Upload date:
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for dsqus-0.0.8.tar.gz
Algorithm Hash digest
SHA256 a0535a3183fa98c5730d30291405d08631469e50c124297ad623a158e5060716
MD5 f669825a81453f5e690cfe1961497665
BLAKE2b-256 e0bbdcffe926e436ec4fd0e5b45f89685a58570cb4322d5927b7c68ba799c848

File details

Details for the file dsqus-0.0.8-py3-none-any.whl.

File metadata

  • Download URL: dsqus-0.0.8-py3-none-any.whl
  • Upload date:
  • Size: 4.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for dsqus-0.0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 371afb8194a6d420683761e320e5364de01d99f3345145e4b56bae6ca2231d9b
MD5 4d5d060438ed1c6ef37e6a4607ec482d
BLAKE2b-256 bdd496c0f07464b0cc22044b16561bdb79197c860f3c74b7eb408b5926b72550
