dsqus Python package
Project description
🧩 Dataset Quality Scoring Engine (DQS): System Framework
1. Overview
The Dataset Quality Scoring Engine (DQS) evaluates the quality of any dataset using automated, model-agnostic metrics. The system processes user-uploaded datasets, computes embeddings, analyzes statistical and semantic properties, and outputs a standardized quality score (0–100) along with detailed submetrics.
2. High-Level Workflow
User Upload → Preprocessing → Embedding → Metric Computation → Scoring → Report Generation → Cleanup
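The stages above can be sketched as a thin orchestration function. Everything here is illustrative — the function names, the stand-in embedding, and the stand-in metric are hypothetical and not the published dsqus API:

```python
# Hypothetical sketch of the DQS pipeline stages; real embeddings and
# metrics (sections 5-6) would replace the stand-ins below.

def preprocess(raw_samples):
    # Preprocessing: strip whitespace, drop empty samples
    return [s.strip() for s in raw_samples if s and s.strip()]

def embed(samples):
    # Stand-in embedding; a real system uses a sentence-embedding model
    return [[float(len(s))] for s in samples]

def compute_metrics(samples, embeddings):
    # Stand-in metric: fraction of unique samples, scaled to 0-100
    unique_ratio = len(set(samples)) / len(samples)
    return {"redundancy": 100.0 * unique_ratio}

def run_pipeline(raw_samples):
    samples = preprocess(raw_samples)
    embeddings = embed(samples)  # local embeddings, discarded after use
    metrics = compute_metrics(samples, embeddings)
    overall = sum(metrics.values()) / len(metrics)
    return {"overall_score": overall, "sub_scores": metrics}
```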
3. Input Specifications
The system accepts:
- JSONL
- JSON
- TXT
- CSV
- folder of text/code files
- PDFs (extracted into text)
4. Preprocessing Pipeline
- Validate file format
- Convert to normalized internal format (`list[str or dict]`)
- Clean text:
  - remove control chars
  - normalize whitespace
  - optional: strip HTML/markup
- Segment long documents into meaningful chunks
- Remove empty or invalid samples
Output: clean, structured dataset
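A minimal sketch of the cleaning steps above, assuming plain-text samples; `clean_text` and `preprocess` are illustrative names, not the shipped API:

```python
import re

# Control characters (minus \t and \n, which whitespace normalization handles)
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_text(text: str) -> str:
    """Remove control characters and normalize whitespace."""
    text = CONTROL_CHARS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(samples):
    """Clean each sample and drop the ones that end up empty."""
    cleaned = (clean_text(s) for s in samples)
    return [s for s in cleaned if s]
```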
5. Embedding Generation
Two embedding flows:
5.1 Local Embeddings (Per Upload)
Used for:
- redundancy
- coherence
- diversity
- factual contradictions
- clustering/domain analysis
These embeddings exist only for the request and are deleted afterward.
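Several of the metrics above reduce to pairwise cosine similarity over the per-upload embeddings. A self-contained NumPy sketch (the embedding matrix itself would come from whatever sentence-embedding model the engine loads):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for an (n_samples, dim) embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard zero vectors
    return unit @ unit.T
```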
5.2 Global Reference Embeddings (Static)
Used only for novelty detection.
Pre-built FAISS/Vector DB containing ~1M representative samples:
- Wikipedia
- Common Crawl samples
- C4 slices
- StackOverflow
- Books corpus
- Public domain corpora
This is static, never modified by user uploads.
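Novelty detection boils down to nearest-neighbor distance against this static index. A brute-force NumPy sketch of that lookup — at the ~1M-sample scale above, the FAISS/vector-DB index would replace this exhaustive search, and `novelty_scores` is a hypothetical name:

```python
import numpy as np

def novelty_scores(local: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Distance from each local embedding to its nearest reference neighbor.

    Larger distance = more novel. Brute force for illustration only.
    """
    # (n_local, n_ref) L2 distance matrix via broadcasting
    diff = local[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diff, axis=2)
    return dists.min(axis=1)
```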
6. Metric Computation
DQS computes 10 core quality metrics:
6.1 Redundancy Score
- compute embedding similarity within the dataset
- clustering density = redundancy
- score = inverse redundancy

6.2 Malware / Toxicity Score
- run samples through a pre-trained toxicity classifier
- aggregate severity

6.3 Diversity Score
- linguistic diversity (entropy, vocabulary richness)
- semantic diversity (embedding variance)

6.4 Readability Score
- Flesch–Kincaid
- sentence complexity
- coherency heuristics

6.5 Semantic Coherence
- embedding flow consistency
- perplexity using a small reference LLM

6.6 Novelty Score
- compare against the global reference corpus
- nearest-neighbor distance = novelty measure

6.7 Structure Quality

Applicable to:
- JSON
- code
- SQL
- XML
- YAML

Checks:
- syntax validity
- AST parsing success

6.8 Factual Conflict Score
- sample random pairs
- pass to an NLI contradiction model
- aggregate contradictions

6.9 Domain Balance Score
- cluster dataset embeddings
- measure cluster distribution via entropy

6.10 Length Distribution Score
- detect outliers
- analyze token distribution
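As one concrete example, metric 6.1 (similarity within the dataset, inverted into a score) can be sketched as below. The threshold and scaling are assumptions for illustration, not the shipped implementation:

```python
import numpy as np

def redundancy_score(embeddings: np.ndarray, threshold: float = 0.9) -> float:
    """Fraction of near-duplicate pairs (cosine similarity > threshold),
    inverted onto the 0-100 scale: higher score = less redundant."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Upper triangle excludes self-similarity and double counting
    iu = np.triu_indices(len(embeddings), k=1)
    duplicate_fraction = float((sim[iu] > threshold).mean())
    return 100.0 * (1.0 - duplicate_fraction)
```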
7. Composite Score Calculation
All metrics normalized 0–100.
Weighted aggregation formula:
overall_score = 0.15 × redundancy + 0.10 × toxicity + 0.10 × diversity + 0.10 × readability + 0.10 × coherence + 0.10 × novelty + 0.10 × structure + 0.10 × factual_conflict + 0.075 × domain_balance + 0.075 × length_distribution

(The weights sum to 1.0, so the overall score stays on the 0–100 scale.)
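The aggregation is a straightforward weighted sum; a sketch with the weights from the formula above (`overall_score` is an illustrative name):

```python
# Weights from section 7; they sum to 1.0
WEIGHTS = {
    "redundancy": 0.15, "toxicity": 0.10, "diversity": 0.10,
    "readability": 0.10, "coherence": 0.10, "novelty": 0.10,
    "structure": 0.10, "factual_conflict": 0.10,
    "domain_balance": 0.075, "length_distribution": 0.075,
}

def overall_score(sub_scores: dict) -> float:
    """Weighted aggregation of the ten 0-100 submetrics."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
```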
8. Report Generation
Output includes:
8.1 JSON Report
Contains:
- overall_score
- all sub-scores
- dataset metadata
- top detected issues
- summary of duplicates
- domain distribution histogram

8.2 Human-Readable Text Report
- simple explanations
- listed issues
- recommendations
- optional PDF
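A sketch of assembling the JSON report from 8.1 — the field names mirror the bullet list but are illustrative, and the real schema may differ:

```python
import json

def build_report(overall, sub_scores, metadata, issues):
    """Serialize the quality results into the JSON report shape of 8.1."""
    report = {
        "overall_score": round(overall, 2),
        "sub_scores": sub_scores,
        "dataset_metadata": metadata,
        "top_issues": issues,
    }
    return json.dumps(report, indent=2)
```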
9. System Architecture
Components:

- API Layer
  - file upload
  - async processing
  - report delivery
- Compute Engine
  - embeddings
  - scoring logic
  - batching, concurrency-optimized
- Reference Store
  - FAISS/Qdrant global novelty index (static)
- Models Folder
  - toxicity classifier
  - contradiction/NLI model
  - small LLM for perplexity
10. Execution Flow Diagram
[Upload]
↓
[Preprocess]
↓
[Generate Local Embeddings]
↓
[Compute All Self-Contained Metrics]
↓
[Compare with Global Reference Embeddings]
↓
[Aggregate Scores]
↓
[Generate JSON + Text Report]
↓
[Return to User]
↓
[Delete all temp embeddings + data]
11. Privacy Model
- No dataset stored after processing
- No embeddings stored
- Only the report is saved (optional)
- Global reference embeddings NEVER contain user data
- Fully GDPR-safe
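One common way to enforce "nothing persists" is to confine every intermediate artifact to a temporary directory whose lifetime is the request. A sketch assuming that approach (`process_upload` is a hypothetical name, and the stand-in body is not the real pipeline):

```python
import os
import pathlib
import tempfile

def process_upload(payload: bytes) -> dict:
    """Run processing inside a TemporaryDirectory; everything written
    under it is deleted when the with-block exits, so no user data
    outlives the request."""
    with tempfile.TemporaryDirectory() as workdir:
        path = pathlib.Path(workdir) / "upload.bin"
        path.write_bytes(payload)
        # ...full pipeline would run here; stand-in result:
        report = {"bytes_processed": path.stat().st_size, "workdir": workdir}
    # workdir and all its contents no longer exist at this point
    return report
```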
12. MVP Boundary (Important)
Not included in v1:
dataset cleaning dataset repair dataset marketplace collaborative annotation data augmentation agentic workflows
v1 stays laser-focused on: analysis → scoring → reporting.
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file dsqus-0.0.8.tar.gz.
File metadata
- Download URL: dsqus-0.0.8.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `a0535a3183fa98c5730d30291405d08631469e50c124297ad623a158e5060716` |
| MD5 | `f669825a81453f5e690cfe1961497665` |
| BLAKE2b-256 | `e0bbdcffe926e436ec4fd0e5b45f89685a58570cb4322d5927b7c68ba799c848` |
File details
Details for the file dsqus-0.0.8-py3-none-any.whl.
File metadata
- Download URL: dsqus-0.0.8-py3-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `371afb8194a6d420683761e320e5364de01d99f3345145e4b56bae6ca2231d9b` |
| MD5 | `4d5d060438ed1c6ef37e6a4607ec482d` |
| BLAKE2b-256 | `bdd496c0f07464b0cc22044b16561bdb79197c860f3c74b7eb408b5926b72550` |