
dsqus Python package

Project description

🧩 Dataset Quality Scoring Engine — System Framework (Markdown)

Local Development Split

The repository now separates responsibilities:

  • dsqus/engine: core Python package code (no FastAPI).
  • backend: FastAPI service that imports and calls the core package.
  • ui: browser frontend that uploads files to the backend.

Run Upload Flow End-to-End

From repository root:

python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r backend/requirements.txt
uvicorn backend.app:app --reload

From the ui directory, in another terminal:

npm install
npm run dev

Then open the Vite URL (typically http://localhost:5173) and upload a CSV/XLS/XLSX file.

1. Overview

The Dataset Quality Scoring Engine (DQS) evaluates the quality of any dataset using automated, model-agnostic metrics. The system processes user-uploaded datasets, computes embeddings, analyzes statistical and semantic properties, and outputs a standardized quality score (0–100) along with detailed submetrics.

2. High-Level Workflow

User Upload → Preprocessing → Embedding → Metric Computation → Scoring → Report Generation → Cleanup
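The stages above can be sketched as a simple pipeline. All function names and bodies here are illustrative placeholders, not the actual dsqus API:

```python
def preprocess_stage(samples):
    # Normalize whitespace and drop empty entries.
    return [" ".join(s.split()) for s in samples if s.strip()]

def embed_stage(samples):
    # Placeholder: real per-upload embeddings would come from a model.
    return [[float(len(s))] for s in samples]

def score_stage(samples, embeddings):
    # Placeholder composite score in the 0-100 range.
    return 100.0 if samples else 0.0

def run_pipeline(samples):
    clean = preprocess_stage(samples)
    vectors = embed_stage(clean)
    overall = score_stage(clean, vectors)
    report = {"overall_score": overall, "n_samples": len(clean)}
    # Cleanup: per-upload artifacts are discarded once the report exists.
    del vectors
    return report
```

The key design point is that every stage consumes only the previous stage's output, so per-upload data can be deleted as soon as the report is produced.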

3. Input Specifications

The system accepts:

  • jsonl
  • json
  • txt
  • csv
  • folder of text/code files
  • PDFs (extracted into text)

4. Preprocessing Pipeline

  • Validate file format
  • Convert to normalized internal format (list[str or dict])
  • Clean text:
      • remove control chars
      • normalize whitespace
      • optional: strip HTML/markup
  • Segment long documents into meaningful chunks
  • Remove empty or invalid samples

Output: clean, structured dataset
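A minimal sketch of the cleaning steps for plain-text samples (the regex rules here are illustrative, not the exact dsqus implementation):

```python
import re

def clean_text(text: str) -> str:
    """One possible cleaning pass: strip control chars, normalize whitespace."""
    # Remove ASCII control characters (tabs/newlines are handled by the
    # whitespace collapse below).
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Collapse every run of whitespace to a single space.
    return re.sub(r"\s+", " ", text).strip()

def preprocess(samples: list[str]) -> list[str]:
    cleaned = (clean_text(s) for s in samples)
    # Remove empty or invalid samples.
    return [s for s in cleaned if s]
```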

5. Embedding Generation

Two embedding flows:

5.1 Local Embeddings (Per Upload)

Used for:

  • redundancy
  • coherence
  • diversity
  • factual contradictions
  • clustering/domain analysis

These embeddings exist only for the request and are deleted afterward.

5.2 Global Reference Embeddings (Static)

Used only for novelty detection.

Pre-built FAISS/Vector DB containing ~1M representative samples:

  • Wikipedia
  • Common Crawl samples
  • C4 slices
  • StackOverflow
  • Books corpus
  • Public domain corpora

This is static, never modified by user uploads.
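A brute-force sketch of nearest-neighbor novelty against such a reference set (a real deployment would query the FAISS/Qdrant index instead of looping in Python):

```python
import math

def novelty_scores(upload, reference):
    """Distance from each uploaded vector to its nearest reference neighbor.

    Illustrates 'nearest neighbor distance = novelty measure': the farther an
    uploaded sample sits from everything in the static reference corpus, the
    more novel it is.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return [min(dist(u, r) for r in reference) for u in upload]
```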

6. Metric Computation

DQS computes 10 core quality metrics:

6.1 Redundancy Score

  • compute embedding similarity within the dataset
  • clustering density = redundancy
  • score = inverse redundancy

6.2 Malware / Toxicity Score

  • run samples through a pre-trained toxicity classifier
  • aggregate severity

6.3 Diversity Score

  • linguistic diversity (entropy, vocab richness)
  • semantic diversity (embedding variance)

6.4 Readability Score

  • Flesch–Kincaid
  • sentence complexity
  • coherency heuristics

6.5 Semantic Coherence

  • embedding flow consistency
  • perplexity using a small reference LLM

6.6 Novelty Score

  • compare against the global reference corpus
  • nearest neighbor distance = novelty measure

6.7 Structure Quality

Applicable to:

  • JSON
  • code
  • SQL
  • XML
  • YAML

Checks:

  • syntax validity
  • AST parsing success

6.8 Factual Conflict Score

  • sample random pairs
  • pass them to an NLI contradiction model
  • aggregate contradictions

6.9 Domain Balance Score

  • cluster dataset embeddings
  • measure cluster distribution via entropy

6.10 Length Distribution Score

  • detect outliers
  • analyze token distribution
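As one concrete example, metric 6.1 (redundancy as mean pairwise embedding similarity, inverted into a score) could be sketched like this; the clamping and scaling choices are assumptions, not the exact dsqus formula:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def redundancy_score(embeddings):
    """Mean pairwise cosine similarity, inverted onto a 0-100 scale.

    Identical vectors score 0 (fully redundant); orthogonal vectors score 100.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    redundancy = sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
    # score = inverse redundancy, clamped to the 0-100 range
    return max(0.0, min(100.0, 100.0 * (1.0 - redundancy)))
```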

7. Composite Score Calculation

All metrics normalized 0–100.

Weighted aggregation formula:

overall_score = 0.15*redundancy + 0.10*toxicity + 0.10*diversity + 0.10*readability + 0.10*coherence + 0.10*novelty + 0.10*structure + 0.10*factual_conflict + 0.075*domain_balance + 0.075*length_distribution
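The same aggregation written out in Python. The weights sum to 1.0, so if every sub-score is normalized to 0-100, the overall score stays in 0-100 as well:

```python
# Weights from the aggregation formula above.
WEIGHTS = {
    "redundancy": 0.15,
    "toxicity": 0.10,
    "diversity": 0.10,
    "readability": 0.10,
    "coherence": 0.10,
    "novelty": 0.10,
    "structure": 0.10,
    "factual_conflict": 0.10,
    "domain_balance": 0.075,
    "length_distribution": 0.075,
}

def overall_score(sub_scores: dict) -> float:
    """Weighted sum of the ten normalized (0-100) sub-scores."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
```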

8. Report Generation

Output includes:

8.1 JSON Report

Contains:

  • overall_score
  • all sub-scores
  • dataset metadata
  • top detected issues
  • summary of duplicates
  • domain distribution histogram

8.2 Human-Readable Text Report

  • simple explanations
  • listed issues
  • recommendations
  • optional PDF
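A JSON report following the 8.1 field list might look like the sketch below; all concrete values and key names are illustrative, and the exact schema dsqus emits may differ:

```python
import json

# Hypothetical example report; values are made up for illustration.
report = {
    "overall_score": 81.5,
    "sub_scores": {"redundancy": 90.0, "toxicity": 98.0, "diversity": 72.0},
    "dataset_metadata": {"n_samples": 12000, "format": "jsonl"},
    "top_issues": ["near-duplicate clusters", "short-sample outliers"],
    "duplicate_summary": {"duplicate_pairs": 340, "duplicate_ratio": 0.028},
    "domain_distribution": {"code": 0.4, "prose": 0.6},
}
print(json.dumps(report, indent=2))
```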

9. System Architecture

Components:

  • API Layer
      • file upload
      • async processing
      • report delivery
  • Compute Engine
      • embeddings
      • scoring logic
      • batching
      • concurrency optimized
  • Reference Store
      • FAISS/Qdrant global novelty index
      • static
  • Models Folder
      • toxicity classifier
      • contradiction/NLI model
      • small LLM for perplexity

10. Execution Flow Diagram

[Upload]
   ↓
[Preprocess]
   ↓
[Generate Local Embeddings]
   ↓
[Compute All Self-Contained Metrics]
   ↓
[Compare with Global Reference Embeddings]
   ↓
[Aggregate Scores]
   ↓
[Generate JSON + Text Report]
   ↓
[Return to User]
   ↓
[Delete all temp embeddings + data]

11. Privacy Model

  • No dataset stored after processing
  • No embeddings stored
  • Only the report is saved (optional)
  • Global reference embeddings NEVER contain user data
  • Fully GDPR-safe

12. MVP Boundary (Important)

Not included in v1:

  • dataset cleaning
  • dataset repair
  • dataset marketplace
  • collaborative annotation
  • data augmentation
  • agentic workflows

Version 1 stays laser-focused on: analysis → scoring → reporting.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dsqus-0.0.9.tar.gz (4.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dsqus-0.0.9-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file dsqus-0.0.9.tar.gz.

File metadata

  • Download URL: dsqus-0.0.9.tar.gz
  • Upload date:
  • Size: 4.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for dsqus-0.0.9.tar.gz
Algorithm Hash digest
SHA256 5e6d2cbfe7de0cd3e1e8a5e41d0fb4c05395fb0f557c47b1a95f64008229fb54
MD5 9babdc53936cc5597f681b7d27ac4c27
BLAKE2b-256 95918ed9cf448b25bbd9ff9ceedcbde1acd7d53f0b0741eab21a45db2bfd7cda


File details

Details for the file dsqus-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: dsqus-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for dsqus-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 eaa41795d9b6008590f13d113dffa4eb70d3bbc348d06cbe5ef9031038706bcc
MD5 fb8b75b5c18a758c9833578c975568a9
BLAKE2b-256 c1e46c31d408d7b603ba173d5fcb83fa86c7385bd8ac0785cc7f7b3f90b1398b

