
Samey

Dataset Diversity Scoring for synthetic instruction data (SFT/DPO)

Samey measures the diversity, repetition, templating, and topic coverage of text datasets. It is fast and designed to run on CPU.

Installation

pip install samey

Quickstart

from samey import Samey
import pandas as pd

data = pd.read_json("my_dataset.jsonl", lines=True)

model = Samey()
report = model.score(data, text="prompt", topic="category")
print(report.summary)

# OR as a single number: 
print("Diversity:", report.diversity_score['score'])

One-liner usage

import samey as sl

report = sl.score(df, text="prompt", topic="topic")
print(report.summary)

DPO datasets

report = sl.score_dpo(df, prompt="prompt", chosen="chosen", rejected="rejected")
print(report.to_markdown())

Multiple text columns

report = sl.score(df, text=["prompt", "response"])
report.to_json("diversity_report.json")

Metrics

Samey computes 8 metrics:

Metric              | What it measures                    | Healthy range
--------------------|-------------------------------------|------------------------------
Compression Ratio   | Global repetition via gzip          | 0.3-0.5
Near-Duplicate Rate | MinHash/LSH duplicates              | < 0.1
Template Dominance  | Skeleton detection                  | < 0.1 (top skeleton share)
N-gram Repetition   | Boilerplate via repeated 6-10 grams | < 0.2
Topic Coverage      | Topic entropy                       | > 0.8 (1 = uniform)
Style Diversity     | Char n-gram clustering              | < 0.2 (largest cluster)
Semantic Diversity  | Embedding-based concept spread      | > 0.5 (higher = more diverse)
Distinct-N          | Lexical diversity                   | > 0.5 for distinct-1/2/3

Configuration

model = Samey(
    length_mode="truncate",  # "truncate", "window", or "none"
    max_chars=512,
    shingle_size=5,
    lsh_threshold=0.85,
    max_sample=50_000,
    ngram_min=6,
    ngram_max=10,
    style_n_clusters=20,
    # Semantic diversity settings
    semantic_method="tfidf",  # "tfidf" (fast) or "embedding" (better)
    semantic_model="paraphrase-MiniLM-L3-v2",  # Only for method="embedding"
    semantic_max_sample=1000,
    enable_semantic=True,
    seed=42,
)

Report Object

report = model.score(df, text="prompt")

report.summary          # Key metrics dict
report.table            # pandas DataFrame
report.diversity_score  # Aggregated 0-100 score
report.print_score()    # Formatted score report
report.to_json("report.json")
report.to_markdown()

Aggregated Diversity Score

Get a single 0-100 score combining all metrics:

report = model.score(df, text="prompt")
report.print_score()

Output:

DIVERSITY SCORE: 97.8/100 (A)

Metric Breakdown (1.0 = best):
  compression_ratio              ██████████████████░░ 0.92 ✓
  near_duplicate_rate            ████████████████████ 1.00 ✓
  distinct_2                     ████████████████████ 1.00 ✓
  ...

✅ No significant issues detected!

Access programmatically:

ds = report.diversity_score
print(ds['score'])      # 97.8
print(ds['issues'])     # List of detected problems
print(ds['breakdown'])  # Per-metric normalized scores

Saving and Loading

model = Samey(max_chars=256, lsh_threshold=0.9)
model.save("my_config")

model = Samey.load("my_config")

How It Works

Compression Ratio

Concatenates all texts and computes gzip_bytes / raw_bytes. Repetitive content compresses better.
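The ratio can be sketched in a few lines with the standard library. This is illustrative only; Samey's exact preprocessing (separators, sampling) may differ:

```python
import gzip

def compression_ratio(texts):
    """gzip_bytes / raw_bytes over the concatenated corpus.
    Lower values mean more repetition (repetitive text compresses better)."""
    raw = "\n".join(texts).encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

repetitive = ["Write a poem about cats."] * 100
varied = [f"Question {i}: explain topic number {i * 7} in detail." for i in range(100)]
print(compression_ratio(repetitive))  # low: highly repetitive corpus
print(compression_ratio(varied))      # higher: more diverse corpus
```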

Near-Duplicate Rate

Uses character 5-gram shingles, MinHash signatures (128 perms), and LSH to find texts with Jaccard similarity >= 0.85.
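MinHash and LSH are approximations that make this tractable at scale; the underlying quantity is plain Jaccard similarity over character 5-gram shingles, which can be computed exactly for a single pair. A minimal sketch (not Samey's implementation):

```python
def shingles(text, k=5):
    """Set of character k-gram shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

a = "Please summarize the following article about climate change."
b = "Please summarize the following article about climate policy."
c = "Translate this sentence into French."
print(jaccard(a, b))  # high overlap: near-duplicate pair
print(jaccard(a, c))  # near zero: unrelated pair
```

MinHash replaces each shingle set with a short signature (128 permutations here) whose agreement rate estimates this Jaccard value, and LSH banding finds candidate pairs without comparing all O(n²) combinations.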

Template Dominance

"Skeletonizes" texts by replacing URLs, numbers, emails, code blocks, and quoted strings with placeholder tags, then measures how concentrated the resulting skeleton distribution is.
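A simplified skeletonizer might look like the following; the exact regexes and tag names are assumptions, not Samey's internals:

```python
import re
from collections import Counter

def skeletonize(text):
    """Replace variable content with placeholder tags, keeping the template shape."""
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"\S+@\S+\.\S+", "<EMAIL>", text)
    text = re.sub(r"\d+", "<NUM>", text)
    text = re.sub(r'"[^"]*"', "<QUOTE>", text)
    return text

texts = [
    'Summarize "the big article" in 3 sentences.',
    'Summarize "a short note" in 5 sentences.',
    'Translate the phrase into German.',
]
# Share of rows collapsing to the most common skeleton (template dominance)
skeletons = Counter(skeletonize(t) for t in texts)
top_share = skeletons.most_common(1)[0][1] / len(texts)
print(top_share)  # 2 of 3 rows share one skeleton
```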

N-gram Repetition

Finds word 6-10 grams appearing in 2+ different rows.
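The core bookkeeping is tracking which rows each n-gram occurs in; a sketch for a single n (Samey scans n = 6..10):

```python
from collections import defaultdict

def repeated_ngrams(rows, n=6):
    """Word n-grams that appear in at least 2 different rows (likely boilerplate)."""
    seen = defaultdict(set)
    for i, row in enumerate(rows):
        words = row.split()
        for j in range(len(words) - n + 1):
            seen[tuple(words[j:j + n])].add(i)
    return {gram for gram, row_ids in seen.items() if len(row_ids) >= 2}

rows = [
    "As an AI language model I cannot answer that question.",
    "As an AI language model I will gladly help with this.",
    "The capital of France is Paris.",
]
print(repeated_ngrams(rows))  # the shared 6-word opener
```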

Topic Coverage

Normalized entropy of topic labels (0 = one topic, 1 = uniform).
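Normalized entropy is Shannon entropy divided by log(k) for k distinct topics, so a uniform distribution scores 1 and a single topic scores 0. A self-contained sketch:

```python
import math
from collections import Counter

def topic_coverage(labels):
    """Shannon entropy of topic labels, normalized to [0, 1] by log(k)."""
    counts = Counter(labels)
    total = len(labels)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    k = len(counts)
    return h / math.log(k) if k > 1 else 0.0

print(topic_coverage(["math", "code", "history", "science"]))  # 1.0: uniform
print(topic_coverage(["math"] * 9 + ["code"]))                 # skewed, well below 1
```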

Style Diversity

Character 3-5 gram TF-IDF + MiniBatchKMeans clustering.
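The reported number is the share of rows in the largest style cluster. A sketch with scikit-learn, assuming a small `n_clusters` for illustration (Samey defaults to 20, per the configuration above):

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

def largest_style_cluster_share(texts, n_clusters=2, seed=42):
    """Cluster character 3-5 gram TF-IDF vectors; return the share of rows
    in the biggest cluster (high share = one dominant writing style)."""
    X = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).fit_transform(texts)
    labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed,
                             n_init=3).fit_predict(X)
    biggest = Counter(labels).most_common(1)[0][1]
    return biggest / len(texts)

texts = ["please write code"] * 8 + ["ONCE UPON A TIME!!", "whimsical fairy tale"]
print(largest_style_cluster_share(texts))  # expect one dominant cluster
```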

Distinct-N

unique_ngrams / total_ngrams for unigrams, bigrams, trigrams.
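A direct translation of the formula, pooling n-grams across the whole corpus:

```python
def distinct_n(texts, n):
    """unique n-grams / total n-grams across the corpus."""
    grams = []
    for t in texts:
        words = t.split()
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

texts = ["the cat sat", "the dog ran", "the cat sat"]
print(distinct_n(texts, 1))  # 5 unique words / 9 total
print(distinct_n(texts, 2))  # 4 unique bigrams / 6 total
```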

Semantic Diversity

Two methods available:

  • TF-IDF (default): Fast, uses word/bigram TF-IDF vectors. No extra dependencies.
  • Embedding: Uses paraphrase-MiniLM-L3-v2 sentence transformer. Better at catching paraphrases/synonyms, but slower.

# Fast TF-IDF (default)
model = Samey(semantic_method="tfidf")

# Embedding-based (needs sentence-transformers)
model = Samey(semantic_method="embedding")
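One way to turn TF-IDF vectors into a single diversity number is one minus the mean pairwise cosine similarity; this aggregation is an illustrative assumption, not necessarily Samey's exact formula:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_diversity_tfidf(texts):
    """1 - mean pairwise cosine similarity of word/bigram TF-IDF vectors."""
    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
    sims = cosine_similarity(X)
    n = len(texts)
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return 1.0 - off_diag.mean()

near_dupes = ["summarize this article", "summarize this article please"]
varied = ["summarize this article", "prove the triangle inequality"]
print(semantic_diversity_tfidf(near_dupes))  # low: near-paraphrases
print(semantic_diversity_tfidf(varied))      # high: unrelated concepts
```

The embedding method follows the same idea but replaces TF-IDF vectors with sentence-transformer embeddings, which also catch paraphrases that share no surface vocabulary.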
