Samey
Dataset Diversity Scoring
for synthetic instruction data (SFT/DPO)
Samey measures the diversity, repetition, templating, and topic coverage of text datasets. It is fast and designed to run on CPU.
Installation
pip install samey
Quickstart
from samey import Samey
import pandas as pd
data = pd.read_json("my_dataset.jsonl", lines=True)
model = Samey()
report = model.score(data, text="prompt", topic="category")
print(report.summary)
# OR as a single number:
print("Diversity:", report.diversity_score['score'])
One-liner usage
import samey as sl
# df is a pandas DataFrame, loaded as in the Quickstart
report = sl.score(df, text="prompt", topic="topic")
print(report.summary)
DPO datasets
report = sl.score_dpo(df, prompt="prompt", chosen="chosen", rejected="rejected")
print(report.to_markdown())
Multiple text columns
report = sl.score(df, text=["prompt", "response"])
report.to_json("diversity_report.json")
Metrics
Samey computes 8 metrics:
| Metric | What it measures | Healthy range |
|---|---|---|
| Compression Ratio | Global repetition via gzip | 0.3-0.5 |
| Near-Duplicate Rate | MinHash/LSH duplicates | < 0.1 |
| Template Dominance | Skeleton detection | < 0.1 (top skeleton share) |
| N-gram Repetition | Boilerplate via repeated word 6- to 10-grams | < 0.2 |
| Topic Coverage | Topic entropy | > 0.8 (1=uniform) |
| Style Diversity | Char n-gram clustering | < 0.2 (largest cluster share) |
| Semantic Diversity | Embedding-based concept spread | > 0.5 (higher=more diverse) |
| Distinct-N | Lexical diversity | > 0.5 for distinct-1/2/3 |
Configuration
model = Samey(
    length_mode="truncate",  # "truncate", "window", or "none"
    max_chars=512,
    shingle_size=5,
    lsh_threshold=0.85,
    max_sample=50_000,
    ngram_min=6,
    ngram_max=10,
    style_n_clusters=20,
    # Semantic diversity settings
    semantic_method="tfidf",  # "tfidf" (fast) or "embedding" (better)
    semantic_model="paraphrase-MiniLM-L3-v2",  # Only used when semantic_method="embedding"
    semantic_max_sample=1000,
    enable_semantic=True,
    seed=42,
)
Report Object
report = model.score(df, text="prompt")
report.summary # Key metrics dict
report.table # pandas DataFrame
report.diversity_score # Aggregated 0-100 score
report.print_score() # Formatted score report
report.to_json("report.json")
report.to_markdown()
Aggregated Diversity Score
Get a single 0-100 score combining all metrics:
report = model.score(df, text="prompt")
report.print_score()
Output:
DIVERSITY SCORE: 97.8/100 (A)
Metric Breakdown (1.0 = best):
compression_ratio ██████████████████░░ 0.92 ✓
near_duplicate_rate ████████████████████ 1.00 ✓
distinct_2 ████████████████████ 1.00 ✓
...
✅ No significant issues detected!
Access programmatically:
ds = report.diversity_score
print(ds['score']) # 97.8
print(ds['issues']) # List of detected problems
print(ds['breakdown']) # Per-metric normalized scores
Saving and Loading
model = Samey(max_chars=256, lsh_threshold=0.9)
model.save("my_config")
model = Samey.load("my_config")
How It Works
Compression Ratio
Concatenates all texts and computes gzip_bytes / raw_bytes. Repetitive content compresses better, so a lower ratio indicates more repetition.
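The ratio can be reproduced with the standard library alone. The following is a minimal sketch, not Samey's actual implementation: the function name `compression_ratio`, the newline joiner, and the `0.0` result for empty input are assumptions made for illustration.

```python
import gzip


def compression_ratio(texts):
    """Return gzip_bytes / raw_bytes for the concatenated texts.

    Assumptions (not taken from Samey): texts are joined with a newline,
    and empty input yields 0.0.
    """
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)


# A highly repetitive dataset compresses far better than a varied one.
repetitive = ["Write a poem about cats."] * 200
varied = [
    f"Task {i}: explain topic number {i * 7919 % 1013} in {i % 13 + 1} steps."
    for i in range(200)
]
print("repetitive:", round(compression_ratio(repetitive), 3))
print("varied:    ", round(compression_ratio(varied), 3))
```

On very short inputs the fixed gzip header can push the ratio above 1, so the number is most meaningful on datasets of at least a few kilobytes.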
Near-Duplicate Rate
Uses character 5-gram shingles, MinHash signatures (128 permutations), and LSH to find texts whose estimated Jaccard similarity is >= 0.85.
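To make the pipeline concrete, here is a small pure-Python sketch of shingling and MinHash. It is illustrative only: it compares every pair of signatures instead of using LSH bucketing, and the helper names, the CRC32 base hash, and the definition of the rate (share of rows with at least one near-duplicate partner) are assumptions, not Samey's internals.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # large prime for the universal hash family


def shingles(text, k=5):
    """Set of character k-grams (the whole text if it is shorter than k)."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(shingle_set, coeffs):
    """One minimum per hash function (a * h + b) mod PRIME."""
    base = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * h + b) % PRIME for h in base) for a, b in coeffs]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicate_rate(texts, threshold=0.85, num_perm=128, seed=42):
    """Share of rows with at least one near-duplicate (assumed definition).

    Brute-force pairwise comparison stands in for LSH; fine for small samples.
    """
    if not texts:
        return 0.0
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_perm)]
    sigs = [minhash_signature(shingles(t), coeffs) for t in texts]
    flagged = set()
    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
                flagged.update((i, j))
    return len(flagged) / len(texts)


prompts = [
    "Explain how photosynthesis works in plants.",
    "Explain how photosynthesis works in plants.",
    "Write a haiku about the ocean at dawn.",
]
print(near_duplicate_rate(prompts))
```

LSH exists to avoid the quadratic loop above: it hashes bands of each signature into buckets so that only texts sharing a bucket are compared.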
Template Dominance
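"Skeletonizes" each text by replacing URLs, numbers, emails, code blocks, and quoted strings with placeholder tags, then measures how the resulting skeletons are distributed. The reported value is the share of the most common skeleton.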
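A rough sketch of skeletonization with regular expressions follows. The patterns, the tag names (`<CODE>`, `<URL>`, and so on), and the function names are assumptions chosen for illustration; Samey's actual rules may differ.

```python
import re
from collections import Counter

# Order matters: code blocks and URLs are replaced before numbers so that
# digits inside them do not produce stray <NUM> tags.
_RULES = [
    (re.compile(r"```.*?```", re.DOTALL), "<CODE>"),
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "<EMAIL>"),
    (re.compile(r'"[^"\n]*"'), "<QUOTE>"),
    (re.compile(r"\d+(?:\.\d+)?"), "<NUM>"),
]


def skeletonize(text):
    """Replace variable content with tags and normalize whitespace."""
    for pattern, tag in _RULES:
        text = pattern.sub(tag, text)
    return " ".join(text.split())


def template_dominance(texts):
    """Share of rows that map to the most common skeleton."""
    if not texts:
        return 0.0
    counts = Counter(skeletonize(t) for t in texts)
    return counts.most_common(1)[0][1] / len(texts)


prompts = [
    'Translate "hello" into 3 languages',
    'Translate "goodbye" into 12 languages',
    "Summarize this article.",
]
print(skeletonize(prompts[0]))
print(template_dominance(prompts))
```

The first two prompts differ only in their quoted string and number, so they collapse to the same skeleton, which is exactly the kind of templating this metric is meant to surface.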
N-gram Repetition
Finds word 6- to 10-grams that appear in two or more different rows.
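The idea can be sketched as below. How Samey turns the repeated n-grams into a single number is not stated here, so this sketch assumes the metric is the share of rows containing at least one n-gram that also occurs in another row; whitespace tokenization and the function names are likewise assumptions.

```python
from collections import defaultdict


def word_ngrams(text, n_min=6, n_max=10):
    """Set of word n-grams for every n in [n_min, n_max]."""
    words = text.lower().split()
    return {
        tuple(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    }


def ngram_repetition(texts, n_min=6, n_max=10):
    """Share of rows that share at least one n-gram with a different row."""
    if not texts:
        return 0.0
    rows_by_ngram = defaultdict(set)
    for row, text in enumerate(texts):
        for gram in word_ngrams(text, n_min, n_max):
            rows_by_ngram[gram].add(row)
    repeated_rows = set()
    for rows in rows_by_ngram.values():
        if len(rows) >= 2:  # n-gram seen in two or more different rows
            repeated_rows |= rows
    return len(repeated_rows) / len(texts)


prompts = [
    "You are a helpful assistant. Please answer the question about cats.",
    "You are a helpful assistant. Please summarize this.",
    "Tell me a joke.",
]
print(ngram_repetition(prompts))
```

Using sets per row means an n-gram repeated inside a single row does not count; only boilerplate shared across rows is flagged.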
Topic Coverage
Normalized entropy of topic labels (0 = one topic, 1 = uniform).
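As a worked example, normalized entropy can be computed in a few lines. This is a sketch under one assumption: entropy is normalized by the log of the number of topics observed in the data, which may differ from how Samey normalizes.

```python
import math
from collections import Counter


def topic_coverage(labels):
    """Shannon entropy of the label distribution divided by its maximum.

    Returns 0.0 when there is one topic (or none) and 1.0 when all observed
    topics are equally frequent.
    """
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


print(topic_coverage(["math"] * 10))                    # single topic
print(topic_coverage(["math", "code", "law", "bio"]))   # uniform
print(topic_coverage(["math"] * 9 + ["code"]))          # skewed
```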
Style Diversity
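Builds TF-IDF vectors over character 3- to 5-grams and clusters them with MiniBatchKMeans. The metric is the share of texts that fall into the largest cluster.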
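A minimal sketch of this step with scikit-learn might look like the following. It assumes scikit-learn is installed; the function name, the cap on the cluster count for small inputs, and the `n_init` value are illustrative choices rather than Samey's settings.

```python
from collections import Counter

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def largest_style_cluster_share(texts, n_clusters=20, seed=42):
    """Cluster char 3-5 gram TF-IDF vectors; return the biggest cluster's share."""
    # Never ask for more clusters than there are rows (assumption for small data).
    k = min(n_clusters, len(texts))
    vectors = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).fit_transform(texts)
    labels = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3).fit_predict(vectors)
    return Counter(labels.tolist()).most_common(1)[0][1] / len(texts)


prompts = [
    "Please write a short poem about the sea.",
    "Please write a short poem about the sky.",
    "Please write a short poem about the sun.",
    "def add(a, b): return a + b  # explain this code",
    "SELECT name FROM users WHERE id = 7; -- explain",
    "What is the capital of France?",
]
print(largest_style_cluster_share(prompts, n_clusters=3))
```

Character n-grams capture surface style (punctuation, casing, phrasing habits) rather than meaning, which is why this metric is reported separately from semantic diversity.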
Distinct-N
Computes unique_ngrams / total_ngrams for unigrams, bigrams, and trigrams.
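The formula translates directly into code. In this sketch, lowercasing, whitespace tokenization, and pooling n-grams across all rows are assumptions; Samey may tokenize differently.

```python
def distinct_n(texts, n):
    """unique_ngrams / total_ngrams over word n-grams pooled across all texts."""
    total = 0
    unique = set()
    for text in texts:
        words = text.lower().split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


prompts = ["the cat sat on the mat", "the dog sat on the rug"]
for n in (1, 2, 3):
    print(f"distinct-{n}:", round(distinct_n(prompts, n), 3))
```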
Semantic Diversity
Two methods available:
- TF-IDF (default): Fast, uses word/bigram TF-IDF vectors. No extra dependencies.
- Embedding: Uses the paraphrase-MiniLM-L3-v2 sentence transformer. Better at catching paraphrases/synonyms, but slower.
# Fast TF-IDF (default)
model = Samey(semantic_method="tfidf")
# Embedding-based (needs sentence-transformers)
model = Samey(semantic_method="embedding")
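To illustrate the TF-IDF method, the sketch below builds word and bigram TF-IDF vectors in pure Python and reports the mean pairwise cosine distance. The choice of mean pairwise distance as the spread measure, the smoothed IDF formula, and all function names are assumptions; the source does not specify how Samey aggregates the vectors into a score.

```python
import math
from collections import Counter
from itertools import combinations


def _terms(text):
    """Unigrams plus bigrams from whitespace-tokenized, lowercased text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(texts):
    """L2-normalized sparse TF-IDF vectors with smoothed IDF."""
    docs = [Counter(_terms(t)) for t in texts]
    doc_freq = Counter(term for doc in docs for term in doc)
    n = len(docs)
    vectors = []
    for doc in docs:
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
               for t, tf in doc.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def _cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def semantic_diversity(texts):
    """Mean pairwise cosine distance (0 = identical, 1 = no shared terms)."""
    pairs = list(combinations(tfidf_vectors(texts), 2))
    if not pairs:
        return 0.0
    return sum(1.0 - _cosine(a, b) for a, b in pairs) / len(pairs)


prompts = [
    "Explain how neural networks learn.",
    "Describe how neural networks are trained.",
    "Give me a recipe for lemon cake.",
]
print(round(semantic_diversity(prompts), 3))
```

Because TF-IDF only sees shared words, two paraphrases with different vocabulary look unrelated here; that gap is what the embedding method is meant to close.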