
Samey

Dataset Diversity Scoring for synthetic instruction data (SFT/DPO)

Samey measures the diversity, repetition, templating, and topic coverage of text datasets. It is fast and designed to run on CPU.

Installation

pip install samey

Quickstart

from samey import Samey
import pandas as pd

data = pd.read_json("my_dataset.jsonl", lines=True)

model = Samey()
report = model.score(data, text="prompt", topic="category")
print(report.summary)

# OR as a single number: 
print("Diversity:", report.diversity_score['score'])

One-liner usage

import samey as sl

report = sl.score(df, text="prompt", topic="topic")
print(report.summary)

DPO datasets

report = sl.score_dpo(df, prompt="prompt", chosen="chosen", rejected="rejected")
print(report.to_markdown())

Multiple text columns

report = sl.score(df, text=["prompt", "response"])
report.to_json("diversity_report.json")

Metrics

Samey computes 8 metrics:

Metric              | What it measures                    | Healthy range
--------------------|-------------------------------------|------------------------------
Compression Ratio   | Global repetition via gzip          | 0.3-0.5
Near-Duplicate Rate | MinHash/LSH duplicates              | < 0.1
Template Dominance  | Skeleton detection                  | < 0.1 (top skeleton share)
N-gram Repetition   | Boilerplate via repeated 6-10 grams | < 0.2
Topic Coverage      | Topic entropy                       | > 0.8 (1 = uniform)
Style Diversity     | Char n-gram clustering              | < 0.2 (largest cluster)
Semantic Diversity  | Embedding-based concept spread      | > 0.5 (higher = more diverse)
Distinct-N          | Lexical diversity                   | > 0.5 for distinct-1/2/3

Configuration

model = Samey(
    length_mode="truncate",  # "truncate", "window", or "none"
    max_chars=512,
    shingle_size=5,
    lsh_threshold=0.85,
    max_sample=50_000,
    ngram_min=6,
    ngram_max=10,
    style_n_clusters=20,
    # Semantic diversity settings
    semantic_method="tfidf",  # "tfidf" (fast) or "embedding" (better)
    semantic_model="paraphrase-MiniLM-L3-v2",  # Only for method="embedding"
    semantic_max_sample=1000,
    enable_semantic=True,
    seed=42,
)

Report Object

report = model.score(df, text="prompt")

report.summary          # Key metrics dict
report.table            # pandas DataFrame
report.diversity_score  # Aggregated 0-100 score
report.print_score()    # Formatted score report
report.to_json("report.json")
report.to_markdown()

Aggregated Diversity Score

Get a single 0-100 score combining all metrics:

report = model.score(df, text="prompt")
report.print_score()

Output:

DIVERSITY SCORE: 97.8/100 (A)

Metric Breakdown (1.0 = best):
  compression_ratio              ██████████████████░░ 0.92 ✓
  near_duplicate_rate            ████████████████████ 1.00 ✓
  distinct_2                     ████████████████████ 1.00 ✓
  ...

✅ No significant issues detected!

Access programmatically:

ds = report.diversity_score
print(ds['score'])      # 97.8
print(ds['issues'])     # List of detected problems
print(ds['breakdown'])  # Per-metric normalized scores

Saving and Loading

model = Samey(max_chars=256, lsh_threshold=0.9)
model.save("my_config")

model = Samey.load("my_config")

How It Works

Compression Ratio

Concatenates all texts and computes gzip_bytes / raw_bytes. Repetitive content compresses better.
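The ratio can be sketched in a few lines with the standard library. This is illustrative only; Samey's exact preprocessing (separators, sampling) may differ:

```python
import gzip

def compression_ratio(texts):
    """gzip_bytes / raw_bytes over the concatenated corpus.
    Lower values mean more repetition (repetitive text compresses better)."""
    raw = "\n".join(texts).encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

repetitive = ["Write a poem about cats."] * 100
varied = [f"Question {i}: explain topic number {i * 7} in detail." for i in range(100)]
print(compression_ratio(repetitive))  # low: highly repetitive corpus
print(compression_ratio(varied))      # higher: more diverse corpus
```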

Near-Duplicate Rate

Uses character 5-gram shingles, MinHash signatures (128 perms), and LSH to find texts with Jaccard similarity >= 0.85.
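MinHash and LSH are approximations that make this tractable at scale; the underlying quantity is plain Jaccard similarity over character 5-gram shingles, which can be computed exactly for a single pair. A minimal sketch (not Samey's implementation):

```python
def shingles(text, k=5):
    """Set of character k-gram shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

a = "Please summarize the following article about climate change."
b = "Please summarize the following article about climate policy."
c = "Translate this sentence into French."
print(jaccard(a, b))  # high overlap: near-duplicate pair
print(jaccard(a, c))  # near zero: unrelated pair
```

MinHash replaces each shingle set with a short signature (128 permutations here) whose agreement rate estimates this Jaccard value, and LSH banding finds candidate pairs without comparing all O(n²) combinations.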

Template Dominance

"Skeletonizes" texts by replacing URLs, numbers, emails, code blocks, and quoted strings with placeholder tags, then measures how concentrated the resulting skeleton distribution is.
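A simplified skeletonizer might look like the following; the exact regexes and tag names are assumptions, not Samey's internals:

```python
import re
from collections import Counter

def skeletonize(text):
    """Replace variable content with placeholder tags, keeping the template shape."""
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"\S+@\S+\.\S+", "<EMAIL>", text)
    text = re.sub(r"\d+", "<NUM>", text)
    text = re.sub(r'"[^"]*"', "<QUOTE>", text)
    return text

texts = [
    'Summarize "the big article" in 3 sentences.',
    'Summarize "a short note" in 5 sentences.',
    'Translate the phrase into German.',
]
# Share of rows collapsing to the most common skeleton (template dominance)
skeletons = Counter(skeletonize(t) for t in texts)
top_share = skeletons.most_common(1)[0][1] / len(texts)
print(top_share)  # 2 of 3 rows share one skeleton
```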

N-gram Repetition

Finds word 6-10 grams appearing in 2+ different rows.
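The core bookkeeping is tracking which rows each n-gram occurs in; a sketch for a single n (Samey scans n = 6..10):

```python
from collections import defaultdict

def repeated_ngrams(rows, n=6):
    """Word n-grams that appear in at least 2 different rows (likely boilerplate)."""
    seen = defaultdict(set)
    for i, row in enumerate(rows):
        words = row.split()
        for j in range(len(words) - n + 1):
            seen[tuple(words[j:j + n])].add(i)
    return {gram for gram, row_ids in seen.items() if len(row_ids) >= 2}

rows = [
    "As an AI language model I cannot answer that question.",
    "As an AI language model I will gladly help with this.",
    "The capital of France is Paris.",
]
print(repeated_ngrams(rows))  # the shared 6-word opener
```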

Topic Coverage

Normalized entropy of topic labels (0 = one topic, 1 = uniform).
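Normalized entropy is Shannon entropy divided by log(k) for k distinct topics, so a uniform distribution scores 1 and a single topic scores 0. A self-contained sketch:

```python
import math
from collections import Counter

def topic_coverage(labels):
    """Shannon entropy of topic labels, normalized to [0, 1] by log(k)."""
    counts = Counter(labels)
    total = len(labels)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    k = len(counts)
    return h / math.log(k) if k > 1 else 0.0

print(topic_coverage(["math", "code", "history", "science"]))  # 1.0: uniform
print(topic_coverage(["math"] * 9 + ["code"]))                 # skewed, well below 1
```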

Style Diversity

Character 3-5 gram TF-IDF + MiniBatchKMeans clustering.
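The reported number is the share of rows in the largest style cluster. A sketch with scikit-learn, assuming a small `n_clusters` for illustration (Samey defaults to 20, per the configuration above):

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

def largest_style_cluster_share(texts, n_clusters=2, seed=42):
    """Cluster character 3-5 gram TF-IDF vectors; return the share of rows
    in the biggest cluster (high share = one dominant writing style)."""
    X = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).fit_transform(texts)
    labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed,
                             n_init=3).fit_predict(X)
    biggest = Counter(labels).most_common(1)[0][1]
    return biggest / len(texts)

texts = ["please write code"] * 8 + ["ONCE UPON A TIME!!", "whimsical fairy tale"]
print(largest_style_cluster_share(texts))  # expect one dominant cluster
```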

Distinct-N

unique_ngrams / total_ngrams for unigrams, bigrams, trigrams.
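A direct translation of the formula, pooling n-grams across the whole corpus:

```python
def distinct_n(texts, n):
    """unique n-grams / total n-grams across the corpus."""
    grams = []
    for t in texts:
        words = t.split()
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

texts = ["the cat sat", "the dog ran", "the cat sat"]
print(distinct_n(texts, 1))  # 5 unique words / 9 total
print(distinct_n(texts, 2))  # 4 unique bigrams / 6 total
```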

Semantic Diversity

Two methods available:

  • TF-IDF (default): Fast, uses word/bigram TF-IDF vectors. No extra dependencies.
  • Embedding: Uses paraphrase-MiniLM-L3-v2 sentence transformer. Better at catching paraphrases/synonyms, but slower.

# Fast TF-IDF (default)
model = Samey(semantic_method="tfidf")

# Embedding-based (needs sentence-transformers)
model = Samey(semantic_method="embedding")
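One way to turn TF-IDF vectors into a single diversity number is one minus the mean pairwise cosine similarity; this aggregation is an illustrative assumption, not necessarily Samey's exact formula:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_diversity_tfidf(texts):
    """1 - mean pairwise cosine similarity of word/bigram TF-IDF vectors."""
    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
    sims = cosine_similarity(X)
    n = len(texts)
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return 1.0 - off_diag.mean()

near_dupes = ["summarize this article", "summarize this article please"]
varied = ["summarize this article", "prove the triangle inequality"]
print(semantic_diversity_tfidf(near_dupes))  # low: near-paraphrases
print(semantic_diversity_tfidf(varied))      # high: unrelated concepts
```

The embedding method follows the same idea but replaces TF-IDF vectors with sentence-transformer embeddings, which also catch paraphrases that share no surface vocabulary.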
