LLM-powered synthetic text data generation for text classification tasks, with multi-strategy generation, multilingual support, and quality filtering.

synthetictext

LLM-powered synthetic text data generation for text classification tasks.

synthetictext generates high-quality synthetic training data for any text classification task across multiple languages. It provides five generation strategies, a multi-stage quality filtering pipeline, and a simple Python API.

Features

Task-agnostic: Define any binary or multi-class text classification task via a TaskSpec
5 generation strategies: Direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
Multi-stage quality filtering: Basic validity checks, label leakage detection, embedding-based near-duplicate removal, LLM-as-judge, and keyword marker checks
Multilingual: Generate data in any language supported by your LLM provider, with optional cross-lingual transfer for low-resource languages
Provider-agnostic: Built-in support for OpenAI; extensible via BaseLLMProvider and BaseTranslationProvider interfaces
CLI and Python API: Use from scripts or the command line

Installation

# Core (no LLM provider included)
pip install synthetictext

# With OpenAI support (most common); quotes keep shells such as zsh from expanding the brackets
pip install "synthetictext[openai]"

# With embedding-based deduplication
pip install "synthetictext[embeddings]"

# With Google Cloud Translation (for backtranslation/pivot strategies)
pip install "synthetictext[google-translate]"

# With YAML config file support for the CLI
pip install "synthetictext[yaml]"

# Everything
pip install "synthetictext[all]"

Authentication

The library needs an API key for the LLM provider. There are three ways to provide it:

Option 1: Environment variable (recommended)

export OPENAI_API_KEY="sk-..."

Then pass the provider name as a string; the OpenAI SDK reads the environment variable automatically:

from synthetictext import SyntheticDataGenerator
generator = SyntheticDataGenerator(task=task, llm_provider="openai")  # task: a TaskSpec, see Quick Start

Option 2: Explicit API key

generator = SyntheticDataGenerator(
    task=task,
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    api_key="sk-...",
)

Option 3: Direct provider construction (full control)

from synthetictext import SyntheticDataGenerator
from synthetictext.providers.openai_provider import OpenAIProvider

provider = OpenAIProvider(api_key="sk-...", default_model="gpt-4o")
generator = SyntheticDataGenerator(task=task, llm_provider=provider)

The model defaults to gpt-4o-mini but can be overridden via llm_model or --model on the CLI.
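
When combining Options 1 and 2, it can help to fail early if no key is available rather than on the first API call. The following is a minimal sketch of that check. `resolve_api_key` is a hypothetical helper and not part of synthetictext, and the precedence it applies (explicit key first, then the environment variable) is an assumption.

```python
import os


def resolve_api_key(explicit_key=None, env_var="OPENAI_API_KEY"):
    """Return the explicit key if given, else the environment variable's value.

    Hypothetical helper: raises a clear error early instead of failing
    on the first API call.
    """
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: pass one explicitly or set {env_var}."
        )
    return key
```

The returned value could then be passed as `api_key=` in the Option 2 constructor call.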

Quick Start

from synthetictext import TaskSpec, SyntheticDataGenerator

# 1. Define your classification task
task = TaskSpec(
    name="Sentiment Analysis",
    labels={0: "negative", 1: "positive"},
    description="Classify product reviews as positive or negative sentiment.",
    label_descriptions={
        0: "A review expressing dissatisfaction, criticism, or negative experience.",
        1: "A review expressing satisfaction, praise, or positive experience.",
    },
    topics={
        "electronics": ["smartphones", "laptops", "headphones"],
        "food": ["restaurants", "delivery", "recipes"],
    },
)

# 2. Create a generator
generator = SyntheticDataGenerator(
    task=task,
    llm_provider="openai",  # uses OPENAI_API_KEY env var
    llm_model="gpt-4o-mini",
)

# 3. Generate data
df = generator.generate(
    language="English",
    num_samples=1000,
    # Per the Generation Strategies table, "paraphrase" also requires training data.
    strategies=["direct", "paraphrase", "contrastive"],
    strategy_weights=[0.5, 0.3, 0.2],
)

print(df.head())
# Columns: id, text, label, source, generated_at, language
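
To make the effect of `strategy_weights` concrete, here is a sketch of how 1000 samples might be divided across the three strategies above. It uses the largest-remainder method so the counts always add up to `num_samples`. `allocate_samples` is a hypothetical function; how synthetictext actually allocates samples internally is not documented here and may differ.

```python
import math


def allocate_samples(num_samples, strategies, weights):
    """Split num_samples across strategies in proportion to weights.

    Illustrative only: uses the largest-remainder method so the counts
    always sum to num_samples.
    """
    if len(strategies) != len(weights):
        raise ValueError("strategies and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Ideal (fractional) share for each strategy, rounded down.
    shares = [num_samples * w / total for w in weights]
    counts = [math.floor(s) for s in shares]

    # Hand out the leftover samples to the largest fractional remainders.
    leftover = num_samples - sum(counts)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1

    return dict(zip(strategies, counts))


plan = allocate_samples(
    1000, ["direct", "paraphrase", "contrastive"], [0.5, 0.3, 0.2]
)
print(plan)  # {'direct': 500, 'paraphrase': 300, 'contrastive': 200}
```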

Generation Strategies

Strategy	Description	Requires
direct	Generate new samples in the target language	LLM
paraphrase	Rewrite existing samples preserving labels	LLM + training data
contrastive	Generate minimal pairs (one per class, same topic)	LLM
backtranslation	Round-trip translate through a pivot language	Translation API + training data
pivot	Generate in English, translate to target language	LLM + Translation API
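
As an illustration of the backtranslation row, the sketch below round-trips each labeled sample through a pivot language and keeps only the results whose wording changed. The `translate` callable, `toy_translate`, and `TOY_TABLE` are hypothetical stand-ins for a real translation API; none of these names come from synthetictext.

```python
from typing import Callable

# (text, source_language, target_language) -> translated text
Translate = Callable[[str, str, str], str]


def backtranslate(text: str, translate: Translate, source: str, pivot: str) -> str:
    """Round-trip text through a pivot language to obtain a paraphrase."""
    pivot_text = translate(text, source, pivot)
    return translate(pivot_text, pivot, source)


def augment(samples, translate: Translate, source: str, pivot: str):
    """Return (text, label) pairs whose round trip changed the wording.

    The label is carried over unchanged; identical round trips add nothing new.
    """
    augmented = []
    for text, label in samples:
        candidate = backtranslate(text, translate, source, pivot)
        if candidate.strip().lower() != text.strip().lower():
            augmented.append((candidate, label))
    return augmented


# Toy stand-in for a real translation API, for illustration only.
TOY_TABLE = {
    ("great phone", "en", "de"): "tolles Handy",
    ("tolles Handy", "de", "en"): "awesome phone",
    ("bad", "en", "de"): "schlecht",
    ("schlecht", "de", "en"): "bad",
}


def toy_translate(text: str, source: str, target: str) -> str:
    return TOY_TABLE[(text, source, target)]


result = augment([("great phone", 1), ("bad", 0)], toy_translate, "en", "de")
print(result)  # [('awesome phone', 1)]
```

This also shows why the strategy needs existing training data: it produces nothing without labeled input texts.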

Quality Filtering

The default filtering pipeline runs these steps in order:

BasicFilter -- remove empty, null, or out-of-range-length samples
LeakageFilter -- remove samples containing label leakage patterns (e.g., "this is a positive example")
EmbeddingDeduplicator -- remove near-duplicates using multilingual sentence embeddings (cosine similarity > 0.90)
MarkerFilter -- (optional) ensure samples for specific labels contain expected keywords

Additional optional filters:

LLMJudgeFilter -- use an LLM to validate realism, label correctness, clarity, and grammar
TranslationQualityFilter -- check round-trip translation consistency for pivot/backtranslation samples
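
The default steps described above can be sketched in plain Python as follows. This is not the library's implementation: the leakage regex, the word-count bounds, and the bag-of-words `toy_embed` (used here in place of multilingual sentence embeddings) are all assumptions chosen to keep the example self-contained. Only the 0.90 cosine threshold is taken from the description above.

```python
import math
import re

# Hypothetical leakage pattern: text that names its own label.
LEAKAGE = re.compile(r"\bthis is an? \w+ (example|sample|review)\b", re.IGNORECASE)


def basic_filter(texts, min_words=3, max_words=80):
    """Drop empty, None, or out-of-range-length samples (bounds are assumed)."""
    return [t for t in texts if t and min_words <= len(t.split()) <= max_words]


def toy_embed(text):
    """Stand-in for a sentence-embedding model: bag-of-words counts."""
    counts = {}
    for word in re.findall(r"\w+", text.lower()):
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def dedup_by_similarity(texts, embed=toy_embed, threshold=0.90):
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_vectors = [], []
    for text in texts:
        vector = embed(text)
        if all(cosine(vector, other) <= threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


def run_pipeline(texts):
    """Apply the three steps in the documented order."""
    texts = basic_filter(texts)
    texts = [t for t in texts if not LEAKAGE.search(t)]
    return dedup_by_similarity(texts)


samples = [
    "The battery lasts all day and charges fast.",
    "The battery lasts all day and charges fast!",  # near-duplicate
    "This is a positive example of a review.",      # label leakage
    "",                                             # empty
    "Delivery was late and the food arrived cold.",
]
filtered = run_pipeline(samples)
print(filtered)
```

Running the cheap checks first means the pairwise similarity comparison only sees samples that already passed them.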

Multilingual Generation

from synthetictext import LanguageConfig, SyntheticDataGenerator

lang_config = LanguageConfig(
    languages={"en": "English", "de": "German", "es": "Spanish"},
    related_languages={"es": "en"},  # cross-lingual transfer
)

generator = SyntheticDataGenerator(
    task=task,
    llm_provider="openai",
    lang_config=lang_config,
)

# Generate for all configured languages
results = generator.generate_all(num_samples=500)

CLI Usage

# Generate from a JSON task config
synthetictext generate --config task.json --language en --num-samples 1000

# Generate for all languages
synthetictext generate --config task.json --all --num-samples 500 --output-dir ./output

# Multiple strategies with weights
synthetictext generate --config task.json -l en -n 1000 \
    --strategies direct paraphrase contrastive \
    --weights 0.5 0.3 0.2

# Filter existing synthetic data
synthetictext filter --config task.json --input synthetic.csv --output filtered.csv

Task Config File (JSON)

{
    "name": "Toxicity Detection",
    "labels": {"0": "non-toxic", "1": "toxic"},
    "description": "Classify social media posts as toxic or non-toxic.",
    "label_descriptions": {
        "0": "A post that discusses topics respectfully.",
        "1": "A post containing insults, threats, or dehumanizing language."
    },
    "text_domain": "social media post",
    "word_count_range": [20, 80],
    "topics": {
        "political": ["elections", "government policy"],
        "social": ["gender issues", "immigration"]
    }
}
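
JSON object keys are always strings, so the label keys in this file ("0", "1") differ from the integer keys used in the Python Quick Start. The sketch below loads a config and converts the keys to integers. `load_task_kwargs` is a hypothetical helper; whether synthetictext performs this conversion itself is not stated here.

```python
import json


def load_task_kwargs(raw_json):
    """Parse a task config and convert label keys from strings to ints."""
    config = json.loads(raw_json)
    for field in ("labels", "label_descriptions"):
        if field in config:
            config[field] = {int(key): value for key, value in config[field].items()}

    # Every described label should also be a declared label.
    unknown = set(config.get("label_descriptions", {})) - set(config.get("labels", {}))
    if unknown:
        raise ValueError(f"label_descriptions has undeclared labels: {sorted(unknown)}")
    return config


raw = """
{
    "name": "Toxicity Detection",
    "labels": {"0": "non-toxic", "1": "toxic"},
    "label_descriptions": {"0": "Respectful post.", "1": "Insulting post."},
    "word_count_range": [20, 80]
}
"""
kwargs = load_task_kwargs(raw)
print(kwargs["labels"])  # {0: 'non-toxic', 1: 'toxic'}
```

The result could then be passed on as `TaskSpec(**kwargs)`, assuming `TaskSpec` accepts the same field names as the JSON file, which is not confirmed here.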

Custom Providers

Extend BaseLLMProvider to use any LLM backend:

from synthetictext.providers.base import BaseLLMProvider

class AnthropicProvider(BaseLLMProvider):
    def generate(self, prompt, *, model=None, temperature=0.9,
                 max_tokens=250, system_prompt=None):
        # Your Anthropic API call here
        ...

Tutorials

Step-by-step Jupyter notebooks in tutorials/:

Quick Start -- defining tasks, generating data, combining strategies
Quality Filtering -- using built-in filters, building custom filter pipelines
Multilingual Generation -- LanguageConfig, backtranslation, pivot translation, tier-based strategies

Examples

examples/polarization_detection.py -- Full recreation of the SemEval-2026 Task 9 pipeline (22 languages)
examples/toxicity_detection.py -- Simple binary toxicity classifier data generation

Development

git clone https://github.com/srikarkashyap/synthetictext.git
cd synthetictext
pip install -e ".[dev]"
pytest

Origin

This library was developed from the synthetic data pipeline used in the PSK system for SemEval-2026 Task 9: Multilingual Polarization Detection, where it generated training data across 22 languages and contributed to a 2nd-place finish.

License

MIT

Project details

This version

0.1.0

May 30, 2026

Download files

Source Distribution

synthetictext-0.1.0.tar.gz (47.4 kB view details)

Uploaded May 30, 2026 Source

Built Distribution

synthetictext-0.1.0-py3-none-any.whl (39.7 kB view details)

Uploaded May 30, 2026 Python 3

File details

Details for the file synthetictext-0.1.0.tar.gz.

File metadata

Download URL: synthetictext-0.1.0.tar.gz
Upload date: May 30, 2026
Size: 47.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for synthetictext-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b29fb5fbb5388aced7dc1c8f15af59257c03b3c5e5d98776e0fa550f22e88505`
MD5	`cc8a45c60623b03dba8e8d9ebd7effdb`
BLAKE2b-256	`a75548409d2df9c310ea333aeadbe86a55f1312a9847904d15f7468feb05ccd1`

File details

Details for the file synthetictext-0.1.0-py3-none-any.whl.

File metadata

Download URL: synthetictext-0.1.0-py3-none-any.whl
Upload date: May 30, 2026
Size: 39.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for synthetictext-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a1d0837fd8b8cc2acf77b03abe49d2095a7bba00424e2d204eaee992a4febbc7`
MD5	`ec41a97862c3f9b780114a65923db06c`
BLAKE2b-256	`abbb0c876635995948c671a3650da858fc93503d4245806bd702bf8e8ee268c0`
