FakeSmith

Generate realistic fake data that mirrors your real data's shape — safe to share with LLMs, teammates, or in public repos.

A Python package and CLI that converts real configs, payloads, logs, and datasets into schema-preserving synthetic versions. It targets a growing workflow problem: sanitizing real developer artifacts before they are shared with an LLM.

When you share code with an AI assistant, you shouldn't have to expose real emails, API keys, card numbers, or user data. FakeSmith lets you describe (or just paste) a sample of your data and instantly get structurally identical but completely fake replacements.


Install

pip install fakesmith

Quick Start

Option 1 — Auto-detect from a sample

from fakesmith import FakeSmith

# Paste a real (or representative) sample — FakeSmith reads its shape
sample = '''[{
    "user_id": "3f2e1a4b-0000-0000-0000-000000000000",
    "email": "john.doe@company.com",
    "phone": "+1-800-555-0199",
    "api_key": "sk-abc123def456ghi789jkl012",
    "amount": 199.99,
    "status": "active",
    "created_at": "2024-01-15T09:30:00"
}]'''

smith = FakeSmith.from_sample(sample)
smith.describe()          # see what was detected
print(smith.to_json(5))   # 5 fake records, same shape

Option 2 — Explicit schema

from fakesmith import FakeSmith, SchemaField, FieldType

smith = FakeSmith([
    SchemaField("user_id",  FieldType.UUID),
    SchemaField("email",    FieldType.EMAIL),
    SchemaField("name",     FieldType.FULL_NAME),
    SchemaField("amount",   FieldType.AMOUNT, min_value=10, max_value=5000),
    SchemaField("status",   FieldType.STATUS, choices=["active", "inactive", "pending"]),
    SchemaField("api_key",  FieldType.API_KEY, prefix="sk-live-"),
])

# Generate deterministic records with a seed
result = smith.generate(10, seed=42)
result.print_summary()  # See which fields were faked
records = result.records # Access the list of dicts
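
The seed argument above makes output reproducible. As a rough illustration of that idea, and not FakeSmith's actual implementation, the sketch below uses only the standard library. The function generate_records and its three fields are hypothetical names chosen for this example.

```python
import random
import uuid


def generate_records(count, seed=None):
    """Hypothetical stand-in for seeded generation; not FakeSmith's code."""
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    records = []
    for _ in range(count):
        records.append({
            # Build a version-4 UUID from the seeded RNG instead of os.urandom
            "user_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "amount": round(rng.uniform(10, 5000), 2),
            "status": rng.choice(["active", "inactive", "pending"]),
        })
    return records


first = generate_records(3, seed=42)
second = generate_records(3, seed=42)
print(first == second)  # same seed, same records
```

Keeping the generator on its own random.Random instance means two runs with the same seed match even if other code calls the random module in between.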

Option 3 — Quick dict shorthand

smith = FakeSmith.from_dict({
    "id":       FieldType.UUID,
    "email":    FieldType.EMAIL,
    "score":    FieldType.INTEGER,
    "verified": FieldType.BOOLEAN,
})

Output Formats

smith.to_json(10)                          # JSON string
smith.to_csv(10)                           # CSV string
smith.to_sql(10, table_name="users")       # SQL INSERT statements
smith.to_env()                             # .env file format

smith.save_json("fake_users.json", 100)    # save to file
smith.save_csv("fake_users.csv",  100)
smith.save_sql("seed.sql",        100, table_name="users")
smith.save_env(".env.fake")
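
To make the output formats concrete, here is a minimal standard-library sketch of how a list of record dicts can be rendered as JSON, CSV, and SQL INSERT statements. It only illustrates the shape of each format; the helper names (to_csv, to_sql, sql_literal) are written for this example, are not FakeSmith internals, and FakeSmith's real output may differ in quoting and layout.

```python
import csv
import io
import json

records = [
    {"id": 1, "email": "fake.user@example.com", "note": "it's fake"},
    {"id": 2, "email": "other.user@example.com", "note": None},
]


def to_csv(rows):
    """Render a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def sql_literal(value):
    """Format a Python value as a SQL literal (illustrative, not injection-proof)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"  # double embedded quotes


def to_sql(rows, table_name):
    """Render one INSERT statement per record."""
    statements = []
    for row in rows:
        columns = ", ".join(row)
        values = ", ".join(sql_literal(v) for v in row.values())
        statements.append(f"INSERT INTO {table_name} ({columns}) VALUES ({values});")
    return "\n".join(statements)


print(json.dumps(records, indent=2))
print(to_csv(records))
print(to_sql(records, "users"))
```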

CLI

# Generate 20 fake records from a JSON sample
fakesmith generate --file real_sample.json --count 20 --format json

# From CSV, output as SQL inserts
fakesmith generate --file data.csv --count 50 --format sql --table transactions

# Deterministic output using a seed
fakesmith generate --file data.json --seed 42 --out fake_data.json

# Sanitize raw text (log lines, configs) in-place
fakesmith sanitize --file server.log --out clean.log --summary

# Inspect detected schema and sensitivity flags
fakesmith describe --file data.json

In-place Sanitization

FakeSmith can scan raw text (log lines, configuration blocks, or emails) and replace PII/secrets in-place without needing a schema.

from fakesmith import sanitize_text

raw_text = "My email is alex@example.com and my key is sk-12345"
result = sanitize_text(raw_text, seed=42)

print(result.sanitized)
# "My email is fake.user@domain.com and my key is sk-a1b2c3d4..."

result.print_summary() # See exactly what was replaced and why
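
For intuition about schema-free sanitization, the sketch below shows one common approach: regular expressions find sensitive substrings and a seeded RNG produces the replacements. This is an assumption about the general technique, not a description of how sanitize_text is implemented. The patterns, the sanitize function, and the "sk-" key shape are illustrative choices.

```python
import random
import re
import string

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KEY_RE = re.compile(r"\bsk-[A-Za-z0-9]+")  # assumed key shape: "sk-" plus alphanumerics


def sanitize(text, seed=None):
    """Replace emails and sk- style keys; return the new text and a change log."""
    rng = random.Random(seed)
    replaced = []

    def fake_email(match):
        user = "".join(rng.choices(string.ascii_lowercase, k=8))
        fake = f"{user}@example.com"
        replaced.append(("email", match.group(0), fake))
        return fake

    def fake_key(match):
        body_len = len(match.group(0)) - len("sk-")  # keep the original length
        alphabet = string.ascii_lowercase + string.digits
        fake = "sk-" + "".join(rng.choices(alphabet, k=body_len))
        replaced.append(("api_key", match.group(0), fake))
        return fake

    text = EMAIL_RE.sub(fake_email, text)
    text = KEY_RE.sub(fake_key, text)
    return text, replaced


clean, changes = sanitize("My email is alex@example.com and my key is sk-12345", seed=42)
print(clean)
for kind, old, new in changes:
    print(f"{kind}: {old} -> {new}")
```

Returning the change log alongside the text mirrors the idea of a summary: the caller can review what was replaced before sharing the result.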

Run the Samples

Try out FakeSmith on the included sample datasets (JSON, CSV, and .env) using the demo script:

  1. Setup Environment

    python3 -m venv venv
    source venv/bin/activate
    pip install faker pytest
    
  2. Run the Samples. To run any script in the examples/ folder from a source checkout, set PYTHONPATH to include the current directory (the repository root):

    # Set PYTHONPATH to the root so Python can find the 'fakesmith' package
    export PYTHONPATH=$PYTHONPATH:.
    
    # Run the main demo
    python3 examples/demo_all.py
    
    # Or run any individual sample
    python3 examples/export_to_sql_csv.py
    python3 examples/sanitize_logs_in_place.py
    
  3. Explore the examples/ directory. It contains several targeted scripts illustrating different features (auto-detection, manual schemas, in-place sanitization, etc.).


Override Auto-Detection

smith = FakeSmith.from_sample(
    my_json,
    overrides={
        # Auto-detected "status" as SENTENCE — override to proper STATUS
        "status": SchemaField("status", FieldType.STATUS, choices=["open", "closed", "resolved"]),
        # Keep a realistic amount range
        "balance": SchemaField("balance", FieldType.AMOUNT, min_value=0, max_value=100000),
    }
)
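
The interplay between detection and overrides can be pictured as a two-step merge: guess a type for every field, then let explicit entries replace the guesses. The sketch below is a simplified model under that assumption. The functions detect_type and build_schema, the regular expressions, and the use of plain strings as type labels are hypothetical and do not reflect FakeSmith's detection rules.

```python
import re

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def detect_type(value):
    """Guess a coarse type label from one sample value (hypothetical heuristic)."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        if UUID_RE.fullmatch(value):
            return "UUID"
        if EMAIL_RE.fullmatch(value):
            return "EMAIL"
    return "SENTENCE"  # fallback for anything unrecognized


def build_schema(sample, overrides=None):
    """Detect a type per field, then let explicit overrides win."""
    schema = {name: detect_type(value) for name, value in sample.items()}
    schema.update(overrides or {})
    return schema


sample = {"id": "3f2e1a4b-0000-0000-0000-000000000000", "status": "open", "balance": 120.5}
print(build_schema(sample, overrides={"status": "STATUS", "balance": "AMOUNT"}))
```

In this model a short free-text value such as "open" falls through to the SENTENCE fallback, which is the kind of misdetection an override is meant to correct.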

Custom Fields

import random

smith = FakeSmith([
    SchemaField("ref_code", FieldType.CUSTOM,
        generator=lambda: f"REF-{random.randint(10000, 99999)}"
    ),
    SchemaField("tier", FieldType.CUSTOM,
        generator=lambda: random.choice(["bronze", "silver", "gold", "platinum"])
    ),
])
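
The generators above draw from Python's global random module. This README does not say whether the seed passed to generate() also controls that module, so if custom values must be reproducible, one cautious option is to give the generators their own seeded RNG. The sketch below shows that pattern with the standard library only; make_generators and sample_values are hypothetical helpers written for this example.

```python
import random


def make_generators(seed):
    """Build custom generators that share one seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)
    return {
        "ref_code": lambda: f"REF-{rng.randint(10000, 99999)}",
        "tier": lambda: rng.choice(["bronze", "silver", "gold", "platinum"]),
    }


def sample_values(seed, count=3):
    """Call each generator once per record, starting from a fresh seeded RNG."""
    gens = make_generators(seed)
    return [{name: gen() for name, gen in gens.items()} for _ in range(count)]


print(sample_values(7) == sample_values(7))  # same seed, same values
```

Each returned callable takes no arguments, so it has the same shape as the lambdas passed to generator= above.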

Supported Field Types

Category     Types
Identity     UUID, FULL_NAME, FIRST_NAME, LAST_NAME, USERNAME, EMAIL, PHONE, PASSWORD, PASSWORD_HASH
Location     ADDRESS, CITY, STATE, COUNTRY, ZIP_CODE, LATITUDE, LONGITUDE
Finance      CARD_NUMBER, CARD_EXPIRY, CARD_CVV, BANK_ACCOUNT, IBAN, AMOUNT, CURRENCY
Business     COMPANY, JOB_TITLE, DEPARTMENT, API_KEY, SECRET_TOKEN, JWT_TOKEN, WEBHOOK_URL
Dates        DATETIME, DATE, TIME, DATE_OF_BIRTH, TIMESTAMP
Web & Tech   IP_ADDRESS, IPV6, MAC_ADDRESS, USER_AGENT, URL, DOMAIN, SLUG, JWT_TOKEN
Content      WORD, SENTENCE, PARAGRAPH, TITLE, DESCRIPTION, TAG
Numeric      INTEGER, FLOAT, BOOLEAN, PERCENTAGE
Enums        STATUS, GENDER, CUSTOM

Why FakeSmith?

  • LLM-safe — no real credentials, PII, or secrets ever leave your machine
  • Zero config — paste a sample and go
  • Structurally identical — same field names, same types, realistic values
  • All formats — JSON, CSV, SQL, .env
  • Extensible — override any field with a custom generator

License

MIT
