llmfsd: LLM Fake Structured Data, faking Structured Data from any LLM

llmfsd: LLM Fake Structured Data

llmfsd is a Python package that generates fake structured data using any Large Language Model (LLM). You write a SQL-like query, and the model returns matching fake rows as JSON or CSV. Schemas can be customized with data models, and multiple AI providers are supported through aisuite.

Features

  • Generate fake structured data via SQL queries.
  • Supports JSON and CSV output formats.
  • Language selection for descriptive attributes.
  • Define custom data models to control schema and descriptions.
  • Integrates with various AI providers (e.g., OpenAI, Mistral, Google, Anthropic).

Installation

Install llmfsd using pip:

pip install llmfsd

Install a Provider’s Package Along with aisuite

llmfsd supports all AI providers supported by aisuite. If you have not already installed the provider’s package, you can do so along with llmfsd. For example:

pip install "llmfsd[mistral]"

Alternatively, you can install the provider’s package directly with aisuite:

pip install "aisuite[mistral]"

For more details, visit the aisuite repository.

Usage

Basic Example

Here’s a simple example to get started:

from llmfsd import Faker

# Initialize Faker with your LLM model ID (aisuite "provider:model" format)
faker = Faker(model_id="mistral:mistral-large-latest")

# Or specify a language for descriptive attributes. Defaults to English.
faker = Faker(model_id="mistral:mistral-large-latest", lang="french")

# Generate JSON data
print(faker.json("SELECT uuid, name FROM phone_brands LIMIT 4"))

"""
Output:
[
 {'uuid': 'f47ac10b-58cc-4372-a567-0e02b2c3d479', 'name': 'Nokia'},
 {'uuid': 'f7bac13b-58cc-4372-a567-0e02b2c3d479', 'name': 'Samsung'}, 
 {'uuid': 'f98ac12b-58cc-4372-a567-0e02b2c3d479', 'name': 'Apple'},
 {'uuid': 'f47ac10b-58cc-4972-a567-0e02b2c3d479', 'name': 'Sony'}
]
"""

# Generate CSV data
print(faker.csv("SELECT id, color FROM colors LIMIT 2"))

"""
Output:
id,color
1,red
2,blue
"""

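Because the rows come from an LLM, it can be worth checking them before use. The sketch below is not part of llmfsd: check_rows is a hypothetical helper, and the sample rows are copied from the output above rather than produced by a model call. It checks the row count against the LIMIT, the column names against the SELECT list, and that any uuid value parses.

```python
import uuid


def check_rows(rows, columns, limit):
    """Return a list of problems found in rows; an empty list means none were found."""
    problems = []
    if len(rows) > limit:
        problems.append(f"expected at most {limit} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Every row should carry exactly the selected columns.
        if set(row) != set(columns):
            problems.append(f"row {i}: columns {sorted(row)} != {sorted(columns)}")
        # If a uuid column is present, it should parse as a UUID.
        if "uuid" in row:
            try:
                uuid.UUID(str(row["uuid"]))
            except ValueError:
                problems.append(f"row {i}: invalid uuid {row['uuid']!r}")
    return problems


# Sample rows copied from the output above (no LLM call is made here).
rows = [
    {"uuid": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "name": "Nokia"},
    {"uuid": "f7bac13b-58cc-4372-a567-0e02b2c3d479", "name": "Samsung"},
]
print(check_rows(rows, ["uuid", "name"], limit=4))  # prints []
```

In real use, rows would be the value returned by faker.json(...).
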
More Advanced Example with Data Models

You can define custom data models to control the structure of your fake data.

from llmfsd import Faker, DataModel

# Define data models

model = DataModel("dogs", 
    {"id": "Number in range(5,20)", "name": None, "breed": "Breed of the dog"}
)

# Initialize Faker with data models
faker = Faker(model_id="mistral:mistral-large-latest", data_models=[model])

# Generate JSON data for a specific model
print(faker.json("SELECT * FROM dogs LIMIT 3"))

"""
Output:
[
  {
    "id": 7,
    "name": "Buddy",
    "breed": "Labrador"
  },
  {
    "id": 12,
    "name": "Charlie",
    "breed": "Golden Retriever"
  },
  {
    "id": 15,
    "name": "Max",
    "breed": "German Shepherd"
  }
]
"""

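A description such as "Number in range(5,20)" is a hint to the model, not an enforced constraint, so a quick check on the result may be useful. The sketch below assumes the description is meant like Python's half-open range (5 <= id < 20); a model may read it differently. out_of_range_ids is a hypothetical helper, not part of llmfsd.

```python
def out_of_range_ids(rows, low=5, high=20):
    """Return the ids that fall outside range(low, high), i.e. not low <= id < high."""
    return [row["id"] for row in rows if row["id"] not in range(low, high)]


# Sample rows copied from the output above (no LLM call is made here).
rows = [
    {"id": 7, "name": "Buddy", "breed": "Labrador"},
    {"id": 12, "name": "Charlie", "breed": "Golden Retriever"},
    {"id": 15, "name": "Max", "breed": "German Shepherd"},
]
print(out_of_range_ids(rows))  # prints []
```
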
AI Providers

To use a different provider, set the model_id parameter when initializing Faker, using the aisuite format provider:model.

Examples

faker1 = Faker(model_id="groq:llama-3.2-3b-preview")

faker2 = Faker(model_id="openai:gpt-3.5-turbo")

faker3 = Faker(model_id="huggingface:mistralai/Mistral-7B-Instruct-v0.3")

Each provider requires its own API key. Store keys in environment variables or a configuration file rather than in your code. For example, Mistral models need MISTRAL_API_KEY:

export MISTRAL_API_KEY="your-mistral-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
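
A missing key usually surfaces only when the first request fails, so it can help to check before creating a Faker. The sketch below is an illustration, not part of llmfsd or aisuite: missing_api_key and KEY_VARS are hypothetical names, and the mapping covers only the three variables shown above. Providers outside that mapping are skipped rather than guessed.

```python
import os

# Only the three variable names shown above; other providers are deliberately left out.
KEY_VARS = {
    "mistral": "MISTRAL_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_api_key(model_id, env=None):
    """Return the name of the expected but unset variable, or None if nothing is missing."""
    env = os.environ if env is None else env
    # The provider is the part of the model ID before the first colon.
    provider, _, _model = model_id.partition(":")
    var = KEY_VARS.get(provider)
    if var is not None and not env.get(var):
        return var
    return None


print(missing_api_key("mistral:mistral-large-latest", env={}))  # prints MISTRAL_API_KEY
```

Calling missing_api_key(model_id) without env checks the real process environment.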

Methods

json(query: str, output: Optional[str] = None) -> list[dict] | None

Generate fake structured data in JSON format.

  • query: The SQL query to execute.
  • output: File path to save the JSON output. If None, returns the data directly.

csv(query: str, output: Optional[str] = None) -> str | None

Generate fake structured data in CSV format.

  • query: The SQL query to execute.
  • output: File path to save the CSV output. If None, returns the data directly.
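
Since csv returns a string when output is None, the text still has to be parsed before the rows can be used. A minimal sketch with the standard library, assuming the returned text is comma-separated with a header row as in the Basic Example; parse_csv_text is a hypothetical helper, and the sample string stands in for a real faker.csv(...) result.

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV text with a header row into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


# Stand-in for the string returned by faker.csv("SELECT id, color FROM colors LIMIT 2").
sample = "id,color\n1,red\n2,blue\n"
rows = parse_csv_text(sample)
print(rows)  # prints [{'id': '1', 'color': 'red'}, {'id': '2', 'color': 'blue'}]
```

Note that CSV carries no type information, so numeric columns such as id come back as strings and need converting if numbers are required.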

Custom Data Models

You can create custom schemas using DataModel, defining either a list of attributes or a dictionary with descriptions.

A DataModel lets you use * as a wildcard in queries and gives the LLM short descriptions of your attributes.

Avoid providing unnecessary descriptions, as they can increase token consumption. It is recommended to use a list of attributes if the attributes are self-explanatory for the LLM. When using a dictionary-based schema, you can leave None for some attributes and provide descriptions only for those you wish to clarify.

Example:

from llmfsd import DataModel

Schema as a list

model1 = DataModel("cars", ["brand", "model", "year"])

Schema as a dictionary

model2 = DataModel("pets", {
    "id" : "uuid string",
    "name": None,
    "age":  None,
    "species": "Type of pet (e.g., dog, cat)"
})

Pass these models to Faker during initialization:

faker = Faker(model_id="openai:gpt-4o", data_models=[model1, model2])
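
When most attributes are self-explanatory, a dictionary schema can be built from a plain attribute list plus descriptions for the few that need them. The sketch below is plain Python and does not import llmfsd; build_schema is a hypothetical helper, and it rests on the statement above that attributes without a description can be left as None.

```python
def build_schema(attributes, descriptions=None):
    """Map every attribute to its description, or to None when none is given."""
    descriptions = descriptions or {}
    # Catch typos: a description for an attribute that is not in the list.
    unknown = set(descriptions) - set(attributes)
    if unknown:
        raise KeyError(f"descriptions for unknown attributes: {sorted(unknown)}")
    return {name: descriptions.get(name) for name in attributes}


schema = build_schema(
    ["id", "name", "age", "species"],
    {"id": "uuid string", "species": "Type of pet (e.g., dog, cat)"},
)
print(schema)
```

The resulting dictionary has the same shape as the pets schema above and could be passed as the second argument to DataModel.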

Saving Output to a File

Both json and csv methods support saving results directly to a file.

Save JSON data to a file

faker.json("SELECT * FROM artists LIMIT 20", output="artists.json")

Save CSV data to a file

faker.csv("SELECT name, age FROM pets LIMIT 20", output="pets.csv")

GitHub

https://github.com/dinyad-prog00/llmfsd
