
llmfsd: LLM Fake Structured Data, faking Structured Data from any LLM

Project description

llmfsd: LLM Fake Structured Data

llmfsd is a Python package for generating fake structured data with any Large Language Model (LLM). You write SQL-like queries and receive simulated structured data in formats such as JSON or CSV. The tool is highly customizable and supports multiple AI providers (thanks to aisuite).

Features

  • Generate fake structured data via SQL queries.
  • Supports JSON and CSV output formats.
  • Define custom data models to control schema and descriptions.
  • Integrates with various AI providers (e.g., OpenAI, Mistral, Google, Anthropic).

Installation

Install llmfsd using pip:

pip install llmfsd

Install a Provider’s Package Along with aisuite

llmfsd supports all AI providers supported by aisuite. If you have not already installed the provider’s package, you can do so along with llmfsd. For example:

pip install "llmfsd[mistral]"

Alternatively, you can install the provider’s package directly with aisuite:

pip install "aisuite[mistral]"

For more details, visit the aisuite repository.

Usage

Basic Example

Here’s a simple example to get started:

from llmfsd import Faker

# Initialize Faker with your LLM model_id (aisuite id format)
faker = Faker(model_id="mistral:mistral-large-latest")

# Generate JSON data
print(faker.json("SELECT uuid, name FROM phone_brands LIMIT 4"))

"""
Output:
[
 {'uuid': 'f47ac10b-58cc-4372-a567-0e02b2c3d479', 'name': 'Nokia'},
 {'uuid': 'f7bac13b-58cc-4372-a567-0e02b2c3d479', 'name': 'Samsung'}, 
 {'uuid': 'f98ac12b-58cc-4372-a567-0e02b2c3d479', 'name': 'Apple'},
 {'uuid': 'f47ac10b-58cc-4972-a567-0e02b2c3d479', 'name': 'Sony'}
]
"""

# Generate CSV data
print(faker.csv("SELECT id, color FROM colors LIMIT 2"))

"""
Output:
id,color
1,red
2,blue
"""

More Advanced Example with Data Models

You can define custom data models to control the structure of your fake data.

from llmfsd import Faker, DataModel

# Define data models

model = DataModel("dogs", 
    {"id": "Number in range(5,20)", "name": None, "breed": "Breed of the dog"}
)

# Initialize Faker with data models
faker = Faker(model_id="mistral:mistral-large-latest", data_models=[model])

# Generate JSON data for a specific model
print(faker.json("SELECT * FROM dogs LIMIT 3"))

"""
Output:
[
  {
    "id": 7,
    "name": "Buddy",
    "breed": "Labrador"
  },
  {
    "id": 12,
    "name": "Charlie",
    "breed": "Golden Retriever"
  },
  {
    "id": 15,
    "name": "Max",
    "breed": "German Shepherd"
  }
]
"""

AI Providers

To use a different provider, set the model_id parameter during Faker initialization, using the aisuite id format (provider:model).

Examples

faker1 = Faker(model_id="groq:llama-3.2-3b-preview")

faker2 = Faker(model_id="openai:gpt-3.5-turbo")

faker3 = Faker(model_id="huggingface:mistralai/Mistral-7B-Instruct-v0.3")

Each provider requires its own API key. Use environment variables or configuration files to store your API keys securely. For example, the Mistral provider requires MISTRAL_API_KEY:

export MISTRAL_API_KEY="your-mistral-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
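The keys can also be set from within Python before constructing Faker, since aisuite reads them from the environment. A minimal sketch with a placeholder value:

```python
import os

# Equivalent to `export MISTRAL_API_KEY=...` in the shell; set this before
# constructing Faker so aisuite can pick the key up from the environment.
os.environ["MISTRAL_API_KEY"] = "your-mistral-api-key"  # placeholder, not a real key

# faker = Faker(model_id="mistral:mistral-large-latest")  # would now authenticate
```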

Methods

json(query: str, output: Optional[str] = None) -> list[dict] | None

Generate fake structured data in JSON format.

  • query: The SQL query to execute.
  • output: File path to save the JSON output. If None, returns the data directly.

csv(query: str, output: Optional[str] = None) -> str | None

Generate fake structured data in CSV format.

  • query: The SQL query to execute.
  • output: File path to save the CSV output. If None, returns the data directly.

Custom Data Models

You can create custom schemas using DataModel, defining either a list of attributes or a dictionary with descriptions.

DataModel allows you to use * as a wildcard in queries or provide minimal descriptions for your attributes to the LLM.

Avoid unnecessary descriptions, as they increase token consumption. If the attributes are self-explanatory to the LLM, use a plain list. When using a dictionary-based schema, you can set an attribute's value to None and provide descriptions only for the attributes you wish to clarify.

Example:

from llmfsd import DataModel

# Schema as a list

model1 = DataModel("cars", ["brand", "model", "year"])

# Schema as a dictionary

model2 = DataModel("pets", {
    "id" : "uuid string",
    "name": None,
    "age":  None,
    "species": "Type of pet (e.g., dog, cat)"
})

Pass these models to Faker during initialization:

faker = Faker(model_id="openai:gpt-4o", data_models=[model1, model2])

Saving Output to a File

Both json and csv methods support saving results directly to a file.

# Save JSON data to a file

faker.json("SELECT * FROM artists LIMIT 20", output="artists.json")

# Save CSV data to a file

faker.csv("SELECT name, age FROM pets LIMIT 20", output="pets.csv")

GitHub

https://github.com/dinyad-prog00/llmfsd

Project details


Download files


Source Distribution

llmfsd-0.1.1.tar.gz (6.0 kB)

Uploaded Source

Built Distribution


llmfsd-0.1.1-py3-none-any.whl (7.1 kB)

Uploaded Python 3

File details

Details for the file llmfsd-0.1.1.tar.gz.

File metadata

  • Download URL: llmfsd-0.1.1.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.10.15 Darwin/24.0.0

File hashes

Hashes for llmfsd-0.1.1.tar.gz
Algorithm Hash digest
SHA256 082a0f89bc9237c036aa723a2e6b442f65525f31c2619fa78fda30ceea744dd4
MD5 20f5e18bc320c8952feb1ae7cae6c999
BLAKE2b-256 8cbcee6871f308664509cc0c89ef56c3389cf15043faed7829039584a9c1e70d


File details

Details for the file llmfsd-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: llmfsd-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 7.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.10.15 Darwin/24.0.0

File hashes

Hashes for llmfsd-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4adf1a2e814d73e3dd4e2d44845d5880616dd783a6d86cef4078023cf722b27e
MD5 db841c0dc0c8a1b588056c18516f4954
BLAKE2b-256 9f1d55d51eb53d86177bad16ddba8128c66f5b7206634191286c597c7d6f61b8

