llmfsd: LLM Fake Structured Data, faking Structured Data from any LLM
llmfsd is a Python package that generates fake structured data using any Large Language Model (LLM). You write SQL-like queries and receive simulated structured data in formats like JSON or CSV. The tool is highly customizable and, thanks to aisuite, integrates with multiple AI providers.
Features
- Generate fake structured data via SQL queries.
- Supports JSON and CSV output formats.
- Define custom data models to control schema and descriptions.
- Integrates with various AI providers (e.g., OpenAI, Mistral, Google, Anthropic).
Installation
Install llmfsd using pip:
pip install llmfsd
Install a Provider’s Package Along with aisuite
llmfsd supports all AI providers supported by aisuite. If you have not already installed the provider’s package, you can do so along with llmfsd. For example:
pip install "llmfsd[mistral]"
Alternatively, you can install the provider’s package directly with aisuite:
pip install "aisuite[mistral]"
For more details, visit the aisuite repository.
Usage
Basic Example
Here’s a simple example to get started:
from llmfsd import Faker
# Initialize Faker with your LLM model_id (aisuite id format)
faker = Faker(model_id="mistral:mistral-large-latest")
# Generate JSON data
print(faker.json("SELECT uuid, name FROM phone_brands LIMIT 4"))
"""
Output:
[
{'uuid': 'f47ac10b-58cc-4372-a567-0e02b2c3d479', 'name': 'Nokia'},
{'uuid': 'f7bac13b-58cc-4372-a567-0e02b2c3d479', 'name': 'Samsung'},
{'uuid': 'f98ac12b-58cc-4372-a567-0e02b2c3d479', 'name': 'Apple'},
{'uuid': 'f47ac10b-58cc-4972-a567-0e02b2c3d479', 'name': 'Sony'}
]
"""
# Generate CSV data
print(faker.csv("SELECT id, color FROM colors LIMIT 2"))
"""
Output:
id,color
1,red
2,blue
"""
More Advanced Example with Data Models
You can define custom data models to control the structure of your fake data.
from llmfsd import Faker, DataModel
# Define data models
model = DataModel(
    "dogs",
    {"id": "Number in range(5,20)", "name": None, "breed": "Breed of the dog"},
)
# Initialize Faker with data models
faker = Faker(model_id="mistral:mistral-large-latest", data_models=[model])
# Generate JSON data for a specific model
print(faker.json("SELECT * FROM dogs LIMIT 3"))
"""
Output:
[
{
"id": 7,
"name": "Buddy",
"breed": "Labrador"
},
{
"id": 12,
"name": "Charlie",
"breed": "Golden Retriever"
},
{
"id": 15,
"name": "Max",
"breed": "German Shepherd"
}
]
"""
AI Providers
To use a different provider, set the model_id parameter when initializing Faker, using the aisuite format (provider:model).
Examples
faker1 = Faker(model_id="groq:llama-3.2-3b-preview")
faker2 = Faker(model_id="openai:gpt-3.5-turbo")
faker3 = Faker(model_id="huggingface:mistralai/Mistral-7B-Instruct-v0.3")
Each provider requires its own API key. Use environment variables or configuration files to store your API keys securely. For example, Mistral requires MISTRAL_API_KEY:
export MISTRAL_API_KEY="your-mistral-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
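A missing key typically surfaces only when the first query runs, so it can help to check the environment before creating a Faker. The sketch below is not part of llmfsd: `missing_api_key` and `KEY_NAMES` are hypothetical names, and the mapping covers only the three variables shown above rather than guessing names for other providers.

```python
import os

# Environment variable names taken from the export examples above.
# Other providers are deliberately left out rather than guessed.
KEY_NAMES = {
    "mistral": "MISTRAL_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_api_key(model_id, environ=os.environ):
    """Return the name of the unset key variable for model_id.

    Returns None when the key is set or when the provider is not in KEY_NAMES.
    """
    provider, _, model = model_id.partition(":")
    if not model:
        raise ValueError(f"expected 'provider:model', got {model_id!r}")
    key_name = KEY_NAMES.get(provider)
    if key_name is None:
        return None  # unknown provider: nothing to check here
    return None if environ.get(key_name) else key_name


# An empty mapping stands in for an environment with no keys set.
print(missing_api_key("mistral:mistral-large-latest", environ={}))  # MISTRAL_API_KEY
```

Passing `environ` explicitly keeps the helper easy to test; in normal use the default `os.environ` is read.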
Methods
json(query: str, output: Optional[str] = None) -> list[dict] | None
Generate fake structured data in JSON format.
- query: The SQL query to execute.
- output: File path to save the JSON output. If None, returns the data directly.
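Since json returns a list of dicts produced by an LLM, the rows may not always follow the requested schema. A minimal sketch of a post-generation check follows; `validate_rows` is a hypothetical helper, not an llmfsd function, and the sample rows are copied from the dogs example output so the block runs without an API key.

```python
def validate_rows(rows, expected_keys, checks=None):
    """Return a list of human-readable problems; an empty list means all rows passed."""
    problems = []
    checks = checks or {}
    for i, row in enumerate(rows):
        if set(row) != set(expected_keys):
            problems.append(f"row {i}: keys {sorted(row)} != {sorted(expected_keys)}")
            continue  # skip value checks when the keys are already wrong
        for key, check in checks.items():
            if not check(row[key]):
                problems.append(f"row {i}: bad value for {key!r}: {row[key]!r}")
    return problems


# Sample rows copied from the dogs example output; in real use these would
# come from faker.json("SELECT * FROM dogs LIMIT 3").
rows = [
    {"id": 7, "name": "Buddy", "breed": "Labrador"},
    {"id": 12, "name": "Charlie", "breed": "Golden Retriever"},
    {"id": 15, "name": "Max", "breed": "German Shepherd"},
]

problems = validate_rows(
    rows,
    expected_keys=["id", "name", "breed"],
    # Mirrors the "Number in range(5,20)" description from the dogs model.
    checks={"id": lambda v: isinstance(v, int) and v in range(5, 20)},
)
print(problems)  # []
```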
csv(query: str, output: Optional[str] = None) -> str | None
Generate fake structured data in CSV format.
- query: The SQL query to execute.
- output: File path to save the CSV output. If None, returns the data directly.
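When output is None, csv returns the data as a string, which can be parsed with the standard library. The sketch below assumes the returned text starts with a header row, as in the Basic Example; `parse_csv` is a hypothetical helper, and the sample text is copied from that example so no LLM call is needed.

```python
import csv
import io

# Sample text copied from the Basic Example output; in real use this would be
# the string returned by faker.csv(query) when output is None.
csv_text = "id,color\n1,red\n2,blue\n"


def parse_csv(text):
    """Parse CSV text with a header row into a list of dicts (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


rows = parse_csv(csv_text)
print(rows)  # [{'id': '1', 'color': 'red'}, {'id': '2', 'color': 'blue'}]
```

Note that every value is a string after parsing, so numeric columns such as id need an explicit conversion.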
Custom Data Models
You can create custom schemas using DataModel, defining either a list of attributes or a dictionary with descriptions.
DataModel allows you to use * as a wildcard in queries or provide minimal descriptions for your attributes to the LLM.
Avoid providing unnecessary descriptions, as they can increase token consumption. It is recommended to use a list of attributes if the attributes are self-explanatory for the LLM. When using a dictionary-based schema, you can leave None for some attributes and provide descriptions only for those you wish to clarify.
Example:
from llmfsd import DataModel
# Schema as a list
model1 = DataModel("cars", ["brand", "model", "year"])
# Schema as a dictionary
model2 = DataModel("pets", {
    "id": "uuid string",
    "name": None,
    "age": None,
    "species": "Type of pet (e.g., dog, cat)"
})
Pass these models to Faker during initialization:
faker = Faker(model_id="openai:gpt-4o", data_models=[model1, model2])
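If the attributes already exist as a dataclass in your code, a list schema can be derived from it instead of being typed by hand. This is a sketch under that assumption: `Car` and `schema_from_dataclass` are hypothetical names, and only the standard library is used.

```python
from dataclasses import dataclass, fields


@dataclass
class Car:
    brand: str
    model: str
    year: int


def schema_from_dataclass(cls):
    """Return the dataclass field names in declaration order."""
    return [f.name for f in fields(cls)]


schema = schema_from_dataclass(Car)
print(schema)  # ['brand', 'model', 'year']
# DataModel("cars", schema) would then be equivalent to model1 above.
```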
Saving Output to a File
Both json and csv methods support saving results directly to a file.
# Save JSON data to a file
faker.json("SELECT * FROM artists LIMIT 20", output="artists.json")
# Save CSV data to a file
faker.csv("SELECT name, age FROM pets LIMIT 20", output="pets.csv")
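Because each query costs an LLM call, saved files can double as a simple cache. The sketch below assumes that the file written by json(..., output=path) is standard JSON readable with json.load, which the source does not state explicitly. `load_or_generate` is a hypothetical helper, and `StubFaker` is a stand-in for llmfsd.Faker so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def load_or_generate(faker, query, path):
    """Generate data into path only if the file is missing, then load it."""
    path = Path(path)
    if not path.exists():
        faker.json(query, output=str(path))
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


class StubFaker:
    """Stand-in for llmfsd.Faker so the sketch runs offline; not part of llmfsd."""

    def __init__(self):
        self.calls = 0

    def json(self, query, output=None):
        self.calls += 1
        rows = [{"name": "stub artist"}]
        if output is None:
            return rows
        Path(output).write_text(json.dumps(rows), encoding="utf-8")
        return None


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "artists.json"
    stub = StubFaker()
    first = load_or_generate(stub, "SELECT * FROM artists LIMIT 20", target)
    second = load_or_generate(stub, "SELECT * FROM artists LIMIT 20", target)
    print(stub.calls)  # 1: the second call reads the saved file
```

The cache is keyed on the file path only, so changing the query requires a new path or deleting the old file.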