A Python library for synthetic data generation
Synthetic Data Generation Service -- WORK IN PROGRESS
A Python-based open-source service for generating synthetic data while preserving data utility.
Features
Core Features
- Synthetic Data Generation:
  - Statistical data generation
  - Pattern-based generation
  - Data distribution preservation
  - Synthetic data from various sources
Optional Features
- REST API Service:
  - Generate synthetic data via API
  - Support for CSV file uploads
  - Support for unstructured files (images, PDFs, documents)
  - JSON and CSV output formats
- Project Management:
  - Create and manage projects
  - Unique project names
  - Project-based transaction tracking
  - Project descriptions
- Database Integration:
  - PostgreSQL backend
  - Transaction logging
  - Audit trail for all operations
Core Module Usage
Installation
Install the package using pip:
pip install syda
Basic Usage
Synthetic Data Generation
Basic Usage
You can generate synthetic data and write it directly to a CSV file using the output_path argument:
from syda.structured import SyntheticDataGenerator
generator = SyntheticDataGenerator()
schema = {
    'patient_id': 'number',
    'diagnosis_code': 'icd10_code',
    'email': 'email',
    'visit_date': 'date',
    'notes': 'text'
}
prompt = "Generate realistic synthetic patient records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes."
output_path = 'synthetic_output.csv'
generated_file = generator.generate_data(
    schema=schema,
    prompt=prompt,
    sample_size=15,
    output_path=output_path
)
print(f"Synthetic data written to: {generated_file}")
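After a run like the one above, it can be useful to confirm that the generated file has a column for every schema field. The following is a minimal sketch using only the standard library. `missing_columns` is a hypothetical helper, not part of syda, and it assumes the CSV header row uses the schema keys as column names. A small hand-written file stands in for real output here.

```python
import csv
import os
import tempfile


def missing_columns(csv_path, schema):
    """Return schema fields that have no matching column in the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [field for field in schema if field not in header]


# Stand-in data: a file that deliberately lacks the 'visit_date' column.
schema = {"patient_id": "number", "email": "email", "visit_date": "date"}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "synthetic_output.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["patient_id", "email"])
        writer.writerow([1, "a@example.com"])
    missing = missing_columns(sample_path, schema)

print(missing)
```

An empty list would mean every schema field appears in the header.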
SQLAlchemy Model Integration with Referential Integrity
Alternatively, you can pass SQLAlchemy model classes directly as the schema and maintain referential integrity between related models:
from sqlalchemy import Column, Integer, String, ForeignKey, Float, create_engine
from sqlalchemy.orm import declarative_base, relationship
import random
import pandas as pd
from syda.structured import SyntheticDataGenerator
Base = declarative_base()
# Define related models with foreign key relationships
class Department(Base):
    __tablename__ = 'departments'
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    location = Column(String)
    budget = Column(Float)
    # One-to-many: one department has many employees
    employees = relationship("Employee", back_populates="department")

    def __repr__(self):
        return f"<Department(id={self.id}, name='{self.name}', location='{self.location}')>"

class Employee(Base):
    __tablename__ = 'employees'
    id = Column(Integer, primary_key=True)
    first_name = Column(String, nullable=False)
    last_name = Column(String, nullable=False)
    email = Column(String, nullable=False)
    department_id = Column(Integer, ForeignKey('departments.id'))
    role = Column(String)
    salary = Column(Float)
    # Many-to-one: many employees belong to one department
    department = relationship("Department", back_populates="employees")

    def __repr__(self):
        return f"<Employee(id={self.id}, name='{self.first_name} {self.last_name}', role='{self.role}')>"

# Step 1: Generate departments first
generator = SyntheticDataGenerator()
departments_df = generator.generate_data(
    schema=Department,
    prompt="""
    Generate realistic department data for a technology company.
    Departments should have names like Engineering, Marketing, Sales, HR, etc.
    Locations should be major cities around the world.
    Budget should be a realistic amount for each department, in USD.
    """,
    sample_size=5,
    output_path='departments.csv'
)
# Step 2: Load the generated departments so employees can reference valid IDs
departments_df = pd.read_csv('departments.csv')

# Custom generator for foreign key columns
def department_id_fk_generator(row, col_name):
    # Sample from the existing department IDs
    return random.choice(departments_df['id'].tolist())

# Register the custom foreign key generator
generator.register_generator('foreign_key', department_id_fk_generator)
# Step 3: Generate employee data with valid department_id references
employees_file = generator.generate_data(
    schema=Employee,
    prompt="""
    Generate realistic employee data for a technology company.
    Employees should have common first and last names.
    Emails should follow the pattern firstname.lastname@company.com.
    Roles should include software engineers, product managers, designers, etc.
    Salaries should be realistic amounts in USD.
    """,
    sample_size=20,
    output_path='employees.csv'
)
# generate_data returns the file path when output_path is set, so load the CSV before verifying
employees_df = pd.read_csv(employees_file)
# Verify referential integrity
valid_dept_ids = set(departments_df['id'].tolist())
employee_dept_ids = set(employees_df['department_id'].tolist())
print("Verifying referential integrity...")
if employee_dept_ids.issubset(valid_dept_ids):
    print("✅ All employee department_id values reference valid departments")
else:
    invalid_ids = employee_dept_ids - valid_dept_ids
    print(f"❌ Found {len(invalid_ids)} invalid department_id references: {invalid_ids}")
Key features:
- Foreign keys are automatically detected and assigned the type 'foreign_key'
- When handling related models, generate parent records first (departments)
- Register a custom generator for foreign keys that samples from existing valid IDs
- This approach maintains referential integrity across your generated data
- Works with all SQLAlchemy column types and relationships
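The parent-first pattern described above can be illustrated without syda or a database. This is a minimal sketch of the idea only: `assign_foreign_keys` and `orphaned_keys` are hypothetical helpers, and the department IDs and employee rows are made-up stand-ins for generated data.

```python
import random


def assign_foreign_keys(child_rows, parent_ids, fk_column, seed=None):
    """Fill fk_column in each child row with an ID sampled from parent_ids."""
    if not parent_ids:
        raise ValueError("parent_ids must not be empty")
    rng = random.Random(seed)
    return [{**row, fk_column: rng.choice(parent_ids)} for row in child_rows]


def orphaned_keys(child_rows, parent_ids, fk_column):
    """Return child foreign key values that match no parent ID."""
    return {row[fk_column] for row in child_rows} - set(parent_ids)


# Parents first, then children that reference them.
department_ids = [1, 2, 3, 4, 5]
employees = [{"id": n, "first_name": f"Employee{n}"} for n in range(1, 21)]
employees = assign_foreign_keys(employees, department_ids, "department_id", seed=0)

print(orphaned_keys(employees, department_ids, "department_id"))  # empty set
```

Because every child value is drawn from the parent ID list, the orphan check can only fail if the parent data changes after the children are generated.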
Output Options
- If output_path is provided (must end with .csv or .json), the file will be written and the method returns the file path.
- If not, a pandas DataFrame is returned.
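If you want to catch an unsupported extension before calling the generator, a small pre-check along these lines may help. This is a sketch based only on the rule stated above. `check_output_path` is a hypothetical helper, and how syda itself reacts to other extensions is not documented here.

```python
from pathlib import Path

# Extensions taken from the rule stated above.
SUPPORTED_SUFFIXES = {".csv", ".json"}


def check_output_path(output_path):
    """Return the path unchanged if its extension is supported, else raise."""
    suffix = Path(output_path).suffix
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(
            f"output_path must end with .csv or .json, got {output_path!r}"
        )
    return output_path


print(check_output_path("synthetic_output.csv"))
```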
See the examples/ directory for complete examples, including:
- Basic schema-based data generation
- SQLAlchemy model integration
- Foreign key relationship handling
Optional REST API Service
Prerequisites
- Python 3.8+
- PostgreSQL database
- pip (Python package manager)
Installation
- Clone the repository:
  git clone <repository-url>
  cd syda-service
- Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install dependencies:
  pip install -r requirements.txt
Database Setup
- Create a PostgreSQL database for the application.
- Configure the database connection by creating a .env file in the root directory with the following variable:
  DATABASE_URL=postgresql://username:password@localhost:5432/your_database_name
- Run Alembic migrations to set up the database tables:
  # Navigate to the service directory
  cd service
  # Create initial migration (only needed once)
  alembic revision --autogenerate -m "Initial migration"
  # Apply migrations
  alembic upgrade head
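A malformed DATABASE_URL is a common cause of failed migrations, so it can help to inspect the value before running Alembic. The sketch below uses only the standard library. `describe_database_url` is a hypothetical helper, not part of the service, and it falls back to 5432 (the PostgreSQL default port) when the URL omits one.

```python
from urllib.parse import urlsplit


def describe_database_url(url):
    """Split a PostgreSQL URL into its parts, leaving the password out."""
    parts = urlsplit(url)
    if not parts.scheme.startswith("postgres"):
        raise ValueError(f"expected a PostgreSQL URL, got scheme {parts.scheme!r}")
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("DATABASE_URL must include a host and a database name")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,
        "user": parts.username,
        "database": database,
    }


# The placeholder value from the .env example above.
example = "postgresql://username:password@localhost:5432/your_database_name"
print(describe_database_url(example))
```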
Running the API Service
- Start the FastAPI server:
  uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
- The API will be available at http://localhost:8002 (the port passed to uvicorn above).
- Access the interactive API documentation at:
  - Swagger UI: http://localhost:8002/docs
  - ReDoc: http://localhost:8002/redoc
API Endpoints
Project Management
POST /projects
Create a new project.
Example request:
{
  "name": "my_project",
  "description": "My synthetic data project"
}
GET /projects
Get all projects
GET /projects/{project_name}/transactions
Get transactions for a specific project
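The project endpoints above can be exercised from Python with the standard library alone. This sketch only builds the request and URL without sending anything. The helper names are hypothetical, and the base URL assumes a local server on the port given in the uvicorn command; response formats are not documented here, so none are shown.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def create_project_request(base_url, name, description):
    """Build (but do not send) a POST /projects request with a JSON body."""
    body = json.dumps({"name": name, "description": description}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/projects",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transactions_url(base_url, project_name):
    """Build the GET /projects/{project_name}/transactions URL."""
    # Percent-encode the name so spaces or slashes cannot alter the path.
    encoded = quote(project_name, safe="")
    return f"{base_url.rstrip('/')}/projects/{encoded}/transactions"


# Assumption: the server runs locally on the port used in the uvicorn command.
base_url = "http://localhost:8002"
request = create_project_request(base_url, "my_project", "My synthetic data project")
print(request.get_method(), request.full_url)
print(transactions_url(base_url, "my_project"))
```

With the server running, passing `request` to `urllib.request.urlopen` would send it.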
Data Operations (Project-based)
Example request:
{
  "project_name": "my_project",
  "data": {
    "email": ["test@example.com"],
    "phone": ["123-456-7890"]
  }
}
POST /generate
Generate synthetic data for a project
POST /generate/test-data
Generate test data for a project
POST /upload/generate
Upload and generate synthetic data for a project
POST /upload/unstructured
Process unstructured file (image, PDF, Word, Excel, text)
POST /upload/unstructured/generate
Generate synthetic data from unstructured file (Excel only)
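A request body for the data operations can be assembled and checked before it is sent. The sketch below mirrors the example request shown earlier in this section. `build_generate_payload` is a hypothetical helper, and the reading of `data` as a mapping from column name to a list of sample values is an assumption drawn from that example, not a documented schema.

```python
import json


def build_generate_payload(project_name, data):
    """Return a JSON string shaped like the example request in this section."""
    if not project_name:
        raise ValueError("project_name is required")
    for column, values in data.items():
        # Assumption: each column maps to a list of sample values.
        if not isinstance(values, list):
            raise TypeError(f"values for {column!r} must be a list")
    return json.dumps({"project_name": project_name, "data": data})


payload = build_generate_payload(
    "my_project",
    {"email": ["test@example.com"], "phone": ["123-456-7890"]},
)
print(payload)
```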
Contributing
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for the full license text.