A Python library for synthetic data generation

Synthetic Data Generation Service -- WORK IN PROGRESS

A Python-based open-source service for generating synthetic data while preserving data utility.

Features

Core Features

  • Synthetic Data Generation:
    • Statistical data generation
    • Pattern-based generation
    • Data distribution preservation
    • Synthetic data from various sources

Optional Features

  • REST API Service:

    • Generate synthetic data via API
    • Support for CSV file uploads
    • Support for unstructured files (images, PDFs, documents)
    • JSON and CSV output formats
  • Project Management:

    • Create and manage projects
    • Unique project names
    • Project-based transaction tracking
    • Project descriptions
  • Database Integration:

    • PostgreSQL backend
    • Transaction logging
    • Audit trail for all operations

Core Module Usage

Installation

Install the package using pip:

pip install syda

Basic Usage

Synthetic Data Generation

You can generate synthetic data and write it directly to a CSV file using the output_path argument:

from syda.structured import SyntheticDataGenerator

generator = SyntheticDataGenerator()
schema = {
    'patient_id': 'number',
    'diagnosis_code': 'icd10_code',
    'email': 'email',
    'visit_date': 'date',
    'notes': 'text'
}
prompt = "Generate realistic synthetic patient records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes."

output_path = 'synthetic_output.csv'
generated_file = generator.generate_data(
    schema=schema,
    prompt=prompt,
    sample_size=15,
    output_path=output_path
)
print(f"Synthetic data written to: {generated_file}")

SQLAlchemy Model Integration with Referential Integrity

Alternatively, you can use SQLAlchemy model classes directly as schema input, including maintaining referential integrity between related models:

from sqlalchemy import Column, Integer, String, ForeignKey, Float, create_engine
from sqlalchemy.orm import declarative_base, relationship
import random
import pandas as pd
from syda.structured import SyntheticDataGenerator

Base = declarative_base()

# Define related models with foreign key relationships
class Department(Base):
    __tablename__ = 'departments'
    
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    location = Column(String)
    budget = Column(Float)
    
    # One-to-many: one department has many employees
    employees = relationship("Employee", back_populates="department")
    
    def __repr__(self):
        return f"<Department(id={self.id}, name='{self.name}', location='{self.location}')>"


class Employee(Base):
    __tablename__ = 'employees'
    
    id = Column(Integer, primary_key=True)
    first_name = Column(String, nullable=False)
    last_name = Column(String, nullable=False)
    email = Column(String, nullable=False)
    department_id = Column(Integer, ForeignKey('departments.id'))
    role = Column(String)
    salary = Column(Float)
    
    # Many-to-one: many employees belong to one department
    department = relationship("Department", back_populates="employees")
    
    def __repr__(self):
        return f"<Employee(id={self.id}, name='{self.first_name} {self.last_name}', role='{self.role}')>"

# Step 1: Generate departments first
generator = SyntheticDataGenerator()

# With output_path set, generate_data writes the CSV and returns its path
departments_file = generator.generate_data(
    schema=Department,
    prompt="""
    Generate realistic department data for a technology company.
    Departments should have names like Engineering, Marketing, Sales, HR, etc.
    Locations should be major cities around the world.
    Budget should be a realistic amount for each department, in USD.
    """,
    sample_size=5,
    output_path='departments.csv'
)

# Step 2: Create a custom foreign key generator for employees that references valid departments
departments_df = pd.read_csv('departments.csv')

# Register a custom generator for foreign key columns
def department_id_fk_generator(row, col_name):
    # Sample from the existing department IDs
    return random.choice(departments_df['id'].tolist())

# Register the custom foreign key generator
generator.register_generator('foreign_key', department_id_fk_generator)

# Step 3: Generate employee data with valid department_id references
# Because output_path is set, generate_data returns the file path, not a DataFrame
employees_file = generator.generate_data(
    schema=Employee,
    prompt="""
    Generate realistic employee data for a technology company.
    Employees should have common first and last names.
    Emails should follow the pattern firstname.lastname@company.com.
    Roles should include software engineers, product managers, designers, etc.
    Salaries should be realistic amounts in USD.
    """,
    sample_size=20,
    output_path='employees.csv'
)

# Load the generated employees so the foreign keys can be checked
employees_df = pd.read_csv(employees_file)

# Verify referential integrity
valid_dept_ids = set(departments_df['id'].tolist())
employee_dept_ids = set(employees_df['department_id'].tolist())

print("Verifying referential integrity...")
if employee_dept_ids.issubset(valid_dept_ids):
    print("✅ All employee department_id values reference valid departments")
else:
    invalid_ids = employee_dept_ids - valid_dept_ids
    print(f"❌ Found {len(invalid_ids)} invalid department_id references: {invalid_ids}")

Key features:

  • Foreign keys are automatically detected and assigned the type 'foreign_key'
  • When handling related models, generate parent records first (departments)
  • Register a custom generator for foreign keys that samples from existing valid IDs
  • This approach maintains referential integrity across your generated data
  • Works with all SQLAlchemy column types and relationships
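
The parent-first pattern described above can be sketched without syda or a database. This is a minimal illustration using plain dictionaries, not the library's implementation; the helper names assign_foreign_keys and find_orphans are hypothetical and exist only for this example.

```python
import random


def assign_foreign_keys(child_rows, parent_ids, fk_column, seed=None):
    """Return copies of child_rows with fk_column sampled from parent_ids."""
    choices = list(parent_ids)
    if not choices:
        raise ValueError("parent_ids must not be empty")
    rng = random.Random(seed)
    return [{**row, fk_column: rng.choice(choices)} for row in child_rows]


def find_orphans(child_rows, parent_ids, fk_column):
    """Return the foreign key values that do not match any parent ID."""
    return {row[fk_column] for row in child_rows} - set(parent_ids)


# Parents are created first, so their IDs are known before children exist
departments = [{"id": 1, "name": "Engineering"}, {"id": 2, "name": "Sales"}]
dept_ids = [dept["id"] for dept in departments]

# Children then draw department_id only from the known parent IDs
employees = [{"id": i, "first_name": f"Employee{i}"} for i in range(1, 6)]
employees = assign_foreign_keys(employees, dept_ids, "department_id", seed=0)

orphans = find_orphans(employees, dept_ids, "department_id")
print(orphans)  # set(), because every key was sampled from dept_ids
```

Sampling only from IDs that already exist is what makes the subset check pass by construction; the check is still worth running on real output, since a generator could be misregistered or skipped.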

Output Options

  • If output_path is provided (it must end with .csv or .json), the data is written to that file and the method returns the file path.
  • If not, a pandas DataFrame is returned.
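
Because the return value is a file path whenever output_path is set, downstream code has to load the file before it can inspect rows. The sketch below does that with the standard library only. The helper load_generated_file is hypothetical, and the JSON branch returns whatever json.load produces, since the layout of the JSON output is not documented here.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_generated_file(path):
    """Load a generated .csv or .json file, dispatching on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    raise ValueError(f"Unsupported output extension: {suffix!r}")


# Demo with a stand-in file; in practice pass the path returned by generate_data
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "synthetic_output.csv"
    sample.write_text("patient_id,email\n1,a@example.com\n", encoding="utf-8")
    rows = load_generated_file(sample)

print(rows)
```

Note that csv.DictReader yields every value as a string, so numeric columns need converting if they are compared against integers.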

See examples/ directory for complete examples, including:

  • Basic schema-based data generation
  • SQLAlchemy model integration
  • Foreign key relationship handling

Optional REST API Service

Prerequisites

  • Python 3.8+
  • PostgreSQL database
  • pip (Python package manager)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd syda-service
    
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
    
  3. Install dependencies:

    pip install -r requirements.txt
    

Database Setup

  1. Create a PostgreSQL database for the application.

  2. Configure the database connection by creating a .env file in the root directory with the following variables:

    DATABASE_URL=postgresql://username:password@localhost:5432/your_database_name
    
  3. Run Alembic migrations to set up the database tables:

    # Navigate to the service directory
    cd service
    
    # Create initial migration (only needed once)
    alembic revision --autogenerate -m "Initial migration"
    
    # Apply migrations
    alembic upgrade head
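
Before running migrations it can help to confirm that DATABASE_URL has the expected shape. The following is a small standard-library sketch, not part of the service; the function names are hypothetical, and the default port of 5432 is the usual PostgreSQL default rather than something the service is known to assume.

```python
import os
from urllib.parse import urlsplit


def describe_database_url(url):
    """Split a PostgreSQL URL into its parts without exposing the password."""
    parts = urlsplit(url)
    # Accept driver-qualified schemes such as postgresql+psycopg2 as well
    if parts.scheme.split("+")[0] != "postgresql":
        raise ValueError(f"Expected a postgresql:// URL, got scheme {parts.scheme!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,
        "database": parts.path.lstrip("/"),
        "has_password": parts.password is not None,
    }


def load_database_settings(env=os.environ):
    """Read DATABASE_URL from the given environment mapping and describe it."""
    url = env.get("DATABASE_URL")
    if not url:
        raise KeyError("DATABASE_URL is not set")
    return describe_database_url(url)


# Demo using the placeholder value from the .env example above
example_env = {
    "DATABASE_URL": "postgresql://username:password@localhost:5432/your_database_name"
}
settings = load_database_settings(example_env)
print(settings)
```

Calling load_database_settings() with no argument reads the real process environment, which is only populated from .env if something has loaded that file first.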
    

Running the API Service

  1. Start the FastAPI server:

    uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
    
  2. The API will be available at http://localhost:8002 (the port passed to uvicorn above)

  3. Access the interactive API documentation at:

    • Swagger UI: http://localhost:8002/docs
    • ReDoc: http://localhost:8002/redoc

API Endpoints

Project Management

POST /projects

Create a new project.

Example request:

{
    "name": "my_project",
    "description": "My synthetic data project"
}

GET /projects

Get all projects

GET /projects/{project_name}/transactions

Get transactions for a specific project
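
The project endpoints can be exercised from Python with the standard library. The sketch below only builds the requests and does not send them, so it runs without a server. BASE_URL is an assumption and should match the host and port uvicorn was started with; the helper names are hypothetical, and the response format is not shown because it is not documented here.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: adjust to the host and port the API service is running on
BASE_URL = "http://localhost:8002"


def create_project_request(name, description, base_url=BASE_URL):
    """Build a POST /projects request carrying the JSON body shown above."""
    body = json.dumps({"name": name, "description": description}).encode("utf-8")
    return Request(
        f"{base_url}/projects",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transactions_url(project_name, base_url=BASE_URL):
    """Build the GET /projects/{project_name}/transactions URL."""
    # Percent-encode the name so spaces or slashes cannot break the path
    return f"{base_url}/projects/{quote(project_name, safe='')}/transactions"


req = create_project_request("my_project", "My synthetic data project")
print(req.get_method(), req.full_url)
print(transactions_url("my_project"))

# To send the request against a running server:
#   from urllib.request import urlopen
#   with urlopen(req) as response:
#       print(response.status, response.read().decode("utf-8"))
```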

Data Operations (Project-based)

Example request:

{
    "project_name": "my_project",
    "data": {
        "email": ["test@example.com"],
        "phone": ["123-456-7890"]
    }
}

POST /generate

Generate synthetic data for a project

POST /generate/test-data

Generate test data for a project

POST /upload/generate

Upload and generate synthetic data for a project

POST /upload/unstructured

Process unstructured file (image, PDF, Word, Excel, text)

POST /upload/unstructured/generate

Generate synthetic data from unstructured file (Excel only)
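
The example request in this section stores data column by column, with each field mapped to a list of values. When the source records are row-oriented, a small helper can reshape them. This is a sketch under the assumption that the data endpoints accept the shape shown in that example; build_generate_payload is a hypothetical name, not part of the service.

```python
import json


def build_generate_payload(project_name, rows):
    """Convert row dicts into the column-oriented request body shown above."""
    data = {}
    for row in rows:
        for column, value in row.items():
            # Each column collects its values in row order
            data.setdefault(column, []).append(value)
    return {"project_name": project_name, "data": data}


payload = build_generate_payload(
    "my_project",
    [{"email": "test@example.com", "phone": "123-456-7890"}],
)
print(json.dumps(payload, indent=4))
```

Rows with missing keys would produce columns of unequal length, so records should be normalized to a common set of fields before reshaping.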

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for the full license text.
