A Python package to generate fake tabular data. Get the data as a pandas DataFrame or export it to Parquet, Delta Lake, CSV, JSON, Excel, or SQL.

Table Faker

tablefaker is a versatile Python package that generates realistic yet synthetic table data for a range of applications. Whether you need test data for software development or sample data for another purpose, this tool simplifies the process with an intuitive schema definition in YAML format.

Key Features

  • Schema Definition: Define your table schema using a simple YAML file, specifying column names, data types, fake data generation logic, and relationships.
  • Faker & Randomization: Utilize the Faker library and random data generation to create authentic-looking synthetic data.
  • Multiple Table Support: Create multiple tables with different schemas and data generation logic in a single YAML file. Define relationships between tables for foreign keys and primary keys.
  • Multiple Output Formats:
    • Pandas DataFrame
    • SQL insert script
    • CSV
    • Parquet
    • JSON
    • Excel
    • Delta Lake

Installation

pip install tablefaker

Example Yaml File

version: 1
config:
  locale: en_US
  python_import:
    - dateutil
tables:
  - table_name: person
    row_count: 10
    start_row_id: 101                               # you can set row_id starting point
    export_file_count: 3                           # you can set export file count (takes precedence over export_file_row_count)
    columns:
      - column_name: id
        data: row_id                                # row_id is a built-in function
        is_primary_key: true                        # define primary key to use as a foreign key
      - column_name: first_name
        data: fake.first_name()                     # faker function
        type: string
      - column_name: last_name
        data: fake.last_name()
        type: string
      - column_name: full_name
        data: first_name + " " + last_name           # use a column to generate a new column
        is_primary_key: true
      - column_name: age
        data: fake.random_int(18, 90)
        type: int32
      - column_name: street_address
        data: fake.street_address()
      - column_name: city
        data: fake.city()
      - column_name: state_abbr
        data: fake.state_abbr()
      - column_name: postcode
        data: fake.postcode()
      - column_name: gender
        data: random.choice(["male", "female"])     # random.choice is a built-in function
        null_percentage: 0.5                        # null_percentage is a built-in keyword
      - column_name: left_handed
        data: fake.pybool()
      - column_name: today
        data: datetime.today().strftime('%Y-%m-%d') # datetime package is available by default
      - column_name: easter_date
        data: dateutil.easter.easter(2025).strftime('%Y-%m-%d') # python package you need to import in python_import
      - column_name: discount_eligibility           # custom python function
        data: |
          if age < 25 or age > 60:
            return True
          else:
            return False
  - table_name: employee
    row_count: 10
    export_file_row_count: 60                      # you can set export file row count
    columns:
      - column_name: id
        data: row_id
      - column_name: person_id
        data: foreign_key("person", "id")          # get primary key from another table
      - column_name: full_name
        data: foreign_key("person", "full_name")
      - column_name: hire_date
        data: fake.date_between()
        type: string
      - column_name: title
        data: random.choice(["engineer", "senior engineer", "principal engineer", "director", "senior director", "manager", "vice president", "president"])
      - column_name: salary
        data: None #NULL
        type: float
      - column_name: height
        data: r"170 cm" #string
      - column_name: weight
        data: 150 #number
      - column_name: school
        data: fake.school_name() # custom provider
      - column_name: level
        data: get_level() # custom function

full yml example

Data Generation

You can define your dummy data generation logic in a Python function. The Faker, random and datetime packages are pre-imported and ready to use.

  • Use the Faker package for realistic data, e.g., fake.first_name() or fake.random_int(1, 10).
  • Use the random package for basic randomness, e.g., random.choice(["male", "female"]).
  • Use the datetime package for current date and time, e.g., datetime.today().strftime('%Y-%m-%d').
  • You can use a column to generate a new column, e.g., first_name + " " + last_name.
  • Use is_primary_key to define a primary key, e.g., is_primary_key: true.
  • Use foreign_key to get a primary key from another table, e.g., foreign_key("person", "id"). If you use multiple foreign key functions, you will get the primary key values from the same row.
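
The same-row behaviour of multiple foreign_key calls can be pictured with a small standalone sketch. This is not tablefaker's implementation: pick_parent_row and the sample person rows are hypothetical, and only the standard library is used.

```python
import random

# Hypothetical parent rows standing in for a generated "person" table.
person_rows = [
    {"id": 101, "full_name": "John Smith"},
    {"id": 102, "full_name": "Troy Johnson"},
    {"id": 103, "full_name": "Joshua Hill"},
]


def pick_parent_row(rows, rng):
    """Pick one parent row; every foreign key in the child row reads from it."""
    return rng.choice(rows)


rng = random.Random(42)  # seeded only so the sketch is repeatable
employee_rows = []
for employee_id in range(1, 6):
    parent = pick_parent_row(person_rows, rng)
    employee_rows.append({
        "id": employee_id,
        "person_id": parent["id"],         # plays the role of foreign_key("person", "id")
        "full_name": parent["full_name"],  # plays the role of foreign_key("person", "full_name")
    })

for row in employee_rows:
    print(row)
```

Because both values come from the same parent row, person_id and full_name in each employee row always describe the same person.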

You can write your logic on a single line or across multiple lines. A built-in function, row_id, provides a unique integer for each row. You can set the row_id starting point with the start_row_id keyword.
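
As a rough illustration of row_id and start_row_id, the sketch below produces one integer per row. The helper make_row_ids is hypothetical, and both the consecutive numbering and the default start of 1 are assumptions taken from the sample output, where ids begin at 1.

```python
from itertools import count


def make_row_ids(row_count, start_row_id=1):
    """Return one unique integer per row, counting up from start_row_id."""
    counter = count(start_row_id)
    return [next(counter) for _ in range(row_count)]


print(make_row_ids(3))                    # [1, 2, 3]
print(make_row_ids(3, start_row_id=101))  # [101, 102, 103]
```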

In addition, you have control over how your data is exported:

  • export_file_count: This keyword lets you specify the total number of output files to generate. It's especially useful when you need to split a large dataset into multiple, more manageable files.
  • export_file_row_count: Use this keyword to set the maximum number of rows that each exported file should contain. This ensures that each file remains within a desired size limit and is easier to handle.
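
The interplay of the two export keywords can be sketched as follows. The function plan_export_files is hypothetical and the even split is an assumption; the only behaviour taken from this document is that export_file_count takes precedence when both keywords are set.

```python
def plan_export_files(row_count, export_file_count=None, export_file_row_count=None):
    """Return the number of rows each output file would hold.

    Assumptions of this sketch: export_file_count wins when both keywords
    are given, and rows are divided as evenly as possible.
    """
    if export_file_count:
        base, extra = divmod(row_count, export_file_count)
        return [base + (1 if i < extra else 0) for i in range(export_file_count)]
    if export_file_row_count:
        full, rest = divmod(row_count, export_file_row_count)
        return [export_file_row_count] * full + ([rest] if rest else [])
    return [row_count]


print(plan_export_files(10, export_file_count=3))        # [4, 3, 3]
print(plan_export_files(150, export_file_row_count=60))  # [60, 60, 30]
print(plan_export_files(10, export_file_row_count=60))   # [10]
```

With the example schema, the person table (10 rows, export_file_count: 3) would be split into three files, while the employee table (10 rows, export_file_row_count: 60) fits in a single file.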

Columns will automatically have the best-fitting data type. However, if you'd like to specify a data type, use the type keyword. You can assign data types using NumPy dtypes, Pandas Extension Dtypes, or Python native types.

Here are some examples:

fake.first_name()
fake.random_int(1, 10)
random.choice(["male", "female"])
datetime.today()
911 # number
r"170 cm" # string
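
One way to picture how single-line and multi-line definitions can both work is the sketch below. It is an illustration only and tablefaker's internals may differ: evaluate_column is a hypothetical helper that treats a single line as a Python expression and wraps several lines into a function body, with earlier columns available as variables.

```python
import random
import textwrap
from datetime import datetime


def evaluate_column(source, row):
    """Evaluate a column definition against the columns already generated for this row."""
    namespace = {"random": random, "datetime": datetime, **row}
    if "\n" not in source.strip():
        # A single-line definition is treated as a plain Python expression.
        return eval(source, namespace)
    # A multi-line definition becomes the body of a throwaway function.
    body = textwrap.indent(textwrap.dedent(source), "    ")
    exec(f"def _column():\n{body}", namespace)
    return namespace["_column"]()


row = {"first_name": "John", "last_name": "Smith", "age": 35}
row["full_name"] = evaluate_column('first_name + " " + last_name', row)
row["height"] = evaluate_column('r"170 cm"', row)
row["discount_eligibility"] = evaluate_column(
    """
    if age < 25 or age > 60:
        return True
    else:
        return False
    """,
    row,
)
print(row)
```

Note that eval and exec run arbitrary code, so a schema evaluated this way should come from a trusted source.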

Example Code

import tablefaker

# exports to current folder in csv format
tablefaker.to_csv("test_table.yaml")

# exports SQL INSERT scripts that you can run against your database
tablefaker.to_sql("test_table.yaml")

# exports all tables in json format
tablefaker.to_json("test_table.yaml", "./target_folder")

# exports all tables in parquet format
tablefaker.to_parquet("test_table.yaml", "./target_folder")

# exports all tables in deltalake format
tablefaker.to_deltalake("test_table.yaml", "./target_folder")

# export single table to the provided folder
tablefaker.to_deltalake("test_table.yaml", "./target_folder/person/", table_name="person")

# exports only the first table in excel format
tablefaker.to_excel("test_table.yaml", "./target_folder/target_file.xlsx")

# get as pandas dataframes
df_dict = tablefaker.to_pandas("test_table.yaml")
person_df = df_dict["person"]
print(person_df.head(5))

Sample CLI Command

You can use tablefaker in your terminal for ad hoc needs, or in a shell script to automate fake data generation.
Faker custom providers and custom functions are not supported in the CLI.

# exports to current folder in csv format
tablefaker --config test_table.yaml

# exports as SQL INSERT script files
tablefaker --config test_table.yaml --file_type sql

# exports to current folder in excel format
tablefaker --config test_table.yaml --file_type excel

# exports all tables in json format
tablefaker --config test_table.yaml --file_type json --target ./target_folder 

# exports only the first table
tablefaker --config test_table.yaml --file_type parquet --target ./target_folder/target_file.parquet

# exports to current folder in deltalake format
tablefaker --config test_table.yaml --file_type deltalake

Sample CSV Output

id,first_name,last_name,age,dob,salary,height,weight
1,John,Smith,35,1992-01-11,,170 cm,150
2,Charles,Shepherd,27,1987-01-02,,170 cm,150
3,Troy,Johnson,42,,,170 cm,150
4,Joshua,Hill,86,1985-07-11,,170 cm,150
5,Matthew,Johnson,31,1940-03-31,,170 cm,150
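
A file like this can be read back with the standard csv module. The sketch below embeds two of the sample rows as a string; treating an empty field as a null value is an assumption of the sketch, based on the salary column being defined as None in the schema.

```python
import csv
import io

sample = """id,first_name,last_name,age,dob,salary,height,weight
1,John,Smith,35,1992-01-11,,170 cm,150
2,Charles,Shepherd,27,1987-01-02,,170 cm,150
"""

rows = []
for record in csv.DictReader(io.StringIO(sample)):
    # Assumption: an empty field stands for a null value.
    rows.append({key: (value if value != "" else None) for key, value in record.items()})

print(rows[0]["salary"])  # None
print(rows[0]["height"])  # 170 cm
```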

Sample Sql Output

INSERT INTO employee
(id,person_id,hire_date,title,salary,height,weight,school,level)
VALUES
(1, 4, '2020-10-09', 'principal engineer', NULL, '170 cm', 150, 'ISLIP HIGH SCHOOL', 'level 2'),
(2, 9, '2002-12-20', 'principal engineer', NULL, '170 cm', 150, 'GUY-PERKINS HIGH SCHOOL', 'level 1'),
(3, 2, '1996-01-06', 'principal engineer', NULL, '170 cm', 150, 'SPRINGLAKE-EARTH ELEM/MIDDLE SCHOOL', 'level 3');
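
The shape of this output can be reproduced with a short sketch that renders Python values as SQL literals. The helpers sql_literal and insert_statement are hypothetical and are not tablefaker's code; the quoting is simplified and not dialect-aware, and the TRUE/FALSE rendering of booleans is an assumption.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (simplified, not dialect-aware)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"  # assumption: boolean keywords
    if isinstance(value, (int, float)):
        return str(value)
    # Strings: wrap in single quotes and double any embedded quote.
    return "'" + str(value).replace("'", "''") + "'"


def insert_statement(table, columns, rows):
    """Build one multi-row INSERT statement for the given rows."""
    values = ",\n".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table}\n({','.join(columns)})\nVALUES\n{values};"


columns = ["id", "person_id", "hire_date", "title", "salary",
           "height", "weight", "school", "level"]
rows = [
    (1, 4, "2020-10-09", "principal engineer", None, "170 cm", 150,
     "ISLIP HIGH SCHOOL", "level 2"),
    (2, 9, "2002-12-20", "principal engineer", None, "170 cm", 150,
     "GUY-PERKINS HIGH SCHOOL", "level 1"),
]
print(insert_statement("employee", columns, rows))
```

None becomes NULL, numbers are written bare, and strings are single-quoted, which matches the sample rows above.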

Custom Faker Providers

You can add and use custom / community faker providers with table faker.
Here is a list of these community providers.
https://faker.readthedocs.io/en/master/communityproviders.html#

version: 1
config:
  locale: en_US
tables:
  - table_name: employee
    row_count: 5
    columns:
      - column_name: id
        data: row_id
      - column_name: person_id
        data: fake.random_int(1, 10)
      - column_name: hire_date
        data: fake.date_between()
      - column_name: school
        data: fake.school_name()  # custom provider

import tablefaker

# import the custom faker provider
from faker_education import SchoolProvider

# provide the faker provider class to the tablefaker using fake_provider
# you can add a single provider or a list of providers
tablefaker.to_csv("test_table.yaml", "./target_folder", fake_provider=SchoolProvider)
# this works with all other to_ methods as well.

Custom Functions

With Table Faker, you have the flexibility to provide your own custom functions to generate column data. This advanced feature empowers developers to create custom fake data generation logic that can pull data from a database, API, file, or any other source as needed.
You can also supply multiple functions in a list, allowing for even more versatility.
The custom function you provide should return a single value, giving you full control over your synthetic data generation.

from tablefaker import tablefaker
from faker import Faker

fake = Faker()
def get_level():
    return f"level {fake.random_int(1, 5)}"

tablefaker.to_csv("test_table.yaml", "./target_folder", custom_function=get_level)

Add the get_level function to your YAML file:

version: 1
config:
  locale: en_US
tables:
  - table_name: employee
    row_count: 5
    columns:
      - column_name: id
        data: row_id
      - column_name: person_id
        data: fake.random_int(1, 10)
      - column_name: hire_date
        data: fake.date_between()
      - column_name: level
        data: get_level() # custom function
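
Since custom_function also accepts a list, several helpers can be supplied together. In the sketch below, get_department is a hypothetical second function, and random.randint stands in for Faker so that the block runs with the standard library alone. The tablefaker call is left as a comment because it needs the package and a YAML file that contains a column such as data: get_department().

```python
import random


def get_level():
    """Return a single value per call, as a custom function should."""
    return f"level {random.randint(1, 5)}"


def get_department():
    """Hypothetical second custom function."""
    return random.choice(["engineering", "finance", "sales"])


# Passing both functions as a list:
# tablefaker.to_csv("test_table.yaml", "./target_folder",
#                   custom_function=[get_level, get_department])

print(get_level(), get_department())
```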

Support & Donation

If you find Table Faker useful and would like to support its development, consider making a donation.

Additional Resources

Roadmap

TODO

  • Variables
  • Generate template yaml file from sample data
  • Use an AI service to generate data generation logic

Future Enhancements

  • PyArrow table support
  • Avro file support
  • Add target file name to YAML

Follow for Updates: LinkedIn
Author: Necati Arslan | Email
