Metrics for Synthetic Data Generation Projects
The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics to capture different aspects of the data, for example quality and privacy. It also includes reports that you can run to generate insights, visualize data and share with your team.
The SDMetrics library is model-agnostic, meaning you can use any synthetic data. The library does not need to know how you created the data.
Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
pip install sdmetrics
conda install -c conda-forge sdmetrics
For more information about using SDMetrics, visit the SDMetrics Documentation.
Get started with SDMetrics Reports using some demo data:
from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
Creating report: 100%|██████████| 4/4 [00:00<00:00, 5.22it/s]

Overall Quality Score: 82.84%

Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
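The overall score in the sample output above is consistent with a plain average of the two property scores. The snippet below is a minimal sketch of that arithmetic, assuming an unweighted mean; the variable names are illustrative and are not part of the SDMetrics API.

```python
# Property scores copied from the sample output (in percent).
property_scores = {
    "Column Shapes": 82.78,
    "Column Pair Trends": 82.9,
}

# Assumption: the overall score is the unweighted mean of the property scores.
overall_score = sum(property_scores.values()) / len(property_scores)

print(f"Overall Quality Score: {overall_score:.2f}%")  # prints 82.84%
```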
Once you generate the report, you can drill down on the details and visualize the results.
my_report.get_visualization(property_name='Column Pair Trends')
Save the report and share it with your team.
my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')
Want more metrics? You can also manually apply any of the metrics in this library to your data.
# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_column import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
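To make the idea behind this metric concrete, here is a minimal standard-library sketch of a boundary check: the fraction of synthetic values that fall inside the min/max range of the real values. The function name `boundary_adherence` and the choice to drop `None` values are assumptions made for illustration; this is not the library's implementation.

```python
from datetime import date


def boundary_adherence(real_values, synthetic_values):
    """Return the fraction of synthetic values inside the real [min, max] range."""
    # Assumption: missing values (None) are dropped before comparing.
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    if not real or not synthetic:
        raise ValueError("Both columns need at least one non-missing value.")

    low, high = min(real), max(real)
    inside = sum(1 for v in synthetic if low <= v <= high)
    return inside / len(synthetic)


real_dates = [date(2020, 1, 1), date(2020, 6, 15), date(2021, 3, 1)]
synthetic_dates = [
    date(2020, 2, 1),    # inside the real range
    date(2019, 12, 31),  # before the real minimum
    date(2021, 3, 1),    # equal to the real maximum, counted as inside
    date(2022, 1, 1),    # after the real maximum
]

score = boundary_adherence(real_dates, synthetic_dates)
print(score)  # 0.5
```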
# calculate whether the synthetic data is new or whether it's an exact copy of the real data
from sdmetrics.single_table import NewRowSynthesis

NewRowSynthesis.compute(
    real_data,
    synthetic_data,
    metadata
)
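As a rough illustration of what "new versus exact copy" means, the sketch below counts the synthetic rows that do not exactly match any real row. It assumes rows are plain tuples and uses exact equality only; the library metric also takes metadata and may treat some column types differently. The helper name `new_row_fraction` is hypothetical.

```python
def new_row_fraction(real_rows, synthetic_rows):
    """Return the fraction of synthetic rows that match no real row exactly."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty.")

    # A set of tuples gives constant-time exact-match lookups.
    real_set = {tuple(row) for row in real_rows}
    new_rows = sum(1 for row in synthetic_rows if tuple(row) not in real_set)
    return new_rows / len(synthetic_rows)


real_rows = [("alice", 34, "NY"), ("bob", 41, "CA")]
synthetic_rows = [
    ("alice", 34, "NY"),  # exact copy of a real row
    ("carol", 29, "TX"),  # new
    ("bob", 40, "CA"),    # differs in one field, so it counts as new here
    ("dan", 51, "WA"),    # new
]

new_fraction = new_row_fraction(real_rows, synthetic_rows)
print(new_fraction)  # 0.75
```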
To learn more about the reports and metrics, visit the SDMetrics Documentation.
The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:
- 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
- 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
- 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.
Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
v0.13.0 - 2023-12-04
This release makes significant improvements to the Diagnostic Reports! The report now runs a diagnostic to calculate scores for three basic but important properties of your data: data validity, data structure and, in the multi table case, relationship validity. Data validity checks that the columns of your data are valid (e.g. correct range or values). Data structure makes sure the synthetic data has the correct columns. Relationship validity checks to make sure key references are correct and the cardinality is within ranges seen in the real data. These changes are meant to make the DiagnosticReport a quick way for you to see if there are any major problems with your synthetic data.
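The two relationship validity checks described above can be sketched with the standard library. This is an illustrative approximation, not the library's code: the function names are hypothetical, keys are assumed to be simple hashable values, and each score is assumed to be a plain fraction.

```python
from collections import Counter


def referential_integrity(parent_keys, child_foreign_keys):
    """Return the fraction of child foreign keys that reference an existing parent key."""
    if not child_foreign_keys:
        raise ValueError("child_foreign_keys must not be empty.")
    parents = set(parent_keys)
    valid = sum(1 for fk in child_foreign_keys if fk in parents)
    return valid / len(child_foreign_keys)


def cardinality_in_bounds(real_parents, real_fks, synth_parents, synth_fks):
    """Return the fraction of synthetic parents whose child count lies in the real min/max range."""
    real_counts = Counter(real_fks)
    real_cards = [real_counts.get(key, 0) for key in real_parents]
    low, high = min(real_cards), max(real_cards)

    synth_counts = Counter(synth_fks)
    synth_cards = [synth_counts.get(key, 0) for key in synth_parents]
    inside = sum(1 for count in synth_cards if low <= count <= high)
    return inside / len(synth_cards)


# Real data: every parent has 1 or 2 children.
real_parents, real_fks = [1, 2, 3], [1, 1, 2, 3, 3]
# Synthetic data: parent 10 has 3 children, and key 99 has no parent.
synth_parents, synth_fks = [10, 11, 12], [10, 10, 10, 11, 12, 99]

integrity = referential_integrity(synth_parents, synth_fks)
cardinality = cardinality_in_bounds(real_parents, real_fks, synth_parents, synth_fks)
print(round(integrity, 3), round(cardinality, 3))
```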
Additionally, some general improvements were made and bugs were resolved. The SVCDetection metrics were fixed to only use boolean, categorical, datetime and numeric columns in their calculations. A bug that prevented visualizations from displaying on Jupyter notebooks was patched. The cardinality property in the multi table QualityReport can now handle multiple foreign keys to the same parent. Finally, a new visualization was added for sequential/timeseries data.
- Detection metrics should only use statistically modeled columns (filter out the rest) - Issue #286 by @lajohn4747
- Add visualization for timeseries / sequential data - Issue #376 by @lajohn4747
- Multi table quality report should handle multi-foreign keys (to same parent) - Issue #406 by @R-Palazzo
- KeyUniqueness metric - Issue #460 by @R-Palazzo
- ReferentialIntegrity metric - Issue #461 by @R-Palazzo
- CategoryAdherence metric - Issue #462 by @R-Palazzo
- TableFormat metric - Issue #463 by @R-Palazzo
- CardinalityBoundaryAdherence metric - Issue #464 by @frances-h
- DataValidity property - Issue #467 by @R-Palazzo
- Structure property - Issue #468 by @R-Palazzo
- Relationship Validity property - Issue #469 by @R-Palazzo
- DiagnosticReport to calculate base correctness of synthetic data - Issue #471 by @R-Palazzo
- Update the synthetic data that's available for the multi-table demo - Issue #501 by @R-Palazzo
- Update the synthetic data that's available for the single-table demo - Issue #502 by @R-Palazzo
- TableStructure + fix its computation - Issue #518 by @R-Palazzo
- Sometimes graphs don't show when using Jupyter notebook - Issue #322 by @pvk-developer
- Fix ReferentialIntegrity NaN handling - Issue #494 by @R-Palazzo
- KeyUniqueness metric should only be applied to primary and alternate keys - Issue #503 by @R-Palazzo
- Single table Structure property should not have visualization - Issue #504 by @R-Palazzo
- Multi table Structure property visualization has incorrect styling - Issue #505 by @R-Palazzo
- UserWarning: KeyError: 'relationships' in DiagnosticReport if metadata missing relationships - Issue #506 by @R-Palazzo
- validate method should be private - Issue #507 by @R-Palazzo
- ValueError in DiagnosticReport if synthetic data does not match metadata - Issue #508 by @R-Palazzo
- Check if QualityReport needs the synthetic data to match the metadata - Issue #509 by @R-Palazzo
- Running single table report on multi table data (or vice versa) results in confusing error - Issue #510 by @R-Palazzo
- Add metadata validation - Issue #526 by @R-Palazzo
v0.12.1 - 2023-11-01
This release fixes a bug with the new Intertable Trends property and older pandas versions and a bug with how the ML Efficacy metric handled train and test data. Reports handle missing relationships more gracefully.
- Multiple FutureWarning lines printed out when running the Quality Report (Intertable Trends property) - Issue #490 by @frances-h
- Transformer should not be fit on test data - Issue #291 by @fealho
- Reports should not crash if there are no relationships - Issue #481 by @lajohn4747
v0.12.0 - 2023-10-31
This release adds a new property, InterTable Trends. Several plots were moved from the reports module into the new visualizations module. The metadata parameter was removed for these plots, and the plot_types parameter was added. plot_types lets the user control which plot type is used. Several crashes have been resolved.
- Provide meta information about the reports - Pull #472 by @frances-h
- Validate that the metadata is always a dict - Issue #428 by @R-Palazzo
- Expose reports module in top-level init - Pull #459 by @frances-h
- Add new get_column_pair_plot - Issue #444 by @pvk-developer
- Add InterTable Trends property - Issue #451 by @frances-h
- Add new get_column_plot - Issue #443 by @pvk-developer
- Add new get_cardinality_plot - Issue #445 by @frances-h
- Create visualizations module - Issue #442 by @frances-h, @pvk-developer
- NewRowSynthesis on datetime columns without formats - Issue #473 by @fealho
- Intertable trends property crashes if a table has no statistical columns - Issue #476 by @lajohn4747
- Fix BoundaryAdherence NaN handling - Issue #470 by @frances-h
- The Intertable Trends visualization is mislabeled as 'Column Shapes' - Issue #477 by @lajohn4747
- ValueError when using get_cardinality_plot on some schemas - Issue #447 by @frances-h
- Switch default branch from master to main - Issue #420 by @amontanez24
v0.11.1 - 2023-09-14
This release makes multiple changes to better handle errors raised from the DiagnosticReport. The report should now run to completion and list any errors it encounters in a column of the details returned by get_details. It also resolves many warnings that were interrupting the printing of the report's results and progress.
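The error handling described above follows a common pattern: run each metric inside a try/except and record the failure next to the score instead of letting it stop the report. The sketch below illustrates that pattern only; `run_metrics`, the two toy metrics and the `Metric`/`Score`/`Error` keys are hypothetical names, not the report's internals.

```python
def run_metrics(metrics, real_column, synthetic_column):
    """Run every metric and record failures instead of letting them propagate."""
    details = []
    for name, metric in metrics.items():
        row = {"Metric": name, "Score": None, "Error": None}
        try:
            row["Score"] = metric(real_column, synthetic_column)
        except Exception as error:  # deliberately broad: one failure must not stop the run
            row["Error"] = f"{type(error).__name__}: {error}"
        details.append(row)
    return details


def range_overlap(real, synthetic):
    """Toy metric: fraction of synthetic values inside the real min/max range."""
    low, high = min(real), max(real)
    return sum(1 for v in synthetic if low <= v <= high) / len(synthetic)


def always_fails(real, synthetic):
    """Toy metric that simulates a crash."""
    raise ValueError("unsupported column type")


details = run_metrics(
    {"range_overlap": range_overlap, "always_fails": always_fails},
    real_column=[1, 2, 3],
    synthetic_column=[2, 3, 4, 5],
)
for row in details:
    print(row)
```

Both rows are returned even though the second metric raised, which is the behavior the release note describes for the report as a whole.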
- Create single table coverage property - Issue #389 by @R-Palazzo
- Create single table synthesis property - Issue #390 by @R-Palazzo
- Create single table Boundaries property - Issue #391 by @R-Palazzo
- Add multi table Coverage, Synthesis and Boundaries property - Issue #393 by @R-Palazzo
- Ensure that the Synthesis property score doesn't change - Issue #425 by @amontanez24
- The Error column contains a mix of None values - Issue #427 by @pvk-developer
- Always show the get_details - Issue #429 by @frances-h
- Diagnostic explanations should not repeat if I generate multiple times - Issue #430 by @amontanez24
- RangeCoverage errors on datetime columns in DiagnosticReport - Issue #431 by @frances-h
- The coverage visualization shows empty bar graph for nan values - Issue #432 by @frances-h
- Diagnostic report should skip over all NaN columns - Issue #433 by @pvk-developer
- Quality report is printing out a long warning message (hundreds of lines) - Issue #448 by @amontanez24
- Use property classes in single table DiagnosticReport - Issue #392 by @R-Palazzo
- Use property classes in multi table DiagnosticReport - Issue #394 by @R-Palazzo
v0.11.0 - 2023-08-10
This release adds a function that allows users to plot the cardinality of foreign and primary keys in synthetic data. More specifically, it graphs the frequency that each number of children per parent row occurs in the parent table.
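The quantity being plotted can be computed in a few lines. The sketch below is a standard-library illustration of the distribution described above (how many parent rows have 0, 1, 2, ... children); it does not call the plotting function itself, and `cardinality_distribution` is a hypothetical helper name.

```python
from collections import Counter


def cardinality_distribution(parent_keys, child_foreign_keys):
    """Map each child count to the number of parent rows that have that many children."""
    children_per_parent = Counter(child_foreign_keys)
    # Parents with no children still count, with a cardinality of 0.
    return Counter(children_per_parent.get(key, 0) for key in parent_keys)


parent_ids = ["a", "b", "c", "d"]
child_parent_ids = ["a", "a", "b", "c", "c"]

distribution = cardinality_distribution(parent_ids, child_parent_ids)
for n_children in sorted(distribution):
    print(n_children, distribution[n_children])
# 0 1
# 1 1
# 2 2
```

Computing this once for the real tables and once for the synthetic tables gives the two series that such a plot would compare.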
Additionally, architectural changes are made to improve the efficiency and error handling of the QualityReport! The progress bar is also enhanced to be more informative when the report is generating.
This release also adds support for Python 3.11 and drops support for Python 3.7.
- Visualize cardinality of foreign key columns - Issue #283 by @R-Palazzo
- Create single table BaseProperty class - Issue #354 by @amontanez24
- Create single table column shapes property - Issue #355 by @R-Palazzo
- Create single table column pair trends property - Issue #356 by @R-Palazzo
- Create multi table BaseProperty class - Issue #357 by @pvk-developer
- Create multi table column shapes and column pair trends properties - Issue #358 by @R-Palazzo
- Create Parent Child Relationships property class - Issue #359 by @pvk-developer
- In Multi Table Quality Report: Rename "Table Relationships" property to "Cardinality" - Issue #360 by @frances-h
- More accurate progress bar for single table Quality Report - Issue #361 by @R-Palazzo
- More accurate progress bar for multi table Quality Report - Issue #362 by @fealho
- Raise error in CorrelationSimilarity if either column is constant - Issue #407 by @fealho
- Issue in building the denormalized table inside the Parent-Child Detection metrics - Issue #328 by @fealho
- Don't modify the rounding in the quality report - Issue #401 by @R-Palazzo
- The Cardinality property is missing some relationships - Issue #404 by @pvk-developer
- The Cardinality property is not returning a DataFrame - Issue #405 by @fealho
- Overall property score should be the average across all breakdowns - Issue #415 by @amontanez24
- Use property classes in single table QualityReport - Issue #370 by @R-Palazzo
- Use property classes in multi table QualityReport - Issue #371 by @fealho
- Add add-on detection for premium metrics - Issue #388 by @amontanez24
- Add support for Python 3.11 - Issue #353 by @amontanez24
- Drop support for Python 3.7 - Issue #380 by @amontanez24
v0.10.1 - 2023-06-06
This release fixes a bug that was causing the DiagnosticReport to crash on the NewRowSynthesis metric. It also adds support for PyTorch 2.0!
- ValueError: multi-line expressions (NewRowSynthesis metric in DiagnosticReport) - Issue #327 by @R-Palazzo
- Upgrade to torch 2.0 - Issue #347 by @fealho
v0.10.0 - 2023-05-03
This release makes the DiagnosticReport more fault tolerant by preventing it from crashing if a metric it uses fails. It also adds support for Pandas 2.0!
Additionally, support for the old SDV metadata format (pre SDV 1.0) has been dropped.
- Cleanup SDMetrics to only accept SDV 1.0 metadata format - Issue #331 by @amontanez24
- Make the diagnostic report more fault-tolerant - Issue #332 by @frances-h
- Remove upper bound for pandas - Issue #338 by @pvk-developer
v0.9.3 - 2023-04-12
This release improves the clarity of warning/error messages. We also add a version add-on, update the workflow to optimize the runtime and fix a bug in the NewRowSynthesis metric when computing the synthetic_sample_size for multi-table.
### New Features
- Add functionality to find version add-on - Issue #321 by @frances-h
- More detailed warning in QualityReport when there is a constant input - Issue #316 by @pvk-developer
- Make error more informative in QualityReport when tables cannot be merged - Issue #317 by @frances-h
- More detailed warning in QualityReport for unexpected category values - Issue #315 by @frances-h
- Multi table DiagnosticReport sets synthetic_sample_size too low for NewRowSynthesis - Issue #320 by @pvk-developer
v0.9.2 - 2023-03-08
This release fixes bugs in the NewRowSynthesis metric when too many columns were present. It also fixes bugs around datetime columns that are formatted as strings.
- Method get_column_pair_plot: Does not plot synthetic data if datetime column is formatted as a string - Issue #310 by @frances-h
- Method get_column_plot: ValueError if a datetime column is formatted as a string - Issue #309 by @frances-h
- Fix ValueError in the NewRowSynthesis metric (also impacts DiagnosticReport) - Issue #307 by @frances-h
v0.9.1 - 2023-02-17
This release fixes bugs in the existing metrics and reports.
- Fix issue-296 for discrete and continuous columns - Issue #296 by @R-Palazzo
- Support new metadata for datetime_format - Issue #303 by @frances-h
v0.9.0 - 2023-01-18
This release supports Python 3.10 and drops support for Python 3.6. We also add a verbosity argument to report generation.
- Silent mode when creating reports. - Issue #269 by @katxiao
- Support Python versions >=3.7 and <3.11 - Issue #287 by @katxiao
v0.8.1 - 2022-12-09
This release fixes bugs in the existing metrics and reports. We also make the reports compatible with future SDV versions.
- Filter out additional sdtypes that will be available in future versions of SDV - Issue #265 by @katxiao
- NewRowSynthesis should ignore PrimaryKey column - Issue #260 by @katxiao
- Visualization crashes if there are metric errors - Issue #272 by @katxiao
- Score for TVComplement if synthetic data only has missing values - Issue #271 by @katxiao
- Fix 'timestamp' column metadata in the multi table demo - Issue #267 by @katxiao
- Fix 'duration' column in the single table demo - Issue #266 by @katxiao
- README.md example has a bug - Issue #262 by @katxiao
- Update README.md to fix a bug - Issue #263 by @katxiao
- Visualization get_column_pair_plot: update parameter name to column_names - Issue #258 by @katxiao
- "Column Shapes" and "Column Pair Trends" Calculation Inconsistency - Issue #254 by @katxiao
- Diagnostic Report missing RangeCoverage for numerical columns - Issue #255 by @katxiao
v0.8.0 - 2022-11-02
This release introduces the DiagnosticReport, which helps a user verify – at a quick glance – that their data is valid. We also fix an existing bug with detection metrics.
- Fixes for new metadata - Issue #253 by @katxiao
- Add default synthetic sample size to DiagnosticReport - Issue #248 by @katxiao
- Exclude pii columns from single table metrics - Issue #245 by @katxiao
- Accept both old and new metadata - Issue #244 by @katxiao
- Address Diagnostic Report and metric edge cases - Issue #243 by @katxiao
- Update visualization average per table - Issue #242 by @katxiao
- Add save and load functionality to multi-table DiagnosticReport - Issue #218 by @katxiao
- Visualization methods for the multi-table DiagnosticReport - Issue #217 by @katxiao
- Add getter methods to multi-table DiagnosticReport - Issue #216 by @katxiao
- Create multi-table DiagnosticReport - Issue #215 by @katxiao
- Visualization methods for the single-table DiagnosticReport - Issue #211 by @katxiao
- Add getter methods to single-table DiagnosticReport - Issue #210 by @katxiao
- Create single-table DiagnosticReport - Issue #209 by @katxiao
- Add save and load functionality to single-table DiagnosticReport - Issue #212 by @katxiao
- Add single table diagnostic report - Issue #237 by @katxiao
- Detection test doesn't look at metadata when determining which columns to use - Issue #119 by @R-Palazzo
v0.7.0 - 2022-09-27
This release introduces the QualityReport, which evaluates how well synthetic data captures mathematical properties from the real data. The QualityReport incorporates the new metrics introduced in the previous release, and allows users to get detailed results, visualize the scores, and save the report for future viewing. We also add utility methods for visualizing columns and pairs of columns.
- Catch typeerror in new row synthesis query - Issue #234 by @katxiao
- Add NewRowSynthesis Metric - Issue #207 by @katxiao
- Update plot utilities API - Issue #228 by @katxiao
- Fix column pairs visualization bug - Issue #230 by @katxiao
- Save version - Issue #229 by @katxiao
- Update efficacy metrics API - Issue #227 by @katxiao
- Add RangeCoverage Metric - Issue #208 by @katxiao
- Add get_column_pairs_plot utility method - Issue #223 by @katxiao
- Parse date as datetime - Issue #222 by @katxiao
- Update error handling for reports - Issue #221 by @katxiao
- Visualization API update - Issue #220 by @katxiao
- Bug fixes for QualityReport - Issue #219 by @katxiao
- Update column pair metric calculation - Issue #214 by @katxiao
- Add get score methods for multi table QualityReport - Issue #190 by @katxiao
- Add multi table QualityReport visualization methods - Issue #192 by @katxiao
- Add plot_column visualization utility method - Issue #193 by @katxiao
- Add save and load behavior to multi table QualityReport - Issue #188 by @katxiao
- Create multi-table QualityReport - Issue #186 by @katxiao
- Add single table QualityReport visualization methods - Issue #191 by @katxiao
- Add save and load behavior to single table QualityReport - Issue #187 by @katxiao
- Add get score methods for single table Quality Report - Issue #189 by @katxiao
- Create single-table QualityReport - Issue #185 by @katxiao
- Auto apply "new" label instead of "pending review" - Issue #164 by @katxiao
- fix typo - Issue #195 by @fealho
v0.6.0 - 2022-08-12
This release removes SDMetrics' dependency on the RDT library, and also introduces new quality and diagnostic metrics. Additionally, we introduce a new compute_breakdown method that returns a breakdown of metric results.
- Handle null values correctly - Issue #194 by @katxiao
- Add wrapper classes for new single and multi table metrics - Issue #169 by @katxiao
- Add CorrelationSimilarity metric - Issue #143 by @katxiao
- Add CardinalityShapeSimilarity metric - Issue #160 by @katxiao
- Add CardinalityStatisticSimilarity metric - Issue #145 by @katxiao
- Add ContingencySimilarity Metric - Issue #159 by @katxiao
- Add TVComplement metric - Issue #142 by @katxiao
- Add MissingValueSimilarity metric - Issue #139 by @katxiao
- Add CategoryCoverage metric - Issue #140 by @katxiao
- Add compute breakdown column for single column - Issue #152 by @katxiao
- Add BoundaryAdherence metric - Issue #138 by @katxiao
- Get KSComplement Score Breakdown - Issue #130 by @katxiao
- Add StatisticSimilarity Metric - Issue #137 by @katxiao
- New features for KSTest.compute - Issue #129 by @amontanez24
- Add integration tests and fixes - Issue #183 by @katxiao
- Remove rdt hypertransformer dependency in timeseries metrics - Issue #176 by @katxiao
- Replace rdt LabelEncoder with sklearn - Issue #178 by @katxiao
- Remove rdt as a dependency - Issue #182 by @katxiao
- Use sklearn's OneHotEncoder instead of rdt - Issue #170 by @katxiao
- Remove KSTestExtended - Issue #180 by @katxiao
- Remove TSFClassifierEfficacy and TSFCDetection metrics - Issue #171 by @katxiao
- Update the default tags for a feature request - Issue #172 by @katxiao
- Bump github macos version - Issue #174 by @katxiao
- Fix pydocstyle to check sdmetrics - Issue #153 by @pvk-developer
- Update the RDT version to 1.0 - Issue #150 by @pvk-developer
- Update slack invite link - Issue #132 by @pvk-developer
v0.5.0 - 2022-05-11
This release fixes an error where the relational KSTest crashes if a table doesn't have numerical columns.
It also includes some housekeeping, updating the pomegranate and copulas version requirements.
- Cap pomegranate to <0.14.7 - Issue #116 by @csala
- Relational KSTest crashes with IncomputableMetricError if a table doesn't have numerical columns - Issue #109 by @katxiao
v0.4.1 - 2021-12-09
This release improves the handling of metric errors, and updates the default transformer behavior used in SDMetrics.
- Report metric errors from compute_metrics - Issue #107 by @katxiao
- Specify default categorical transformers - Issue #105 by @katxiao
v0.4.0 - 2021-11-16
This release adds support for Python 3.9, updates dependencies to ensure compatibility with the rest of the SDV ecosystem, and upgrades to the latest RDT release.
- pyts - Issue #103 by @pvk-developer
- Add support for Python 3.9 - Issue #102 by @pvk-developer
- Increase code style lint - Issue #80 by @fealho
- CI workflows - Issue #79 by @pvk-developer
- Upgrade dependency ranges - Issue #69 by @katxiao
v0.3.2 - 2021-08-16
This release makes pomegranate an optional dependency.
- Make pomegranate an optional dependency - Issue #63 by @fealho
v0.3.1 - 2021-07-12
This release fixes a bug to make the privacy metrics available in the API docs. It also updates dependencies to ensure compatibility with the rest of the SDV ecosystem.
- CategoricalSVM not being imported - Issue #65 by @csala
v0.3.0 - 2021-03-30
This release includes privacy metrics to evaluate if the real data could be obtained or deduced from the synthetic samples. Additionally, all the metrics have a normalize method, which takes the raw_score generated by the metric and returns a normalized value.
- Add normalize method to metrics - Issue #51 by @csala and @fealho
- Implement privacy metrics - Issue #36 by @ZhuofanXie and @fealho
v0.2.0 - 2021-02-24
Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.
v0.1.3 - 2021-02-13
Updates the required dependencies to facilitate a conda release.
- Upgrade sktime - Issue #49 by @fealho
v0.1.2 - 2021-01-27
Bug fixing release that addresses several minor errors.
- More splits than classes - Issue #46 by @fealho
- Scipy 1.6.0 causes an AttributeError - Issue #44 by @fealho
- Time series metrics fails with variable length timeseries - Issue #42 by @fealho
- ParentChildDetection metrics KeyError - Issue #39 by @csala
v0.1.1 - 2020-12-30
This version adds Time Series Detection and Efficacy metrics, as well as a fix to ensure that Single Table binary classification efficacy metrics work well with binary targets which are not boolean.
- Timeseries efficacy metrics - Issue #35 by @csala
- Timeseries detection metrics - Issue #34 by @csala
- Ensure binary classification targets are bool - Issue #33 by @csala
v0.1.0 - 2020-12-18
This release introduces a new project organization and API, with metrics grouped by data modality, with a common API:
- Single Column
- Column Pair
- Single Table
- Multi Table
- Time Series
Within each data modality, different families of metrics have been implemented:
- Bayesian Network and Gaussian Mixture Likelihood
- Machine Learning Efficacy
v0.0.4 - 2020-11-27
Patch release to relax dependencies and avoid conflicts when using the latest SDV version.
v0.0.3 - 2020-11-20
Fix error on detection metrics when input data contains infinity or NaN values.
- ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala
v0.0.2 - 2020-08-08
Add support for Python 3.8 and a broader range of dependencies.
v0.0.1 - 2020-06-26
First release to PyPI.