Conditional microdata synthesis using normalizing flows
micro
Overview
micro synthesizes survey microdata while preserving:
- Conditional relationships: Generate target variables given demographics
- Zero-inflated distributions: Handle variables that are 0 for many observations
- Joint correlations: Preserve relationships between target variables
- Hierarchical structures: Keep household/firm compositions intact
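A common way to handle zero-inflated variables is a two-part model: first decide whether the value is zero, then draw a positive amount only for the non-zero cases. The sketch below illustrates that idea with the standard library only. It is not micro's implementation: the function name `sample_zero_inflated`, the log-normal stand-in for the learned flow, and all parameter values are assumptions made for illustration.

```python
import math
import random


def sample_zero_inflated(p_zero, mu, sigma, rng):
    """Draw one value from a two-part model: point mass at zero plus a positive tail."""
    # Part 1: decide whether this observation is a zero.
    if rng.random() < p_zero:
        return 0.0
    # Part 2: otherwise draw a strictly positive amount.
    # A log-normal stands in for the learned flow here (an assumption).
    return math.exp(rng.gauss(mu, sigma))


rng = random.Random(0)
draws = [
    sample_zero_inflated(p_zero=0.4, mu=10.0, sigma=0.5, rng=rng)
    for _ in range(10_000)
]
zero_share = sum(d == 0.0 for d in draws) / len(draws)
print(f"share of zeros: {zero_share:.3f}")
```

In a conditional setting, `p_zero`, `mu`, and `sigma` would depend on the demographics of each record rather than being fixed constants.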
Installation
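pip install microsynth
The distribution is published on PyPI as microsynth; the import name is micro, as used in the Quick Start.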
Quick Start
from micro import Synthesizer
import pandas as pd
# Load training data with known target variables
training_data = pd.read_csv("survey_with_income.csv")
# Initialize synthesizer
synth = Synthesizer(
    target_vars=["income", "expenditure", "savings"],
    condition_vars=["age", "education", "region"],
)
# Fit on training data
synth.fit(training_data, weight_col="weight", epochs=100)
# Generate synthetic targets for new demographics
new_demographics = pd.read_csv("demographics_only.csv")
synthetic = synth.generate(new_demographics)
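Because the training data carries survey weights, a reasonable sanity check on generated output is to compare weighted summaries of a target variable in the training data against the same summaries in the synthetic data. The helpers below are a minimal standard-library sketch of such summaries. The names `weighted_mean` and `weighted_quantile` are hypothetical and not part of micro, and the numbers are made-up toy values.

```python
def weighted_mean(values, weights):
    """Mean of values where each observation counts in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    threshold = q * sum(weights)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= threshold:
            return value
    return pairs[-1][0]


# Toy stand-ins for one target column and its survey weights.
income = [0.0, 0.0, 20_000.0, 35_000.0, 90_000.0]
weight = [1.0, 3.0, 2.0, 2.0, 2.0]

print(weighted_mean(income, weight))
print(weighted_quantile(income, weight, 0.5))
```

With pandas data, the same functions can be fed from columns, for example `training_data["income"].tolist()` and `training_data["weight"].tolist()`.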
Why micro?
| Feature | micro | CT-GAN | TVAE | synthpop |
|---|---|---|---|---|
| Conditional generation | ✅ | ❌ | ❌ | ❌ |
| Zero-inflation handling | ✅ | ❌ | ❌ | ⚠️ |
| Exact likelihood | ✅ | ❌ | ❌ | N/A |
| Stable training | ✅ | ⚠️ | ✅ | ✅ |
| Preserves source structure | ✅ | ❌ | ❌ | ⚠️ |
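The "Exact likelihood" row refers to a general property of normalizing flows: because the map between data and latent noise is invertible, the density of a data point follows from the change-of-variables formula, log p(x | c) = log p_Z(z) + log |dz/dx|. The sketch below shows this for the simplest possible case, a one-dimensional affine flow whose parameters depend on a context value. The `conditioner` function and its numbers are hypothetical and chosen only for illustration; flows used in practice stack many richer invertible layers.

```python
import math


def standard_normal_logpdf(z):
    """Log density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, shift, log_scale):
    """Exact log p(x) for x = shift + exp(log_scale) * z, with z ~ N(0, 1)."""
    z = (x - shift) * math.exp(-log_scale)  # inverse map x -> z
    log_det = -log_scale                    # log |dz/dx|
    return standard_normal_logpdf(z) + log_det


def conditioner(age):
    """Hypothetical conditioner: maps a context value to (shift, log_scale)."""
    return 0.1 * age, math.log(2.0)


shift, log_scale = conditioner(age=40)
print(affine_flow_logpdf(5.0, shift, log_scale))
```

For this affine case the result coincides with the log density of a normal distribution with mean `shift` and standard deviation `exp(log_scale)`, which gives a simple way to check the formula.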
Use Cases
- Survey enhancement: Impute income variables from tax data onto census demographics
- Privacy-preserving synthesis: Generate synthetic data that preserves statistical properties without copying real records
- Data fusion: Combine variables from multiple surveys with different sample designs
- Missing data imputation: Fill in missing values conditioned on observed variables
Architecture
┌─────────────────────────────────────────────────────────┐
│                       Synthesizer                       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Training:                                              │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ Training │───▶│ Transformer  │───▶│ Normalizing  │   │
│  │ Data     │    │ (log, std)   │    │ Flow         │   │
│  └──────────┘    └──────────────┘    └──────────────┘   │
│                                                         │
│  Generation:                                            │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ Context  │───▶│ Zero + Flow  │───▶│ Inverse      │   │
│  │ Vars     │    │ Sampling     │    │ Transform    │   │
│  └──────────┘    └──────────────┘    └──────────────┘   │
│                                                         │
└─────────────────────────────────────────────────────────┘
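The "Transformer (log, std)" and "Inverse Transform" boxes suggest a log transform followed by standardization during training, undone after sampling. The class below is a minimal sketch of such an invertible preprocessing step under that reading. `LogStandardizer` is a hypothetical name, not micro's actual class, and the sketch assumes strictly positive inputs, with zeros handled separately by the zero-inflation step.

```python
import math
import statistics


class LogStandardizer:
    """Log-transform positive values, then standardize; invertible by construction."""

    def fit(self, values):
        logs = [math.log(v) for v in values]
        self.mean_ = statistics.fmean(logs)
        self.std_ = statistics.pstdev(logs)
        return self

    def transform(self, values):
        # Forward direction, applied before fitting the flow.
        return [(math.log(v) - self.mean_) / self.std_ for v in values]

    def inverse_transform(self, z_values):
        # Reverse direction, applied to samples drawn from the flow.
        return [math.exp(z * self.std_ + self.mean_) for z in z_values]


# Made-up positive values standing in for a target column.
positive_incomes = [12_000.0, 30_000.0, 45_000.0, 150_000.0]
scaler = LogStandardizer().fit(positive_incomes)
z = scaler.transform(positive_incomes)
restored = scaler.inverse_transform(z)
```

The log step compresses the long right tail typical of income-like variables, and standardization puts all targets on a comparable scale before the flow sees them.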
Documentation
Full documentation at cosilicoai.github.io/micro
Benchmarks
See benchmarks/ for comparisons against:
- CT-GAN: Conditional Tabular GAN (from SDV)
- TVAE: Tabular VAE (from SDV)
- Copulas: Gaussian copula synthesis (from SDV)
- synthpop: CART-based synthesis (R package, via rpy2)
Citation
@software{micro2024,
  author = {Cosilico},
  title = {micro: Conditional microdata synthesis using normalizing flows},
  year = {2024},
  url = {https://github.com/CosilicoAI/micro}
}
License
MIT License - see LICENSE for details.