Package for creating synthetic datasets from existing datasets
MetaSynth
MetaSynth is a Python package for generating synthetic data, geared mostly towards code testing and reproducibility. Under the ONS methodology, MetaSynth falls in the "augmented plausible" category. To generate synthetic data, MetaSynth first converts a pandas DataFrame into a data structure that follows the GMF standard file format. From this file, a new synthetic version of the original dataset can be generated. A GMF file is human-readable JSON, so privacy experts can sanitize it before public use.
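The idea of going from data to a readable metadata file and back to synthetic data can be sketched with the standard library alone. This is a minimal illustration, not MetaSynth's implementation: the function names `fit_metadata` and `synthesize`, the JSON keys, and the toy values are all assumptions made for the example, and the JSON layout is not the GMF format.

```python
import json
import random
import statistics


def fit_metadata(columns):
    """Summarise each column. Only these summaries are kept, not the rows."""
    meta = {}
    for name, values in columns.items():
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: describe it with a normal distribution.
            meta[name] = {
                "type": "float",
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values),
            }
        else:
            # Anything else: store the labels and their relative frequencies.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            meta[name] = {
                "type": "categorical",
                "labels": list(counts),
                "weights": [c / len(values) for c in counts.values()],
            }
    return meta


def synthesize(meta, n_rows, seed=0):
    """Draw new values from the stored summaries only."""
    rng = random.Random(seed)
    out = {}
    for name, spec in meta.items():
        if spec["type"] == "float":
            out[name] = [rng.gauss(spec["mean"], spec["stdev"]) for _ in range(n_rows)]
        else:
            out[name] = rng.choices(spec["labels"], weights=spec["weights"], k=n_rows)
    return out


# Toy data, made up for this sketch.
original = {
    "age": [22.0, 38.0, 26.0, 35.0],
    "sex": ["male", "female", "female", "male"],
}

# The metadata is plain JSON text, so it can be reviewed or edited by hand.
metadata_json = json.dumps(fit_metadata(original), indent=2)
synthetic = synthesize(json.loads(metadata_json), n_rows=5)
```

The synthetic rows are generated from the JSON text alone, which is why a reviewer can inspect or redact that file before it is shared.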
Features
- Automatic and manual distribution fitting
- Generate pandas DataFrames with the same column types as the original
- Many datatypes: categorical, string, integer, float, date, time, datetime.
- Integrates with the faker package.
- Structured string detection.
- Variables that have unique values/keys.
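To give a feel for what structured string detection and unique keys could involve, here is a small standard-library sketch. It is an assumption-laden illustration, not the algorithm MetaSynth uses: the helpers `signature`, `detect_structure`, and `generate_unique` are hypothetical names, and the sample values are made up.

```python
import random
import string


def signature(text):
    """Map each character to a class: 'L' letter, 'D' digit, else the literal character."""
    return tuple("L" if ch.isalpha() else "D" if ch.isdigit() else ch for ch in text)


def detect_structure(values):
    """Return the shared signature if every value has the same one, otherwise None."""
    signatures = {signature(v) for v in values}
    return signatures.pop() if len(signatures) == 1 else None


def generate_unique(sig, n, seed=0):
    """Generate n distinct strings that follow the given signature."""
    capacity = 1
    for s in sig:
        capacity *= 26 if s == "L" else 10 if s == "D" else 1
    if n > capacity:
        raise ValueError("signature cannot produce that many unique values")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        seen.add("".join(
            rng.choice(string.ascii_uppercase) if s == "L"
            else rng.choice(string.digits) if s == "D"
            else s
            for s in sig
        ))
    return sorted(seen)


# Made-up identifiers that share a letter-digit-digit layout.
codes = ["C85", "E46", "B28"]
structure = detect_structure(codes)
new_codes = generate_unique(structure, n=5)
```

Values with mixed layouts, such as `["C85", "C123"]`, have no single signature here, so `detect_structure` returns `None` and a free-form string model would be the fallback.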
Example
To process a dataset, first load it into a pandas DataFrame. As an example, we will use the Titanic dataset:
import pandas as pd

# Read label-like columns as categories and free-text columns as strings
dtypes = {
    "Survived": "category", "Pclass": "category", "Name": "string",
    "Sex": "category", "SibSp": "category", "Parch": "category",
    "Ticket": "string", "Cabin": "string", "Embarked": "category"
}
df = pd.read_csv("titanic.csv", dtype=dtypes)
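Before creating the metadata, it can be worth checking that the dtype mapping took effect, since the column types decide how each variable is treated. The sketch below assumes a tiny in-memory CSV with made-up rows instead of the real titanic.csv; the name `toy_df` and the sample values are illustrative only.

```python
import io

import pandas as pd

# Made-up rows, not taken from the real Titanic file.
csv_text = """Survived,Pclass,Name,Sex,Age
0,3,Person A,male,30
1,1,Person B,female,41
1,2,Person C,female,
"""

dtypes = {
    "Survived": "category", "Pclass": "category",
    "Name": "string", "Sex": "category",
}
toy_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Columns named in the mapping get the requested dtype;
# "Age" is left for pandas to infer.
print(toy_df.dtypes)
```

Columns that are not listed in the mapping, such as "Age" here, keep whatever type pandas infers for them.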
From the pandas DataFrame, we create a metadataset and store it in a JSON file that follows the GMF standard:
dataset = MetaDataset.from_dataframe(df)
dataset.to_json("test.json")
Contributing
Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
To contribute:
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Contact
MetaSynth is a project by the ODISSEI Social Data Science (SoDa) team.
Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren or Raoul Schram.
Hashes for metasynth-0.1.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b9b0307b34dd2aeaaf02cce874ccaea68631d80423ee68ec07261e4e17f9ac58
MD5 | 4cf4ada6a2387c746cde0429271cdd2e
BLAKE2b-256 | 14b9f26c724da33fb5cb5ab946b0fde00883dd2157ff42082abebdc0feb9188c