A tool for managing survey/administrative data.
OpenFisca Survey Manager
[EN] Introduction
OpenFisca is versatile, free microsimulation software. You can check the online documentation for more details.
This repository contains the Survey-Manager module, to work with OpenFisca and survey data.
It provides two main features:
- A Python API to access data in Hierarchical Data Format (HDF) or Parquet.
- A script that transforms SAS, Stata, SPSS, and CSV data files to HDF data files, along with some metadata so they can be used by the Python API. Parquet files are kept as is.
For French survey data, you might find useful information on the next steps in the openfisca-france-data repository.
Environment
OpenFisca-Survey-Manager runs on Python 3.9. More recent versions should work, but are not tested.
Usage
Installation
Install with pip
If you're developing your own script or looking to run OpenFisca-Survey-Manager without editing it, you don't need to get its source code. It just needs to be known by your environment.
To do so, first install the package with pip:
pip install --upgrade pip
pip install openfisca-survey-manager
This should not display any error and end with:
Successfully installed [... openfisca-survey-manager-xx.xx.xx ...]
It comes with the build-collection command, which we will use in the next steps.
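To confirm from Python that the installation succeeded, a minimal sketch using only the standard library could look like the following. The helper name `installed_version` is hypothetical; the distribution name and the `build-collection` command are the ones used in the steps above.

```python
from importlib.metadata import PackageNotFoundError, version
import shutil


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name as used in the pip command above.
    print("version:", installed_version("openfisca-survey-manager"))
    # shutil.which returns the full path of the script, or None if it is not on PATH.
    print("build-collection:", shutil.which("build-collection"))
```

If either value is None, the package is probably installed in a different environment from the one running the check.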
If you want to improve this module, please see the Development section below.
Install with Conda
Create an environment and install openfisca-survey-manager:
conda create -n survey python=3.9
conda activate survey
conda install -c conda-forge -c openfisca openfisca-survey-manager
You are ready to go!
To exit your environment:
conda deactivate
Getting the configuration directory path
To use OpenFisca-Survey-Manager, you have to create two configuration files: raw_data.ini and config.ini.
To know where to copy them to, use the following command:
build-collection --help
You should get the following result.
usage: build-collection [-h] -c COLLECTION [-d] [-m] [-p PATH] [-s SURVEY]
                        [-v]

optional arguments:
  -h, --help            show this help message and exit
  -c COLLECTION, --collection COLLECTION
                        name of collection to build or update
  -d, --replace-data    erase existing survey data HDF5 file (instead of
                        failing when HDF5 file already exists)
  -m, --replace-metadata
                        erase existing collection metadata JSON file (instead
                        of just adding new surveys)
  -p PATH, --path PATH  path to the config files directory (default =
                        /your/path/.config/openfisca-survey-manager)
  -s SURVEY, --survey SURVEY
                        name of survey to build or update (default = all)
  -v, --verbose         increase output verbosity
Take note of the default configuration directory path given in the description of the -p PATH, --path PATH option. This is the directory where you will put your raw_data.ini and config.ini files. In this example, it is /your/path/.config/openfisca-survey-manager.
If you want to use a different path, you can pass the --path /another/path option to build-collection. This feature is still experimental though.
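Once the two files are in place, a quick check that the configuration directory holds both of them can be sketched as follows. The helper name `missing_config_files` is hypothetical and not part of OpenFisca-Survey-Manager; only the two file names come from this document.

```python
from pathlib import Path

# The two configuration files named in this section.
REQUIRED_FILES = ("raw_data.ini", "config.ini")


def missing_config_files(config_dir):
    """Return the names of the required configuration files absent from config_dir."""
    directory = Path(config_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

Calling `missing_config_files("/your/path/.config/openfisca-survey-manager")` returns an empty list when both files are present.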
Editing the config files
Configuration files are INI files (text files).
The raw_data.ini file lists your input surveys, while config.ini specifies the paths to the SurveyManager outputs.
raw_data.ini and config.ini must not be committed (they are already ignored by .gitignore).
raw_data.ini, for inputs configuration
To initialise your raw_data.ini file, you can follow these steps:
- Copy the template file raw_data_template.ini to the configuration directory path you identified in the previous step and rename it to raw_data.ini. Ex: /your/path/.config/openfisca-survey-manager/raw_data.ini
- Edit the latter by adding a section title for your survey. For example, if you name your survey housing_survey, you should get a line with:
[housing_survey]
- Add a reference to the location of your raw data directory (SAS, Stata DTA, SPSS, or CSV files). For paths in Windows, use / instead of \ to separate folders. You do not need to put quotes, even when the path name contains spaces.
Your file should look like this:
[housing_survey]
2014 = /path/to/your/raw/data/HOUSING_2014
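For Windows users, the conversion from backslashes to forward slashes can be sketched with the standard `pathlib` module. The helper name `to_ini_path` and the example path are hypothetical.

```python
from pathlib import PureWindowsPath


def to_ini_path(windows_path):
    """Convert a Windows path to the forward-slash form expected in raw_data.ini."""
    return PureWindowsPath(windows_path).as_posix()


# Spaces are kept as they are; no quoting is added.
print(to_ini_path(r"C:\Users\you\raw data\HOUSING_2014"))
# prints C:/Users/you/raw data/HOUSING_2014
```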
You can also set multiple surveys as follows:
[revenue_survey]
2014 = /path/to/your/raw/data/REVENUE_2014
2015 = /path/to/your/raw/data/REVENUE_2015
2016 = /path/to/your/raw/data/REVENUE_2016
[housing_survey]
2014 = /path/to/your/raw/data/HOUSING_2014
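To sanity-check such a file before building a collection, one option is to read it with Python's standard `configparser`. This is only a sketch for inspecting the file yourself, not a description of how Survey Manager parses it; the helper name `list_surveys` is hypothetical.

```python
from configparser import ConfigParser

RAW_DATA_INI = """
[revenue_survey]
2014 = /path/to/your/raw/data/REVENUE_2014
2015 = /path/to/your/raw/data/REVENUE_2015
2016 = /path/to/your/raw/data/REVENUE_2016

[housing_survey]
2014 = /path/to/your/raw/data/HOUSING_2014
"""


def list_surveys(ini_text):
    """Map each section name to a {key: raw data directory} dictionary."""
    # interpolation=None keeps any '%' character in a path literal.
    parser = ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}


surveys = list_surveys(RAW_DATA_INI)
print(sorted(surveys["revenue_survey"]))  # prints ['2014', '2015', '2016']
```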
config.ini, for outputs configuration
To initialise your config.ini file:
- Copy its template file config_template.ini to your configuration directory and rename it to config.ini. Ex: /your/path/.config/openfisca-survey-manager/config.ini
- Define a collections_directory path where the SurveyManager will generate the JSON description of your survey inputs and outputs. Ex: /.../openfisca-survey-manager/transformed_housing_survey
For a housing_survey, you will get a /.../openfisca-survey-manager/transformed_housing_survey/housing_survey.json file.
- Define an output_directory where the generated HDF file will be stored. This directory could be a sub-directory of your collections_directory.
- Define a tmp_directory that will store temporary calculation results. Its content will be deleted at the end of the calculation. This directory could be a sub-directory of your collections_directory.
Your config.ini file should look similar to this:
Your config.ini
file should look similar to this:
[collections]
collections_directory = /path/to/your/collections/directory
[data]
output_directory = /path/to/your/data/output/directory
tmp_directory = /path/to/your/data/tmp/directory
Make sure those directories exist, otherwise the script will fail.
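Since the script fails on missing directories, it can help to create them up front. The sketch below reads config.ini with the standard `configparser` and creates any directory that is missing; the helper name `ensure_directories` is hypothetical, and the section and option names are taken from the example above.

```python
from configparser import ConfigParser
from pathlib import Path

# (section, option) pairs taken from the config.ini example above.
DIRECTORY_OPTIONS = (
    ("collections", "collections_directory"),
    ("data", "output_directory"),
    ("data", "tmp_directory"),
)


def ensure_directories(config_path):
    """Create every directory named in config.ini that does not exist yet."""
    parser = ConfigParser(interpolation=None)
    # ConfigParser.read() silently skips missing files, so open the file explicitly.
    with open(config_path, encoding="utf-8") as config_file:
        parser.read_file(config_file)
    created = []
    for section, option in DIRECTORY_OPTIONS:
        directory = Path(parser.get(section, option))
        if not directory.is_dir():
            directory.mkdir(parents=True, exist_ok=True)
            created.append(directory)
    return created
```

The function returns the list of directories it had to create, so an empty list means everything was already in place.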
Building the HDF5 files
This step reads your configuration files and your survey data, and generates an HDF5 file (.h5) for your survey.
To build the HDF5 files, we'll use the build-collection script.
Here is an example for one survey with a single series: our housing_survey, which only has a 2014 series. We refer to our survey as a collection (with the -c option) and build the HDF5 file with this command:
build-collection -c housing_survey -d -m -v
The -d and -m options put you on the safe side, as they remove previous outputs if they exist.
It will generate:
- A housing_survey.json listing a housing_survey_2014 survey with both:
  - your input tables and your input file paths in an informations key,
  - the transformed survey path in a hdf5_file_path key.
- Your transformed survey in a housing_survey_2014.h5 file.
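To look up where each transformed survey was written, the generated JSON can be read with the standard `json` module. The sketch below rests on an assumption that is not confirmed by this document: that the surveys are stored under a top-level "surveys" key. The helper name `hdf5_paths` and the example content are hypothetical; only the hdf5_file_path and informations keys are described above.

```python
import json


def hdf5_paths(collection_description):
    """Return {survey name: hdf5_file_path} from a parsed collection description."""
    # ASSUMPTION: surveys are stored under a top-level "surveys" key.
    surveys = collection_description.get("surveys", {})
    return {name: survey.get("hdf5_file_path") for name, survey in surveys.items()}


# Hypothetical content, shaped after the description above.
example = json.loads("""
{
  "surveys": {
    "housing_survey_2014": {
      "hdf5_file_path": "/path/to/your/data/output/directory/housing_survey_2014.h5",
      "informations": {}
    }
  }
}
""")
print(hdf5_paths(example))
```

For a real file, replace `example` with `json.load(...)` on your housing_survey.json and adjust the key lookup if the layout differs.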
build-collection, what else?
As build-collection --help shows, other options exist. Here are other usage examples.
If you have multiple series of one survey, like the revenue_survey, you can build the 2015 series only with:
build-collection -c revenue_survey -s 2015 -d -m -v
Or if you want to specify a different configuration directory path:
build-collection -p /another/path -c housing_survey -s 2014 -d -m -v
The --path /another/path option is still experimental though. It should work. If it doesn't, please do not hesitate to open an issue.
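When the build has to be driven from a Python script, the command lines above can be assembled and run with the standard `subprocess` module. This is a sketch only: the function names `build_collection_command` and `run_build` are hypothetical, and the flags are the ones listed by build-collection --help.

```python
import subprocess


def build_collection_command(collection, survey=None, path=None,
                             replace_data=True, replace_metadata=True,
                             verbose=True):
    """Assemble the build-collection argument list from the options shown in --help."""
    command = ["build-collection", "-c", collection]
    if path is not None:
        command += ["-p", path]
    if survey is not None:
        command += ["-s", str(survey)]
    if replace_data:
        command.append("-d")
    if replace_metadata:
        command.append("-m")
    if verbose:
        command.append("-v")
    return command


def run_build(collection, **options):
    """Run build-collection and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_collection_command(collection, **options), check=True)


print(build_collection_command("revenue_survey", survey=2015))
# prints ['build-collection', '-c', 'revenue_survey', '-s', '2015', '-d', '-m', '-v']
```

Passing the arguments as a list avoids shell quoting issues when a configuration path contains spaces.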
Parquet files
Parquet files can be used as input files. They will not be converted to HDF5. As a Parquet file can only contain one table, we add a "parquet_file" key to each table in a survey. This key contains the path to the Parquet file, or to the folder containing the Parquet files for that table.
If you use a folder, you have to name your files with the following pattern: some_name_-<number>.parquet and keep only the files for the same table in the same folder.
If a single file contains a whole table, you can keep the files for different tables in the same folder.
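Before pointing a "parquet_file" key at a folder, the file names can be checked against the naming pattern with a short script. The sketch below assumes that <number> is one or more digits placed right before the .parquet extension; the helper name `sorted_parts` is hypothetical, and the numeric ordering is for illustration only.

```python
import re
from pathlib import Path

# ASSUMPTION: <number> is one or more digits placed right before ".parquet".
PART_PATTERN = re.compile(r"^(?P<name>.+)-(?P<number>\d+)\.parquet$")


def sorted_parts(folder):
    """Return the Parquet part files of a folder, ordered by their number."""
    parts = []
    for path in Path(folder).glob("*.parquet"):
        match = PART_PATTERN.match(path.name)
        if match is None:
            raise ValueError(f"{path.name} does not match <name>-<number>.parquet")
        parts.append((int(match.group("number")), path))
    return [path for _, path in sorted(parts)]
```

A ValueError here signals a file that does not follow the pattern and should be renamed or moved out of the folder.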
Development
If you want to contribute to OpenFisca-Survey-Manager, you are welcome to! To install it locally in development mode:
git clone https://github.com/openfisca/openfisca-survey-manager.git
cd openfisca-survey-manager
make install
Testing
To run the entire test suite:
make test
To run the entire test suite with the same config as in Continuous Integration (CI):
CI=True make test
Style
This repository adheres to a certain coding style, and we invite you to follow it for your contributions to be integrated promptly.
To run the style checker:
make check-style
To automatically style-format your code changes:
make format-style
To automatically style-format your code changes each time you commit:
touch .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
tee -a .git/hooks/pre-commit << END
#!/bin/sh
#
# Automatically format your code before committing.
exec make format-style
END