Tools to download and process public datasets from Peru (INEI & BCRP).
PyPeruStats
Allows downloading data from various data sources in Peru.
Sources: INEI, BCRP
Installation
pip install pyperustats
INEI
Parameters
MICRODATOS_INEI
- survey: Survey type ('enaho', 'enapres', 'endes'). Available up to 2024, Quarter 3.
download_default
- format: Output file format
  - 'csv': CSV files
  - 'stata': Stata files
  - 'spss': SPSS files
- force: Force re-download of existing files
- remove_zip: Remove ZIP files after extraction
- workers: Number of workers for parallel download
- zip_dir: Directory to store ZIP files
organize_files
- dir_output: Directory where organized files will be saved
- order_by: Organization method
  - 'modules': Structure "mod_01/year_n.csv"
  - 'year': Structure "year_n/mod_n"
- ext_documentation: List of documentation extensions
- delete_master_dir: Delete master directory after organizing
Usage
from pyPeruStats import MICRODATOS_INEI, print_tree
# Options: enaho, enapres, endes, available up to 2024-Quarter 3
enaho = MICRODATOS_INEI(survey="enaho")
modules = enaho.modules
# Found modules
print(modules.head(2))
codigo_modulo modulo anio
0 1 Características de la Vivienda y del Hogar 2024 ...
1 2 Características de los Miembros del Hogar 2024 ...
downloaded = enaho.search(
[2021, 2023, 2004, 2006, 2007, 2008], [1, 2, 3, 8]
).download_default(
format='csv', # csv, stata, spss
force=False, # download zip files again
remove_zip=False, # remove original zips from microdata page
workers=4, # Parallel download
zip_dir="trash_zips" # where zips will be downloaded
)
# Downloaded files within directory
print_tree('./trash_zips/')
📁 trash_zips
└── 📁 inei_enaho_download
    └── 📁 2004
        ├── 📁 2004_01
        │   ├── 📁 280-Modulo01
        │   │   ├── 📄 CED-01-100 2004.pdf
        │   │   ├── 📄 Diccionario.pdf
        │   │   ├── 📄 enaho01-2004-100.dta
        │   │   ├── 📄 Ficha Tecnica - 2004.pdf
        ├── 📁 2004_02
        ....
result_files = downloaded.organize_files(
dir_output="./data_inei/", # Where files will be saved
order_by="modules", # "modules": layout "mod_01/year_n.csv"; "year": layout "year_n/mod_n"
ext_documentation=['pdf'], # files used for documentation
delete_master_dir=False # true if you want to delete all zip files and unzip again (use with caution)
)
print_tree("./data_inei/") # print file structure
📁 data_inei
├── 📁 documentation_pdf
│   ├── 📄 2004_01_ced-01-100_2004.pdf
│   ├── 📄 2004_01_diccionario.pdf
│   ├── 📄 2004_01_ficha_tecnica_-_2004
│   ...
└── 📁 modules
    ├── 📁 001
    │   ├── 📄 2004.dta
    │   ├── 📄 2006.dta
    │   ├── 📄 2007.dta
    │   ├── 📄 2008.csv
    │   ├── 📄 2021.csv
    │   ├── 📄 2023.csv
    ├── 📁 002
    │   ├── 📄 2004.dta
    │   ├── 📄 2006.dta
    │   ├── 📄 2007.dta
    │   ├── 📄 2008.csv
    ....
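Because the organized output can mix formats (the listing above shows both .dta and .csv files in the same module folder), it can help to index the files before loading them. The following is a minimal sketch using only the standard library. The helper name index_modules is hypothetical and not part of PyPeruStats, and it assumes the "modules" layout shown above, with one file per year named after the year.

```python
from __future__ import annotations

from pathlib import Path


def index_modules(root: str) -> dict[str, dict[int, Path]]:
    """Map each module folder name to {year: data file path}.

    Hypothetical helper: assumes the layout <root>/modules/<module>/<year>.<ext>.
    """
    index: dict[str, dict[int, Path]] = {}
    modules_dir = Path(root) / "modules"
    if not modules_dir.is_dir():
        return index  # nothing organized yet
    for path in sorted(modules_dir.glob("*/*")):
        # Keep only files whose name (without extension) is a year, e.g. 2004.dta
        if path.is_file() and path.stem.isdigit():
            index.setdefault(path.parent.name, {})[int(path.stem)] = path
    return index


# Example: list the years and file formats available for module 001
for year, path in index_modules("./data_inei").get("001", {}).items():
    print(year, path.suffix)
```

With such an index, each file can be dispatched to the matching reader by its suffix, for example pandas.read_stata for .dta and pandas.read_csv for .csv.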
Notes
- Parallel download significantly improves performance but consumes more resources
- It's recommended to keep original ZIP files as backup
- Check disk space before downloading multiple years/modules
- Documentation files are organized in a separate directory
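The disk-space note can be checked programmatically before starting a large download. This is a minimal sketch with the standard library. The helper name has_free_space and the 5 GB threshold are illustrative assumptions, not values taken from PyPeruStats.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Example: warn before downloading several years of microdata.
# The 5 GB threshold is an arbitrary illustration; adjust it to your selection.
if not has_free_space(".", required_gb=5.0):
    print("Less than 5 GB free: consider fewer years/modules or another zip_dir")
```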
BCRP
Current Issues with the Source Data
- Inconsistent Data Formats Across Frequencies
  - Spanish Month Abbreviations. Example: "Ene05" (January 2005, with the month abbreviated in Spanish).
  - Complex Date Strings. Example: "31Ene05" combines day, month (abbreviated in Spanish), and year, requiring parsing.
  - Quarterly Indicators. Example: "T113" indicates the 1st quarter of 2013 and needs transformation to a standard format.
- Additional Steps Required for Proper DataFrame Conversion
  - Converting non-standard date strings to a format recognized by pandas or similar libraries.
  - Harmonizing date formats across daily, monthly, quarterly, and annual frequencies.
- Slow Response Time from the BCRP UI
  - The platform often experiences delays when fetching data, impacting the efficiency of workflows.
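The date formats listed above can be normalized with a small parser. The sketch below is not the PyPeruStats implementation. The function name parse_bcrp_date, the month table (which accepts both "Sep" and "Set" for September), the four-digit annual format, and the two-digit-year pivot (00-68 map to 2000-2068, 69-99 to 1969-1999) are all assumptions made for illustration.

```python
import re
from datetime import date

# Spanish month abbreviations; both "sep" and "set" are accepted (assumption).
MONTHS = {
    "ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
    "jul": 7, "ago": 8, "sep": 9, "set": 9, "oct": 10, "nov": 11, "dic": 12,
}

_QUARTERLY = re.compile(r"^T([1-4])(\d{2})$")            # e.g. "T113"
_DAILY = re.compile(r"^(\d{1,2})([A-Za-z]{3})(\d{2})$")  # e.g. "31Ene05"
_MONTHLY = re.compile(r"^([A-Za-z]{3})(\d{2})$")         # e.g. "Ene05"
_ANNUAL = re.compile(r"^(\d{4})$")                       # e.g. "2005" (assumed format)


def _full_year(yy: int) -> int:
    """Expand a two-digit year using an assumed pivot of 68."""
    return 2000 + yy if yy <= 68 else 1900 + yy


def _month(abbr: str) -> int:
    try:
        return MONTHS[abbr.lower()]
    except KeyError:
        raise ValueError(f"Unknown month abbreviation: {abbr!r}") from None


def parse_bcrp_date(raw: str) -> date:
    """Convert a BCRP-style period label to the first date of that period."""
    text = raw.strip()
    match = _QUARTERLY.match(text)
    if match:
        quarter, yy = int(match.group(1)), int(match.group(2))
        return date(_full_year(yy), 3 * (quarter - 1) + 1, 1)
    match = _DAILY.match(text)
    if match:
        day, abbr, yy = match.groups()
        return date(_full_year(int(yy)), _month(abbr), int(day))
    match = _MONTHLY.match(text)
    if match:
        abbr, yy = match.groups()
        return date(_full_year(int(yy)), _month(abbr), 1)
    match = _ANNUAL.match(text)
    if match:
        return date(int(match.group(1)), 1, 1)
    raise ValueError(f"Unrecognized date label: {raw!r}")


print(parse_bcrp_date("Ene05"))    # 2005-01-01
print(parse_bcrp_date("31Ene05"))  # 2005-01-31
print(parse_bcrp_date("T113"))     # 2013-01-01
```

Mapping every label to the first day of its period gives all four frequencies a common date type, which pandas.to_datetime can then convert into a datetime index.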
Features
- Seamless data retrieval across different time frequencies
- Automatic conversion of Spanish date formats to standard datetime
- Parallel processing capabilities
- Built-in caching mechanism
- Flexible data processing
from pyPeruStats import BCRPDataProcessor
# Define series codes
diarios = ["PD38032DD", "PD04699XD"]
mensuales = ["RD38085BM", "RD38307BM"]
trimestrales = ["PD37940PQ", "PN38975BQ"]
anuales = [
    "PM06069MA",
    "PM06078MA",
    "PM06101MA",
    "PM06088MA",
    "PM06087MA",
    "PM06086MA",
    "PM06085MA",
    "PM06084MA",
    "PM06083MA",
    "PM06082MA",
    "PM06081MA",
    "PM06070MA",
]
# Combine all frequencies
all_freq = diarios + mensuales + trimestrales + anuales
# Initialize processor
processor = BCRPDataProcessor(
all_freq,
start_date="2002-01-02",
end_date="2023-01-01",
parallel=True
)
# Process data
data = processor.process_data(save_sqlite=True)
# Access DataFrames by frequency
anuales_df = data.get("A")
trimestrales_df = data.get("Q")
mensuales_df = data.get("M")
diarios_df = data.get("D")
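In the example codes above, the last letter matches the frequency key used to retrieve each DataFrame ("D", "M", "Q", "A"). Assuming that pattern holds in general, which is an inference from these examples rather than documented behavior, a small helper can clean a mixed list of codes and preview how they would be grouped. The name group_by_frequency is hypothetical and not part of PyPeruStats.

```python
from collections import defaultdict


def group_by_frequency(codes):
    """Group series codes by their final letter (assumed to encode the frequency).

    Surrounding whitespace is stripped and codes are upper-cased, so a stray
    space in a code does not end up in the request.
    """
    groups = defaultdict(list)
    for code in codes:
        clean = code.strip().upper()
        if not clean:
            continue  # skip empty entries
        groups[clean[-1]].append(clean)
    return dict(groups)


groups = group_by_frequency(["PD38032DD", "RD38085BM", "PD37940PQ", " PM06088MA"])
print(groups)
# {'D': ['PD38032DD'], 'M': ['RD38085BM'], 'Q': ['PD37940PQ'], 'A': ['PM06088MA']}
```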
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Apache 2.0
Contact
TODO
- BCRP
- Download statistical data from BCRP
- Implement advanced data search functionality
- Create autoplot functionality (inspired by ggplot)
- Set up GitHub repository and backup mechanism
- Add comprehensive documentation
- Create example notebooks