Tools to download and process public datasets from Peru (INEI & BCRP).
PyPeruStats
Allows downloading data from various data sources in Peru.
Sources: INEI, BCRP
Installation
pip install perustats
INEI Microdata
Overview
The MicrodatosINEIFetcher class provides a comprehensive tool for downloading and organizing INEI (Instituto Nacional de Estadística e Informática) microdata from Peruvian surveys.
Key Features
- Fetch available modules across multiple years
- Download data files in multiple formats (Stata, SPSS, CSV, DBF)
- Parallel downloading with configurable workers
- Automatic ZIP extraction and validation
- Flexible file organization by module or year
- Smart documentation handling with duplicate detection
- Hash-based deduplication for PDF files
Parameters
MicrodatosINEIFetcher
- survey: Survey type to fetch
  - 'enaho': National Household Survey
  - 'endes': Demographic and Family Health Survey
  - 'enapres': National Budget Programs Survey
  - Available up to 2024
- years: List of years to fetch data for (e.g., [2020, 2021, 2022])
- master_directory: Root directory for storing downloaded data (default: './microdatos_inei')
- parallel_jobs: Number of parallel download jobs (default: 2)
download_zips()
- formats: List of formats to download (default: ['stata', 'spss', 'csv'])
  - 'stata': Stata .dta files
  - 'spss': SPSS .sav files
  - 'csv': CSV files
  - 'dbf': dBASE files
- force_download: Force re-download even if file exists (default: True)
- module_codes: List of specific module codes to download (empty list = download all)
- remove_zip_after_extract: Delete ZIP files after extraction (default: False)
organize_files()
- organize_by: Organization scheme
  - 'module': Structure by module (e.g., 001_module_name/2020_file.csv)
  - 'year': Structure by year (e.g., 2020/001_file.csv)
- keep_original_names: Keep original filenames (default: True)
- operation: File operation type
  - 'copy': Copy files (preserves originals)
  - 'move': Move files (removes from unzipped directory)
- docs_by_hash: Use hash-based deduplication for documentation files (default: True)
  - When True, identical PDFs are stored only once regardless of filename
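The idea behind docs_by_hash can be illustrated with a minimal sketch. This is not the library's implementation: the helper names file_digest and dedupe_by_hash and the choice of SHA-256 are assumptions made for illustration. The returned mapping loosely mirrors the "canonical filename to aliases" shape described for documentation_map.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_by_hash(paths):
    """Group files by content hash (hypothetical helper, not part of perustats).

    Returns a dict mapping the first-seen (canonical) filename to the
    names of later files whose bytes are identical.
    """
    canonical_by_digest = {}
    aliases = {}
    for path in paths:
        name = Path(path).name
        digest = file_digest(path)
        if digest in canonical_by_digest:
            # Same content already seen: record this name as an alias only.
            aliases[canonical_by_digest[digest]].append(name)
        else:
            canonical_by_digest[digest] = name
            aliases[name] = []
    return aliases
```

Comparing content digests rather than filenames is what lets the same manual, published under a different name each year, be stored once.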
Usage Examples
Example 1: ENDES Survey (Multiple Years and Modules)
from perustats import MicrodatosINEIFetcher
import time
# Initialize ENDES fetcher for years 1990-2023
start = time.time()
endes = MicrodatosINEIFetcher(
    survey="endes",
    years=list(range(1990, 2024)),
    master_directory="./datos_inei",
    parallel_jobs=6
)
# Fetch available modules
endes.fetch_modules()
# Download specific modules in multiple formats
endes.download_zips(
    formats=["spss", "dbf", "stata", "csv"],
    module_codes=[64, 65, 73, 74],  # Specific modules
    force_download=True,
    remove_zip_after_extract=False
)
# Organize files by year
endes.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy"
)
# Also organize by module
endes.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)
Example 2: ENAHO Survey (National Household Survey)
# Initialize ENAHO fetcher for years 2000-2024
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=list(range(2000, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=4
)
# Fetch available modules
enaho.fetch_modules()
# Display available modules (first 5)
print(enaho.modules_dataframe.head())
# Download specific modules
enaho.download_zips(
    formats=["csv", "stata", "spss", "dbf"],
    module_codes=[1, 13, 22, 34],
    force_download=False
)
# Organize by year with original names
enaho.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy",
    docs_by_hash=True  # Deduplicate documentation by content hash
)
# Also organize by module
enaho.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)
Example 3: ENAPRES Survey (Budget Programs)
# Initialize ENAPRES fetcher
enapres = MicrodatosINEIFetcher(
    survey="enapres",
    years=list(range(2010, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=3
)
# Fetch, download, and organize in one chain
enapres.fetch_modules().download_zips(
    formats=["stata", "csv"],
    module_codes=[101, 102, 111]
).organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="move"  # Move instead of copy to save space
)
Example 4: Inspect Available Modules
# Fetch modules without downloading
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=[2020, 2021, 2022, 2023]
)
enaho.fetch_modules()
# View available modules
print(enaho.modules_dataframe[['year', 'module_code', 'module_name']])
Directory Structure
The fetcher creates the following directory structure:
./microdatos_inei/
└── enaho/                                  # Survey name
    ├── 0_zips/                             # Downloaded ZIP files
    │   ├── 2020_mod_001.zip
    │   └── 2020_mod_002.zip
    ├── 1_unzipped/                         # Extracted contents
    │   ├── 2020_mod_001/
    │   └── 2020_mod_002/
    └── 2_organized/                        # Organized files
        ├── by_module/                      # When organize_by='module'
        │   ├── 001_vivienda_hogar/
        │   │   ├── 2020_file.csv
        │   │   └── 2021_file.csv
        │   └── 002_caracteristicas_miembros/
        │       ├── 2020_file.csv
        │       └── 2021_file.csv
        ├── by_year/                        # When organize_by='year'
        │   ├── 2020/
        │   │   ├── 001_file.csv
        │   │   └── 002_file.csv
        │   └── 2021/
        │       ├── 001_file.csv
        │       └── 002_file.csv
        └── documentation/                  # PDF documentation (deduplicated)
            ├── 2020_mod_001_manual.pdf
            └── 2020_mod_002_manual.pdf
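To make the two organization schemes concrete, the following sketch builds a target path that follows the layout shown above. The function organized_path and its parameters are hypothetical and are not part of the perustats API; only the directory and file naming pattern is taken from the tree.

```python
from pathlib import Path


def organized_path(root, survey, organize_by, year, module_code, module_name, filename):
    """Build a target path following the documented layout (hypothetical helper)."""
    base = Path(root) / survey / "2_organized"
    code = f"{int(module_code):03d}"  # zero-padded, e.g. 1 -> "001"
    if organize_by == "module":
        # by_module/<code>_<module_name>/<year>_<filename>
        return base / "by_module" / f"{code}_{module_name}" / f"{year}_{filename}"
    if organize_by == "year":
        # by_year/<year>/<code>_<filename>
        return base / "by_year" / str(year) / f"{code}_{filename}"
    raise ValueError(f"organize_by must be 'module' or 'year', got {organize_by!r}")
```

For instance, organized_path("./microdatos_inei", "enaho", "year", 2020, 1, "vivienda_hogar", "file.csv") resolves to the by_year/2020/001_file.csv entry in the tree.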
Important Attributes
- modules_dataframe: DataFrame containing all available modules for the specified years
- documentation_map: Dictionary mapping canonical PDF filenames to their aliases (useful for tracking deduplicated files)
- zip_maps: List of tuples containing (zip_path, extract_path) for all processed files
Best Practices
- Start with fewer years: Test with 2-3 years before downloading extensive ranges
- Use parallel jobs wisely: Higher values speed up downloads but consume more bandwidth
- Keep ZIP files initially: Set remove_zip_after_extract=False for backup purposes
- Hash-based deduplication: Enable docs_by_hash=True to avoid duplicate documentation files
- Check disk space: Large surveys across many years can consume significant storage
- Use method chaining: The fluent API allows chaining fetch_modules(), download_zips(), and organize_files()
Performance Tips
- Use parallel_jobs=4 or higher for faster downloads (adjust based on your connection)
  - With 100 Mbps connection: ~134 ZIP files download in 1.20-1.30 seconds
- Use operation='move' instead of 'copy' to save disk space
- Filter by module_codes to download only needed modules
- Enable remove_zip_after_extract=True if storage is limited (after verifying extraction)
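The effect of parallel_jobs can be pictured as a bounded pool of worker threads. The sketch below is an illustration using the standard library's concurrent.futures, not the library's own code. The name download_all is hypothetical, and the caller supplies fetch, a stand-in for whatever function actually downloads one URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, parallel_jobs=2):
    """Run fetch(url) for every URL with at most parallel_jobs threads.

    Returns (results, errors): results maps url -> fetch's return value,
    errors maps url -> the exception raised while fetching that url.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed download should not stop the rest
                errors[url] = exc
    return results, errors
```

Threads suit this workload because each job spends most of its time waiting on the network. Raising the worker count helps only until bandwidth or the remote server becomes the bottleneck.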
Notes
- ZIP file integrity is automatically validated; corrupted files are re-downloaded
- Documentation files are deduplicated by content hash when docs_by_hash=True
- The class handles network failures gracefully with automatic retries
- All file operations preserve original data integrity
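As a rough illustration of the first note, ZIP validation followed by re-download can be sketched with the standard zipfile module. The helpers is_valid_zip and fetch_until_valid, the download callable, and the max_attempts limit are all assumptions for this example; the library's actual retry policy is not documented here.

```python
import zipfile


def is_valid_zip(path):
    """Return True if path is a readable ZIP whose members pass their CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first corrupt member, or None.
            return zf.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False


def fetch_until_valid(download, path, max_attempts=3):
    """Call download(path) until the file validates or attempts run out.

    download is a caller-supplied function that writes the file to path
    (hypothetical; stands in for the real HTTP download).
    """
    for _ in range(max_attempts):
        download(path)
        if is_valid_zip(path):
            return True
    return False
```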
BCRP
Current Issues with the Source Data
- Inconsistent Data Formats Across Frequencies
  - Spanish Month Abbreviations: For example, "Ene05" (January 2005 in Spanish format).
  - Complex Date Strings: For example, "31Ene05" combines day, month (abbreviated in Spanish), and year, requiring parsing.
  - Quarterly Indicators: For example, "T113" indicates the 1st quarter of 2013 and needs transformation to a standard format.
- Additional Steps Required for Proper DataFrame Conversion
  - Converting non-standard date strings to a format recognized by pandas or similar libraries.
  - Harmonizing date formats across daily, monthly, quarterly, and annual frequencies.
- Slow Response Time from the BCRP UI
  - The platform often experiences delays when fetching data, impacting the efficiency of workflows.
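The date formats listed above can be normalized with a small parser. The following is a minimal sketch, not the code used by BCRPDataProcessor. The function name parse_bcrp_date, the two-digit-year pivot of 50, the choice to return the first day of the period, and accepting both "Sep" and "Set" for September are all assumptions made for illustration.

```python
import re
from datetime import date

# Spanish three-letter month abbreviations (both "sep" and "set" accepted; assumption).
SPANISH_MONTHS = {
    "ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
    "jul": 7, "ago": 8, "sep": 9, "set": 9, "oct": 10, "nov": 11, "dic": 12,
}

_QUARTERLY = re.compile(r"^T([1-4])(\d{2})$")          # e.g. "T113"
_DAILY = re.compile(r"^(\d{1,2})([A-Za-z]{3})(\d{2})$")  # e.g. "31Ene05"
_MONTHLY = re.compile(r"^([A-Za-z]{3})(\d{2})$")        # e.g. "Ene05"


def _full_year(two_digits, pivot=50):
    """Expand a two-digit year; values below pivot are read as 20xx (assumption)."""
    year = int(two_digits)
    return 2000 + year if year < pivot else 1900 + year


def _month(abbrev):
    """Look up a Spanish month abbreviation, case-insensitively."""
    try:
        return SPANISH_MONTHS[abbrev.lower()]
    except KeyError:
        raise ValueError(f"unknown month abbreviation: {abbrev!r}") from None


def parse_bcrp_date(label):
    """Convert a BCRP-style label to a date (first day of the period)."""
    label = label.strip()
    m = _QUARTERLY.match(label)
    if m:
        quarter = int(m.group(1))
        return date(_full_year(m.group(2)), 3 * (quarter - 1) + 1, 1)
    m = _DAILY.match(label)
    if m:
        return date(_full_year(m.group(3)), _month(m.group(2)), int(m.group(1)))
    m = _MONTHLY.match(label)
    if m:
        return date(_full_year(m.group(2)), _month(m.group(1)), 1)
    raise ValueError(f"unrecognized date label: {label!r}")
```

The resulting date objects can be passed to pandas.to_datetime, which gives every frequency a common datetime representation.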
Features
- Seamless data retrieval across different time frequencies
- Automatic conversion of Spanish date formats to standard datetime
- Parallel processing capabilities
- Built-in caching mechanism
- Flexible data processing
from perustats import BCRPDataProcessor
# Define series codes
diarios = ["PD38032DD", "PD04699XD"]
mensuales = ["RD38085BM", "RD38307BM"]
trimestrales = ["PD37940PQ", "PN38975BQ"]
anuales = [
    "PM06069MA",
    "PM06078MA",
    "PM06101MA",
    "PM06088MA",
    "PM06087MA",
    "PM06086MA",
    "PM06085MA",
    "PM06084MA",
    "PM06083MA",
    "PM06082MA",
    "PM06081MA",
    "PM06070MA",
]
# Combine all frequencies
all_freq = diarios + mensuales + trimestrales + anuales
# Initialize processor
processor = BCRPDataProcessor(
    all_freq,
    start_date="2002-01-02",
    end_date="2023-01-01",
    parallel=True
)
# Process data
data = processor.process_data(save_sqlite=True)
# Access DataFrames by frequency
anuales_df = data.get("A")
trimestrales_df = data.get("Q")
mensuales_df = data.get("M")
diarios_df = data.get("D")
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Apache 2.0