Skip to main content

Tools to download and process public datasets from Peru (INEI & BCRP).

Project description

PyPeruStats

Allows downloading data from various data sources in Peru.

Sources: INEI, BCRP

Installation

pip install perustats

INEI Microdata

Overview

The MicrodatosINEIFetcher class provides a comprehensive tool for downloading and organizing INEI (Instituto Nacional de Estadística e Informática) microdata from Peruvian surveys.

Key Features

  • Fetch available modules across multiple years
  • Download data files in multiple formats (Stata, SPSS, CSV, DBF)
  • Parallel downloading with configurable workers
  • Automatic ZIP extraction and validation
  • Flexible file organization by module or year
  • Smart documentation handling with duplicate detection
  • Hash-based deduplication for PDF files

Parameters

MicrodatosINEIFetcher

  • survey: Survey type to fetch

    • 'enaho': National Household Survey
    • 'endes': Demographic and Family Health Survey
    • 'enapres': National Budget Programs Survey
    • Available up to 2024
  • years: List of years to fetch data for (e.g., [2020, 2021, 2022])

  • master_directory: Root directory for storing downloaded data (default: './microdatos_inei')

  • parallel_jobs: Number of parallel download jobs (default: 2)

download_zips()

  • formats: List of formats to download (default: ['stata', 'spss', 'csv'])

    • 'stata': Stata .dta files
    • 'spss': SPSS .sav files
    • 'csv': CSV files
    • 'dbf': dBASE files
  • force_download: Force re-download even if file exists (default: True)

  • module_codes: List of specific module codes to download (empty list = download all)

  • remove_zip_after_extract: Delete ZIP files after extraction (default: False)

organize_files()

  • organize_by: Organization scheme

    • 'module': Structure by module (e.g., 001_module_name/2020_file.csv)
    • 'year': Structure by year (e.g., 2020/001_file.csv)
  • keep_original_names: Keep original filenames (default: True)

  • operation: File operation type

    • 'copy': Copy files (preserves originals)
    • 'move': Move files (removes from unzipped directory)
  • docs_by_hash: Use hash-based deduplication for documentation files (default: True)

    • When True, identical PDFs are stored only once regardless of filename

Usage Examples

Example 1: ENDES Survey (Multiple Years and Modules)

from perustats import MicrodatosINEIFetcher
import time

# Initialize ENDES fetcher for years 1990-2023
start = time.time()
endes = MicrodatosINEIFetcher(
    survey="endes",
    years=list(range(1990, 2024)),
    master_directory="./datos_inei",
    parallel_jobs=6
)

# Fetch available modules
endes.fetch_modules()

# Download specific modules in multiple formats
endes.download_zips(
    formats=["spss", "dbf", "stata", "csv"],
    module_codes=[64, 65, 73, 74],  # Specific modules
    force_download=True,
    remove_zip_after_extract=False
)

# Organize files by year
endes.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy"
)

# Also organize by module
endes.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)

Example 2: ENAHO Survey (National Household Survey)

# Initialize ENAHO fetcher for years 2000-2024
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=list(range(2000, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=4
)

# Fetch available modules
enaho.fetch_modules()

# Display available modules (first 5)
print(enaho.modules_dataframe.head())

# Download specific modules
enaho.download_zips(
    formats=["csv", "stata", "spss", "dbf"],
    module_codes=[1, 13, 22, 34],
    force_download=False
)

# Organize by year with original names
enaho.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy",
    docs_by_hash=True  # Deduplicate documentation by content hash
)

# Also organize by module
enaho.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)

Example 3: ENAPRES Survey (Budget Programs)

# Initialize ENAPRES fetcher
enapres = MicrodatosINEIFetcher(
    survey="enapres",
    years=list(range(2010, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=3
)

# Fetch and download in one chain
enapres.fetch_modules().download_zips(
    formats=["stata", "csv"],
    module_codes=[101, 102, 111]
).organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="move"  # Move instead of copy to save space
)

Example 4: Inspect Available Modules

# Fetch modules without downloading
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=[2020, 2021, 2022, 2023]
)

enaho.fetch_modules()

# View available modules
print(enaho.modules_dataframe[['year', 'module_code', 'module_name']])

Directory Structure

The fetcher creates the following directory structure:

./microdatos_inei/
└── enaho/                          # Survey name
    ├── 0_zips/                     # Downloaded ZIP files
    │   ├── 2020_mod_001.zip
    │   └── 2020_mod_002.zip
    ├── 1_unzipped/                 # Extracted contents
    │   ├── 2020_mod_001/
    │   └── 2020_mod_002/
    └── 2_organized/                # Organized files
        ├── by_module/              # When organize_by='module'
        │   ├── 001_vivienda_hogar/
        │   │   ├── 2020_file.csv
        │   │   └── 2021_file.csv
        │   └── 002_caracteristicas_miembros/
        │       ├── 2020_file.csv
        │       └── 2021_file.csv
        ├── by_year/                # When organize_by='year'
        │   ├── 2020/
        │   │   ├── 001_file.csv
        │   │   └── 002_file.csv
        │   └── 2021/
        │       ├── 001_file.csv
        │       └── 002_file.csv
        └── documentation/          # PDF documentation (deduplicated)
            ├── 2020_mod_001_manual.pdf
            └── 2020_mod_002_manual.pdf

Important Attributes

  • modules_dataframe: DataFrame containing all available modules for the specified years
  • documentation_map: Dictionary mapping canonical PDF filenames to their aliases (useful for tracking deduplicated files)
  • zip_maps: List of tuples containing (zip_path, extract_path) for all processed files

Best Practices

  1. Start with fewer years: Test with 2-3 years before downloading extensive ranges
  2. Use parallel jobs wisely: Higher values speed up downloads but consume more bandwidth
  3. Keep ZIP files initially: Set remove_zip_after_extract=False for backup purposes
  4. Hash-based deduplication: Enable docs_by_hash=True to avoid duplicate documentation files
  5. Check disk space: Large surveys across many years can consume significant storage
  6. Use method chaining: The fluent API allows chaining fetch_modules(), download_zips(), and organize_files()

Performance Tips

  • Use parallel_jobs=4 or higher for faster downloads (adjust based on your connection)
    • With 100 Mbps connection: ~134 ZIP files download in 1.20-1.30 seconds
  • Use operation='move' instead of 'copy' to save disk space
  • Filter by module_codes to download only needed modules
  • Enable remove_zip_after_extract=True if storage is limited (after verifying extraction)

Notes

  • ZIP file integrity is automatically validated; corrupted files are re-downloaded
  • Documentation files are deduplicated by content hash when docs_by_hash=True
  • The class handles network failures gracefully with automatic retries
  • All file operations preserve original data integrity

BCRP

Current Issues with the Source Data

  1. Inconsistent Data Formats Across Frequencies

    • Spanish Month Abbreviations
      For example: "Ene05" (January 2005 in Spanish format).
    • Complex Date Strings
      Example: "31Ene05" combines day, month (abbreviated in Spanish), and year, requiring parsing.
    • Quarterly Indicators
      Example: "T113" indicates the 1st quarter of 2013 and needs transformation to a standard format.
  2. Additional Steps Required for Proper DataFrame Conversion

    • Converting non-standard date strings to a format recognized by pandas or similar libraries.
    • Harmonizing date formats across daily, monthly, quarterly, and annual frequencies.
  3. Slow Response Time from the BCRP UI

    • The platform often experiences delays when fetching data, impacting the efficiency of workflows.

Features

  • Seamless data retrieval across different time frequencies
  • Automatic conversion of Spanish date formats to standard datetime
  • Parallel processing capabilities
  • Built-in caching mechanism
  • Flexible data processing
from pyPeruStats import BCRPDataProcessor

# Define series codes
diarios = ["PD38032DD", "PD04699XD"]
mensuales = ["RD38085BM", "RD38307BM"]
trimestrales = ["PD37940PQ", "PN38975BQ"]
anuales = [
    "PM06069MA",
    "PM06078MA",
    "PM06101MA",
    "	PM06088MA",
    "PM06087MA",
    "	PM06086MA",
    "	PM06085MA",
    "	PM06084MA",
    "	PM06083MA",
    "	PM06082MA",
    "	PM06081MA",
    "	PM06070MA",
]

# Combine all frequencies
all_freq = diarios + mensuales + trimestrales + anuales

# Initialize processor
processor = BCRPDataProcessor(
    all_freq, 
    start_date="2002-01-02", 
    end_date="2023-01-01", 
    parallel=True
)

# Process data
data = processor.process_data(save_sqlite=True)

# Access DataFrames by frequency
anuales_df = data.get("A")
trimestrales_df = data.get("Q")
mensuales_df = data.get("M")
diarios_df = data.get("D")

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache 2.0

Contact

fr.jhonk@gmail.com

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

perustats-0.2.22.tar.gz (57.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

perustats-0.2.22-py3-none-any.whl (70.9 kB view details)

Uploaded Python 3

File details

Details for the file perustats-0.2.22.tar.gz.

File metadata

  • Download URL: perustats-0.2.22.tar.gz
  • Upload date:
  • Size: 57.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for perustats-0.2.22.tar.gz
Algorithm Hash digest
SHA256 32e05abe5eea0ab5fb240de157cd045b92be6467a6ddd79ccf5c2a9b93c175ce
MD5 3e03f8ce1df9ada7f362b4a5ba8f6978
BLAKE2b-256 643b16a5af31eadb625d288859a1d26e7f394032d0229769cac8d380ae5edd89

See more details on using hashes here.

File details

Details for the file perustats-0.2.22-py3-none-any.whl.

File metadata

  • Download URL: perustats-0.2.22-py3-none-any.whl
  • Upload date:
  • Size: 70.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for perustats-0.2.22-py3-none-any.whl
Algorithm Hash digest
SHA256 90dc02ab5a71b35118b5eb3520b437bf338010c4d086ccefb6197c2fca4fe470
MD5 50f0a135748beb027389d12d68e60af1
BLAKE2b-256 6234adfeb0c50eeb9742731e63d715fa66b43a52896d82ae3783069ee4613857

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page