
Tools to download and process public datasets from Peru (INEI & BCRP).

PyPeruStats

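PyPeruStats downloads and processes public datasets from official Peruvian sources.

Sources: INEI (Instituto Nacional de Estadística e Informática) and BCRP (Banco Central de Reserva del Perú)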

Installation

pip install perustats

INEI Microdata

Overview

The MicrodatosINEIFetcher class provides a comprehensive tool for downloading and organizing INEI (Instituto Nacional de Estadística e Informática) microdata from Peruvian surveys.

Key Features

  • Fetch available modules across multiple years
  • Download data files in multiple formats (Stata, SPSS, CSV, DBF)
  • Parallel downloading with configurable workers
  • Automatic ZIP extraction and validation
  • Flexible file organization by module or year
  • Smart documentation handling with duplicate detection
  • Hash-based deduplication for PDF files

Parameters

MicrodatosINEIFetcher

  • survey: Survey type to fetch

    • 'enaho': National Household Survey
    • 'endes': Demographic and Family Health Survey
    • 'enapres': National Budget Programs Survey
    • Note: data is available up to 2024
  • years: List of years to fetch data for (e.g., [2020, 2021, 2022])

  • master_directory: Root directory for storing downloaded data (default: './microdatos_inei')

  • parallel_jobs: Number of parallel download jobs (default: 2)

download_zips()

  • formats: List of formats to download (default: ['stata', 'spss', 'csv'])

    • 'stata': Stata .dta files
    • 'spss': SPSS .sav files
    • 'csv': CSV files
    • 'dbf': dBASE files
  • force_download: Force re-download even if file exists (default: True)

  • module_codes: List of specific module codes to download (empty list = download all)

  • remove_zip_after_extract: Delete ZIP files after extraction (default: False)

organize_files()

  • organize_by: Organization scheme

    • 'module': Structure by module (e.g., 001_module_name/2020_file.csv)
    • 'year': Structure by year (e.g., 2020/001_file.csv)
  • keep_original_names: Keep original filenames (default: True)

  • operation: File operation type

    • 'copy': Copy files (preserves originals)
    • 'move': Move files (removes from unzipped directory)
  • docs_by_hash: Use hash-based deduplication for documentation files (default: True)

    • When True, identical PDFs are stored only once regardless of filename
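
As a rough illustration of what content-hash deduplication means, the sketch below groups files by their SHA-256 digest using only the standard library. It is not the library's implementation: the helper names `file_sha256` and `group_by_hash`, the choice of SHA-256, and the shape of the returned mapping are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths):
    """Map each canonical file name to the names of its byte-identical copies.

    Hypothetical helper: the first file seen (in sorted order) for a given
    digest is treated as canonical, later ones are recorded as aliases.
    """
    canonical_by_digest = {}
    aliases = {}
    for path in sorted(Path(p) for p in paths):
        digest = file_sha256(path)
        if digest in canonical_by_digest:
            aliases[canonical_by_digest[digest]].append(path.name)
        else:
            canonical_by_digest[digest] = path.name
            aliases[path.name] = []
    return aliases
```

Two PDFs with different names but identical bytes end up under a single canonical entry, so only one copy would need to be stored.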

Usage Examples

Example 1: ENDES Survey (Multiple Years and Modules)

import time

from perustats import MicrodatosINEIFetcher

# Start a timer to measure the full run
start = time.time()

# Initialize ENDES fetcher for years 1990-2023
endes = MicrodatosINEIFetcher(
    survey="endes",
    years=list(range(1990, 2024)),
    master_directory="./datos_inei",
    parallel_jobs=6
)

# Fetch available modules
endes.fetch_modules()

# Download specific modules in multiple formats
endes.download_zips(
    formats=["spss", "dbf", "stata", "csv"],
    module_codes=[64, 65, 73, 74],  # Specific modules
    force_download=True,
    remove_zip_after_extract=False
)

# Organize files by year
endes.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy"
)

# Also organize by module
endes.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)

# Report how long the whole run took
print(f"Elapsed: {time.time() - start:.1f} s")

Example 2: ENAHO Survey (National Household Survey)

from perustats import MicrodatosINEIFetcher

# Initialize ENAHO fetcher for years 2000-2024
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=list(range(2000, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=4
)

# Fetch available modules
enaho.fetch_modules()

# Display available modules (first 5)
print(enaho.modules_dataframe.head())

# Download specific modules
enaho.download_zips(
    formats=["csv", "stata", "spss", "dbf"],
    module_codes=[1, 13, 22, 34],
    force_download=False
)

# Organize by year with original names
enaho.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy",
    docs_by_hash=True  # Deduplicate documentation by content hash
)

# Also organize by module
enaho.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)

Example 3: ENAPRES Survey (Budget Programs)

from perustats import MicrodatosINEIFetcher

# Initialize ENAPRES fetcher for years 2010-2024
enapres = MicrodatosINEIFetcher(
    survey="enapres",
    years=list(range(2010, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=3
)

# Fetch, download, and organize in one chain
enapres.fetch_modules().download_zips(
    formats=["stata", "csv"],
    module_codes=[101, 102, 111]
).organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="move"  # Move instead of copy to save space
)

Example 4: Inspect Available Modules

from perustats import MicrodatosINEIFetcher

# Fetch modules without downloading
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=[2020, 2021, 2022, 2023]
)

enaho.fetch_modules()

# View available modules
print(enaho.modules_dataframe[['year', 'module_code', 'module_name']])

Directory Structure

The fetcher creates the following directory structure:

./microdatos_inei/
└── enaho/                          # Survey name
    ├── 0_zips/                     # Downloaded ZIP files
    │   ├── 2020_mod_001.zip
    │   └── 2020_mod_002.zip
    ├── 1_unzipped/                 # Extracted contents
    │   ├── 2020_mod_001/
    │   └── 2020_mod_002/
    └── 2_organized/                # Organized files
        ├── by_module/              # When organize_by='module'
        │   ├── 001_vivienda_hogar/
        │   │   ├── 2020_file.csv
        │   │   └── 2021_file.csv
        │   └── 002_caracteristicas_miembros/
        │       ├── 2020_file.csv
        │       └── 2021_file.csv
        ├── by_year/                # When organize_by='year'
        │   ├── 2020/
        │   │   ├── 001_file.csv
        │   │   └── 002_file.csv
        │   └── 2021/
        │       ├── 001_file.csv
        │       └── 002_file.csv
        └── documentation/          # PDF documentation (deduplicated)
            ├── 2020_mod_001_manual.pdf
            └── 2020_mod_002_manual.pdf
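
The two layouts in the tree above can be summarised as a small path-building function. This is only a sketch inferred from the tree: `organized_path` is a hypothetical helper, and the exact naming rules the library applies (zero-padding, separators, module names) are assumptions.

```python
from pathlib import Path


def organized_path(root, survey, organize_by, year, module_code, module_name, filename):
    """Build a destination path that mirrors the tree above (illustrative only)."""
    base = Path(root) / survey / "2_organized"
    code = f"{int(module_code):03d}"  # zero-padded module code, e.g. 1 -> "001"
    if organize_by == "module":
        # One folder per module, files prefixed with the year
        return base / "by_module" / f"{code}_{module_name}" / f"{year}_{filename}"
    if organize_by == "year":
        # One folder per year, files prefixed with the module code
        return base / "by_year" / str(year) / f"{code}_{filename}"
    raise ValueError(f"organize_by must be 'module' or 'year', got {organize_by!r}")
```

For example, `organized_path("./microdatos_inei", "enaho", "year", 2020, 1, "vivienda_hogar", "file.csv")` points at `by_year/2020/001_file.csv` under the survey folder.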

Important Attributes

  • modules_dataframe: DataFrame containing all available modules for the specified years
  • documentation_map: Dictionary mapping canonical PDF filenames to their aliases (useful for tracking deduplicated files)
  • zip_maps: List of tuples containing (zip_path, extract_path) for all processed files

Best Practices

  1. Start with fewer years: Test with 2-3 years before downloading extensive ranges
  2. Use parallel jobs wisely: Higher values speed up downloads but consume more bandwidth
  3. Keep ZIP files initially: Set remove_zip_after_extract=False for backup purposes
  4. Hash-based deduplication: Enable docs_by_hash=True to avoid duplicate documentation files
  5. Check disk space: Large surveys across many years can consume significant storage
  6. Use method chaining: The fluent API allows chaining fetch_modules(), download_zips(), and organize_files()

Performance Tips

  • Use parallel_jobs=4 or higher for faster downloads (adjust based on your connection)
    • With 100 Mbps connection: ~134 ZIP files download in 1.20-1.30 seconds
  • Use operation='move' instead of 'copy' to save disk space
  • Filter by module_codes to download only needed modules
  • Enable remove_zip_after_extract=True if storage is limited (after verifying extraction)
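
To make the effect of parallel_jobs more concrete, here is a minimal sketch of a bounded worker pool using the standard library. It does not reproduce the library's internals: `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch_one, parallel_jobs=4):
    """Run fetch_one(url) for every URL using at most parallel_jobs threads.

    Returns a dict of url -> result; a failed download stores the exception
    instead of aborting the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                results[url] = exc
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP download, so the sketch runs offline."""
    if "bad" in url:
        raise IOError(f"could not fetch {url}")
    return len(url)
```

Raising the worker count only helps while the network, not the remote server or the local disk, is the bottleneck, which is why the value is worth tuning per connection.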

Notes

  • ZIP file integrity is automatically validated; corrupted files are re-downloaded
  • Documentation files are deduplicated by content hash when docs_by_hash=True
  • The class handles network failures gracefully with automatic retries
  • All file operations preserve original data integrity
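
For readers who want to check an archive by hand, the following sketch shows one common way to validate a ZIP file with Python's zipfile module. Whether the library performs exactly these checks is not stated here; `is_valid_zip` is a hypothetical helper.

```python
import zipfile


def is_valid_zip(path):
    """Return True if path is a readable ZIP whose members all pass their CRC check."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False
```

A file that fails this check is a reasonable candidate for deletion and re-download.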

BCRP

Current Issues with the Source Data

  1. Inconsistent Data Formats Across Frequencies

    • Spanish Month Abbreviations
      For example: "Ene05" (January 2005 in Spanish format).
    • Complex Date Strings
      Example: "31Ene05" combines day, month (abbreviated in Spanish), and year, requiring parsing.
    • Quarterly Indicators
      Example: "T113" indicates the 1st quarter of 2013 and needs transformation to a standard format.
  2. Additional Steps Required for Proper DataFrame Conversion

    • Converting non-standard date strings to a format recognized by pandas or similar libraries.
    • Harmonizing date formats across daily, monthly, quarterly, and annual frequencies.
  3. Slow Response Time from the BCRP UI

    • The platform often experiences delays when fetching data, impacting the efficiency of workflows.
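
The date formats listed above can be normalised with a few regular expressions. The sketch below is an illustration under stated assumptions, not the library's parser: `parse_bcrp_period` is a hypothetical name, two-digit years are expanded with an arbitrary pivot of 50, both "Sep" and "Set" are accepted for September, and every non-daily label maps to the first day of its period.

```python
import re
from datetime import date

MONTHS_ES = {
    "ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
    "jul": 7, "ago": 8, "sep": 9, "set": 9, "oct": 10, "nov": 11, "dic": 12,
}


def _full_year(two_digits, pivot=50):
    """Expand a two-digit year; the pivot is an assumption of this sketch."""
    yy = int(two_digits)
    return 2000 + yy if yy < pivot else 1900 + yy


def parse_bcrp_period(label, pivot=50):
    """Convert labels such as '31Ene05', 'Ene05', 'T113' or '2005' to a date.

    Monthly, quarterly and annual labels map to the first day of the period.
    """
    text = label.strip()
    daily = re.fullmatch(r"(\d{1,2})([A-Za-z]{3})(\d{2})", text)
    if daily:
        day, month, year = daily.groups()
        return date(_full_year(year, pivot), MONTHS_ES[month.lower()], int(day))
    monthly = re.fullmatch(r"([A-Za-z]{3})(\d{2})", text)
    if monthly:
        month, year = monthly.groups()
        return date(_full_year(year, pivot), MONTHS_ES[month.lower()], 1)
    quarterly = re.fullmatch(r"T([1-4])(\d{2})", text)
    if quarterly:
        quarter, year = quarterly.groups()
        return date(_full_year(year, pivot), 3 * (int(quarter) - 1) + 1, 1)
    if re.fullmatch(r"\d{4}", text):
        return date(int(text), 1, 1)
    raise ValueError(f"unrecognized period label: {label!r}")
```

The resulting date objects can be passed to pandas or any other library that expects standard datetimes.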

Features

  • Seamless data retrieval across different time frequencies
  • Automatic conversion of Spanish date formats to standard datetime
  • Parallel processing capabilities
  • Built-in caching mechanism
  • Flexible data processing

Usage Example

from perustats import BCRPDataProcessor

# Define series codes
diarios = ["PD38032DD", "PD04699XD"]
mensuales = ["RD38085BM", "RD38307BM"]
trimestrales = ["PD37940PQ", "PN38975BQ"]
anuales = [
    "PM06069MA",
    "PM06078MA",
    "PM06101MA",
    "PM06088MA",
    "PM06087MA",
    "PM06086MA",
    "PM06085MA",
    "PM06084MA",
    "PM06083MA",
    "PM06082MA",
    "PM06081MA",
    "PM06070MA",
]

# Combine all frequencies
all_freq = diarios + mensuales + trimestrales + anuales

# Initialize processor
processor = BCRPDataProcessor(
    all_freq, 
    start_date="2002-01-02", 
    end_date="2023-01-01", 
    parallel=True
)

# Process data
data = processor.process_data(save_sqlite=True)

# Access DataFrames by frequency
anuales_df = data.get("A")
trimestrales_df = data.get("Q")
mensuales_df = data.get("M")
diarios_df = data.get("D")
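
In the example above, each series code ends in the same letter used as a key in the returned data (D, M, Q or A). Assuming that pattern holds, which is inferred from these sample codes only, a mixed list can be split by frequency before or after a request. `group_codes_by_frequency` is a hypothetical helper and not part of the package.

```python
from collections import defaultdict

VALID_FREQUENCIES = {"D", "M", "Q", "A"}


def group_codes_by_frequency(codes):
    """Group series codes by their final letter (D, M, Q or A).

    Assumption: the last character of a code encodes its frequency, as the
    sample codes above suggest. Surrounding whitespace is stripped first.
    """
    groups = defaultdict(list)
    for raw in codes:
        code = raw.strip().upper()
        if not code or code[-1] not in VALID_FREQUENCIES:
            raise ValueError(f"cannot infer frequency from code {raw!r}")
        groups[code[-1]].append(code)
    return dict(groups)
```

Stripping whitespace here also guards against codes pasted with stray tabs or spaces.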

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache 2.0

Contact

fr.jhonk@gmail.com
