Tools to download and process public datasets from Peru (INEI & BCRP).

PyPeruStats

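PyPeruStats downloads and processes public datasets from several Peruvian data sources.

Sources: INEI (Instituto Nacional de Estadística e Informática) and BCRP (Banco Central de Reserva del Perú)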

Installation

pip install perustats

INEI Microdata

Overview

The MicrodatosINEIFetcher class provides a comprehensive tool for downloading and organizing INEI (Instituto Nacional de Estadística e Informática) microdata from Peruvian surveys.

Key Features

Fetch available modules across multiple years
Download data files in multiple formats (Stata, SPSS, CSV, DBF)
Parallel downloading with configurable workers
Automatic ZIP extraction and validation
Flexible file organization by module or year
Smart documentation handling with duplicate detection
Hash-based deduplication for PDF files

Parameters

`MicrodatosINEIFetcher`

survey: Survey type to fetch
- 'enaho': National Household Survey
- 'endes': Demographic and Family Health Survey
- 'enapres': National Budget Programs Survey
- Data is available up to 2024
years: List of years to fetch data for (e.g., [2020, 2021, 2022])
master_directory: Root directory for storing downloaded data (default: './microdatos_inei')
parallel_jobs: Number of parallel download jobs (default: 2)
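One way to picture parallel_jobs is as the size of a worker pool. The sketch below is not the library's implementation; it uses the standard library's ThreadPoolExecutor, and fake_download, download_all, and the example.org URLs are hypothetical stand-ins for the real network call.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str) -> str:
    """Hypothetical stand-in for an HTTP download; returns the local filename."""
    return url.rsplit("/", 1)[-1]


def download_all(urls: list[str], parallel_jobs: int = 2) -> list[str]:
    """Run downloads on a pool of `parallel_jobs` threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        return list(pool.map(fake_download, urls))


# Placeholder URLs, not real INEI endpoints.
urls = [f"https://example.org/zips/2020_mod_{i:03d}.zip" for i in range(1, 4)]
files = download_all(urls, parallel_jobs=2)
print(files)
```

Threads suit this kind of work because downloads spend most of their time waiting on the network, so a few workers can overlap that waiting.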

`download_zips()`

formats: List of formats to download (default: ['stata', 'spss', 'csv'])
- 'stata': Stata .dta files
- 'spss': SPSS .sav files
- 'csv': CSV files
- 'dbf': dBASE files
force_download: Force re-download even if file exists (default: True)
module_codes: List of specific module codes to download (empty list = download all)
remove_zip_after_extract: Delete ZIP files after extraction (default: False)

`organize_files()`

organize_by: Organization scheme
- 'module': Structure by module (e.g., 001_module_name/2020_file.csv)
- 'year': Structure by year (e.g., 2020/001_file.csv)
keep_original_names: Keep original filenames (default: True)
operation: File operation type
- 'copy': Copy files (preserves originals)
- 'move': Move files (removes from unzipped directory)
docs_by_hash: Use hash-based deduplication for documentation files (default: True)
- When True, identical PDFs are stored only once regardless of filename
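The idea behind docs_by_hash can be sketched with the standard library. This is an illustration only: the choice of SHA-256, the helper names file_sha256 and deduplicate, and the shape of the returned mapping (canonical filename to list of aliases) are assumptions, not the library's confirmed internals.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths: list[Path]) -> dict[str, list[str]]:
    """Map each canonical filename to later filenames with identical content."""
    canonical_by_hash: dict[str, str] = {}
    aliases: dict[str, list[str]] = {}
    for path in sorted(paths):
        key = file_sha256(path)
        if key in canonical_by_hash:
            # Same bytes seen before: record this name as an alias only.
            aliases[canonical_by_hash[key]].append(path.name)
        else:
            canonical_by_hash[key] = path.name
            aliases[path.name] = []
    return aliases


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2020_manual.pdf").write_bytes(b"same content")
    (root / "2021_manual.pdf").write_bytes(b"same content")
    (root / "2021_dictionary.pdf").write_bytes(b"other content")
    doc_map = deduplicate(list(root.glob("*.pdf")))

print(doc_map)
```

Comparing content hashes rather than filenames means a manual that is republished under a new name each year is still stored once.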

Usage Examples

Example 1: ENDES Survey (Multiple Years and Modules)

import time

from perustats import MicrodatosINEIFetcher

# Start a timer, then initialize the ENDES fetcher for years 1990-2023
start = time.time()
endes = MicrodatosINEIFetcher(
    survey="endes",
    years=list(range(1990, 2024)),
    master_directory="./datos_inei",
    parallel_jobs=6
)

# Fetch available modules
endes.fetch_modules()

# Download specific modules in multiple formats
endes.download_zips(
    formats=["spss", "dbf", "stata", "csv"],
    module_codes=[64, 65, 73, 74],  # Specific modules
    force_download=True,
    remove_zip_after_extract=False
)

# Organize files by year
endes.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy"
)

# Also organize by module
endes.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)

# Report the total elapsed time
print(f"Elapsed: {time.time() - start:.1f} s")

Example 2: ENAHO Survey (National Household Survey)

from perustats import MicrodatosINEIFetcher

# Initialize ENAHO fetcher for years 2000-2024
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=list(range(2000, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=4
)

# Fetch available modules
enaho.fetch_modules()

# Display available modules (first 5)
print(enaho.modules_dataframe.head())

# Download specific modules
enaho.download_zips(
    formats=["csv", "stata", "spss", "dbf"],
    module_codes=[1, 13, 22, 34],
    force_download=False
)

# Organize by year with original names
enaho.organize_files(
    organize_by="year",
    keep_original_names=True,
    operation="copy",
    docs_by_hash=True  # Deduplicate documentation by content hash
)

# Also organize by module
enaho.organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="copy"
)

Example 3: ENAPRES Survey (Budget Programs)

from perustats import MicrodatosINEIFetcher

# Initialize ENAPRES fetcher
enapres = MicrodatosINEIFetcher(
    survey="enapres",
    years=list(range(2010, 2025)),
    master_directory="./datos_inei",
    parallel_jobs=3
)

# Fetch and download in one chain
enapres.fetch_modules().download_zips(
    formats=["stata", "csv"],
    module_codes=[101, 102, 111]
).organize_files(
    organize_by="module",
    keep_original_names=True,
    operation="move"  # Move instead of copy to save space
)

Example 4: Inspect Available Modules

from perustats import MicrodatosINEIFetcher

# Fetch modules without downloading
enaho = MicrodatosINEIFetcher(
    survey="enaho",
    years=[2020, 2021, 2022, 2023]
)

enaho.fetch_modules()

# View available modules
print(enaho.modules_dataframe[['year', 'module_code', 'module_name']])

Directory Structure

The fetcher creates the following directory structure:

./microdatos_inei/
└── enaho/                          # Survey name
    ├── 0_zips/                     # Downloaded ZIP files
    │   ├── 2020_mod_001.zip
    │   └── 2020_mod_002.zip
    ├── 1_unzipped/                 # Extracted contents
    │   ├── 2020_mod_001/
    │   └── 2020_mod_002/
    └── 2_organized/                # Organized files
        ├── by_module/              # When organize_by='module'
        │   ├── 001_vivienda_hogar/
        │   │   ├── 2020_file.csv
        │   │   └── 2021_file.csv
        │   └── 002_caracteristicas_miembros/
        │       ├── 2020_file.csv
        │       └── 2021_file.csv
        ├── by_year/                # When organize_by='year'
        │   ├── 2020/
        │   │   ├── 001_file.csv
        │   │   └── 002_file.csv
        │   └── 2021/
        │       ├── 001_file.csv
        │       └── 002_file.csv
        └── documentation/          # PDF documentation (deduplicated)
            ├── 2020_mod_001_manual.pdf
            └── 2020_mod_002_manual.pdf
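To make the layout concrete, the following sketch builds target paths that match the tree above. The helper organized_path is hypothetical and not part of perustats; it only mirrors the naming pattern shown in the tree (a zero-padded module code plus a year or module prefix).

```python
from pathlib import Path


def organized_path(master_directory: str, survey: str, organize_by: str,
                   year: int, module_code: int, module_name: str,
                   filename: str) -> Path:
    """Build a target path following the documented tree (illustrative helper)."""
    base = Path(master_directory) / survey / "2_organized"
    code = f"{module_code:03d}"  # e.g. 1 -> "001"
    if organize_by == "module":
        return base / "by_module" / f"{code}_{module_name}" / f"{year}_{filename}"
    if organize_by == "year":
        return base / "by_year" / str(year) / f"{code}_{filename}"
    raise ValueError(f"unknown organize_by: {organize_by!r}")


by_module = organized_path("./microdatos_inei", "enaho", "module",
                           2020, 1, "vivienda_hogar", "file.csv")
by_year = organized_path("./microdatos_inei", "enaho", "year",
                         2020, 1, "vivienda_hogar", "file.csv")
print(by_module.as_posix())
print(by_year.as_posix())
```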

Important Attributes

modules_dataframe: DataFrame containing all available modules for the specified years
documentation_map: Dictionary mapping canonical PDF filenames to their aliases (useful for tracking deduplicated files)
zip_maps: List of tuples containing (zip_path, extract_path) for all processed files

Best Practices

Start with fewer years: Test with 2-3 years before downloading extensive ranges
Use parallel jobs wisely: Higher values speed up downloads but consume more bandwidth
Keep ZIP files initially: Set remove_zip_after_extract=False for backup purposes
Hash-based deduplication: Enable docs_by_hash=True to avoid duplicate documentation files
Check disk space: Large surveys across many years can consume significant storage
Use method chaining: The fluent API allows chaining fetch_modules(), download_zips(), and organize_files()
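The disk-space check can be done before starting a large download. The sketch below uses the standard library's shutil.disk_usage; the helper name has_free_space and the 5 GiB budget are illustrative assumptions, not values taken from the library.

```python
import shutil


def has_free_space(directory: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= required_bytes


# Hypothetical budget: require 5 GiB free before downloading many years.
required = 5 * 1024**3
print(has_free_space(".", required))
```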

Performance Tips

Use parallel_jobs=4 or higher for faster downloads (adjust based on your connection)
- With a 100 Mbps connection: ~134 ZIP files download in 1.20-1.30 seconds
Use operation='move' instead of 'copy' to save disk space
Filter by module_codes to download only needed modules
Enable remove_zip_after_extract=True if storage is limited (after verifying extraction)

Notes

ZIP file integrity is automatically validated; corrupted files are re-downloaded
Documentation files are deduplicated by content hash when docs_by_hash=True
The class handles network failures gracefully with automatic retries
All file operations preserve original data integrity
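The validate-and-retry behavior described above can be approximated as follows. This is a sketch under assumptions, not the library's code: is_valid_zip relies on the standard library's ZipFile.testzip(), and fetch_until_valid and its fetch callback are hypothetical names standing in for the real download step.

```python
import io
import zipfile
from typing import Callable


def is_valid_zip(data: bytes) -> bool:
    """Return True if `data` is a readable ZIP whose members pass their CRC checks."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


def fetch_until_valid(fetch: Callable[[], bytes], attempts: int = 3) -> bytes:
    """Call `fetch` until it returns a valid ZIP, up to `attempts` times."""
    for _ in range(attempts):
        data = fetch()
        if is_valid_zip(data):
            return data
    raise RuntimeError(f"no valid ZIP after {attempts} attempts")


# Build a small valid archive in memory, plus a truncated (corrupted) copy.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("enaho_2020.csv", "id,value\n1,10\n")
good = buffer.getvalue()
truncated = good[: len(good) // 2]

# Simulate a first corrupted download followed by a good one.
responses = iter([truncated, good])
result = fetch_until_valid(lambda: next(responses))
print(is_valid_zip(truncated), is_valid_zip(result))
```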

BCRP

Current Issues with the Source Data

Inconsistent Data Formats Across Frequencies
- Spanish Month Abbreviations
  For example: "Ene05" (January 2005 in Spanish format).
- Complex Date Strings
  Example: "31Ene05" combines day, month (abbreviated in Spanish), and year, requiring parsing.
- Quarterly Indicators
  Example: "T113" indicates the 1st quarter of 2013 and needs transformation to a standard format.
Additional Steps Required for Proper DataFrame Conversion
- Converting non-standard date strings to a format recognized by pandas or similar libraries.
- Harmonizing date formats across daily, monthly, quarterly, and annual frequencies.
Slow Response Time from the BCRP UI
- The platform often experiences delays when fetching data, impacting the efficiency of workflows.
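The date formats listed above can be normalized with a small parser. This is a minimal sketch, not the parsing used by the library. Several details are assumptions: the function names, the pivot for two-digit years (borrowed from the POSIX %y convention), the "Set" variant for September, mapping quarters to their first day, and treating a four-digit string as an annual value.

```python
import re
from datetime import date

# Spanish month abbreviations; "set" (setiembre) is an assumed variant of "sep".
MONTHS = {"ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
          "jul": 7, "ago": 8, "sep": 9, "set": 9, "oct": 10, "nov": 11, "dic": 12}


def full_year(two_digits: str) -> int:
    """Expand a two-digit year; 69-99 -> 19xx, 00-68 -> 20xx (assumed pivot)."""
    yy = int(two_digits)
    return 1900 + yy if yy >= 69 else 2000 + yy


def parse_bcrp_date(text: str) -> date:
    """Parse 'T113', 'Ene05', '31Ene05', or '2013' into a datetime.date."""
    text = text.strip()
    quarterly = re.fullmatch(r"T([1-4])(\d{2})", text)
    if quarterly:
        quarter = int(quarterly.group(1))
        # Represent a quarter by its first day.
        return date(full_year(quarterly.group(2)), 3 * (quarter - 1) + 1, 1)
    dated = re.fullmatch(r"(\d{1,2})?([A-Za-z]{3})(\d{2})", text)
    if dated and dated.group(2).lower() in MONTHS:
        day = int(dated.group(1) or 1)  # monthly strings have no day part
        month = MONTHS[dated.group(2).lower()]
        return date(full_year(dated.group(3)), month, day)
    if re.fullmatch(r"\d{4}", text):
        return date(int(text), 1, 1)
    raise ValueError(f"unrecognized BCRP date string: {text!r}")


for raw in ["Ene05", "31Ene05", "T113", "2013"]:
    print(raw, "->", parse_bcrp_date(raw).isoformat())
```

Once every frequency maps to a date object, the values can be passed to pandas or any other library that expects standard dates.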

Features

Seamless data retrieval across different time frequencies
Automatic conversion of Spanish date formats to standard datetime
Parallel processing capabilities
Built-in caching mechanism
Flexible data processing

from perustats import BCRPDataProcessor

# Define series codes
diarios = ["PD38032DD", "PD04699XD"]
mensuales = ["RD38085BM", "RD38307BM"]
trimestrales = ["PD37940PQ", "PN38975BQ"]
anuales = [
    "PM06069MA",
    "PM06078MA",
    "PM06101MA",
    "PM06088MA",
    "PM06087MA",
    "PM06086MA",
    "PM06085MA",
    "PM06084MA",
    "PM06083MA",
    "PM06082MA",
    "PM06081MA",
    "PM06070MA",
]

# Combine all frequencies
all_freq = diarios + mensuales + trimestrales + anuales

# Initialize processor
processor = BCRPDataProcessor(
    all_freq, 
    start_date="2002-01-02", 
    end_date="2023-01-01", 
    parallel=True
)

# Process data
data = processor.process_data(save_sqlite=True)

# Access DataFrames by frequency
anuales_df = data.get("A")
trimestrales_df = data.get("Q")
mensuales_df = data.get("M")
diarios_df = data.get("D")

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache 2.0

Contact

fr.jhonk@gmail.com
