Tools to download and process public datasets from Peru (INEI & BCRP).

PyPeruStats

Allows downloading data from various data sources in Peru.

Sources: INEI, BCRP

Installation

pip install pyperustats

INEI

Parameters Description

MICRODATOS_INEI

  • survey: Survey type ('enaho', 'enapres', 'endes')
    • Available up to 2024-Quarter 3

download_default

  • format: Output file format
    • 'csv': CSV files
    • 'stata': Stata files
    • 'spss': SPSS files
  • force: Force re-download of existing files
  • remove_zip: Remove ZIP files after extraction
  • workers: Number of workers for parallel download
  • zip_dir: Directory to store ZIP files

organize_files

  • dir_output: Directory where organized files will be saved
  • order_by: Organization method
    • 'modules': Structure "mod_01/year_n.csv"
    • 'year': Structure "year_n/mod_n"
  • ext_documentation: List of documentation extensions
  • delete_master_dir: Delete master directory after organizing

USAGE

from pyPeruStats import MICRODATOS_INEI, print_tree

# Options: enaho, enapres, endes, available up to 2024-Quarter 3
enaho = MICRODATOS_INEI(survey="enaho") 
modules = enaho.modules
# Found modules 
print(modules.head(2))
   codigo_modulo                                      modulo      anio
0              1  Características de la Vivienda y del Hogar  2024 ...
1              2   Características de los Miembros del Hogar  2024 ...
downloaded = enaho.search(
    [2021, 2023, 2004, 2006, 2007, 2008], [1, 2, 3, 8]
).download_default(
    format='csv', # csv, stata, spss
    force=False, # download zip files again
    remove_zip=False, # remove original zips from microdata page
    workers=4,  # Parallel download
    zip_dir="trash_zips" # where zips will be downloaded
)

# Downloaded files within directory
print_tree('./trash_zips/')
๐Ÿ“ trash_zips
โ””โ”€โ”€ ๐Ÿ“ inei_enaho_download
    โ”œโ”€โ”€ ๐Ÿ“ 2004
        โ”œโ”€โ”€ ๐Ÿ“ 2004_01
        โ”‚   โ””โ”€โ”€ ๐Ÿ“ 280-Modulo01
        โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ CED-01-100 2004.pdf
        โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ Diccionario.pdf
        โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ enaho01-2004-100.dta
        โ”‚   โ”‚   โ””โ”€โ”€ ๐Ÿ“„ Ficha Tecnica - 2004.pdf
        โ”œโ”€โ”€ ๐Ÿ“ 2004_02
....
result_files = downloaded.organize_files(
    dir_output="./data_inei/", # Where files will be saved
    order_by="modules", # "modules": file structure mod_01/year_n.csv; "year": file structure year_n/mod_n
    ext_documentation=['pdf'], # extensions treated as documentation files
    delete_master_dir=False # set True to delete the master directory after organizing (use with caution)
)
print_tree("./data_inei/") # print file structure
๐Ÿ“ data_inei
โ”œโ”€โ”€ ๐Ÿ“ documentation_pdf
    โ”œโ”€โ”€ ๐Ÿ“„ 2004_01_ced-01-100_2004.pdf
    โ”œโ”€โ”€ ๐Ÿ“„ 2004_01_diccionario.pdf
    โ”œโ”€โ”€ ๐Ÿ“„ 2004_01_ficha_tecnica_-_2004
...
โ””โ”€โ”€ ๐Ÿ“ modules
    โ”œโ”€โ”€ ๐Ÿ“ 001
        โ”œโ”€โ”€ ๐Ÿ“„ 2004.dta
        โ”œโ”€โ”€ ๐Ÿ“„ 2006.dta
        โ”œโ”€โ”€ ๐Ÿ“„ 2007.dta
        โ”œโ”€โ”€ ๐Ÿ“„ 2008.csv
        โ”œโ”€โ”€ ๐Ÿ“„ 2021.csv
        โ””โ”€โ”€ ๐Ÿ“„ 2023.csv
    โ”œโ”€โ”€ ๐Ÿ“ 002
        โ”œโ”€โ”€ ๐Ÿ“„ 2004.dta
        โ”œโ”€โ”€ ๐Ÿ“„ 2006.dta
        โ”œโ”€โ”€ ๐Ÿ“„ 2007.dta
        โ”œโ”€โ”€ ๐Ÿ“„ 2008.csv
....
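
The mapping from the downloaded tree to the organized tree can be sketched as a small path function. This is only an illustration inferred from the two listings above, not the library's implementation: the name organized_path and its arguments are hypothetical, and it assumes one data file per module and year.

```python
from pathlib import PurePosixPath


def organized_path(year: int, module: int, filename: str, doc_exts=("pdf",)) -> PurePosixPath:
    """Map one extracted file to its place in the organized layout (hypothetical helper)."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    if ext in doc_exts:
        # Documentation: prefix with year and module, lowercase, spaces -> underscores.
        clean = filename.lower().replace(" ", "_")
        return PurePosixPath(f"documentation_{ext}") / f"{year}_{module:02d}_{clean}"
    # Data files: one folder per module, one file per year.
    return PurePosixPath("modules") / f"{module:03d}" / f"{year}.{ext}"


print(organized_path(2004, 1, "enaho01-2004-100.dta"))
print(organized_path(2004, 1, "CED-01-100 2004.pdf"))
```

If a module ships several data files for the same year, this naming would collide, so a real implementation would need an extra disambiguating suffix.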

Notes

  1. Parallel download (more than one worker) shortens total download time but uses more bandwidth, CPU, and memory
  2. It's recommended to keep original ZIP files as backup
  3. Check disk space before downloading multiple years/modules
  4. Documentation files are organized in a separate directory
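
For note 3, a quick disk-space check with the standard library might look like the sketch below. The helper name has_free_space and the 5 GiB threshold are assumptions for illustration; the package does not state how much space a given download needs.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Assumed threshold: 5 GiB; adjust to the years/modules you plan to download.
if not has_free_space(".", 5 * 1024**3):
    print("Less than 5 GiB free; consider downloading fewer years or modules.")
```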

BCRP

Current Issues with the Source Data

  1. Inconsistent Data Formats Across Frequencies

    • Spanish Month Abbreviations
      For example: "Ene05" (January 2005 in Spanish format).
    • Complex Date Strings
      Example: "31Ene05" combines day, month (abbreviated in Spanish), and year, requiring parsing.
    • Quarterly Indicators
      Example: "T113" indicates the 1st quarter of 2013 and needs transformation to a standard format.
  2. Additional Steps Required for Proper DataFrame Conversion

    • Converting non-standard date strings to a format recognized by pandas or similar libraries.
    • Harmonizing date formats across daily, monthly, quarterly, and annual frequencies.
  3. Slow Response Time from the BCRP UI

    • The platform often experiences delays when fetching data, impacting the efficiency of workflows.
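
To make the date issue concrete, here is a minimal parsing sketch using only the standard library. It is not the package's own converter. The function names, the two-digit-year pivot (00-68 mapped to 20xx, 69-99 to 19xx), the "Set" variant for September, and the four-digit form for annual labels are all assumptions.

```python
import re
from datetime import date

# Spanish month abbreviations; "set" (setiembre) is included as an assumed variant.
MONTHS_ES = {
    "ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
    "jul": 7, "ago": 8, "sep": 9, "set": 9, "oct": 10, "nov": 11, "dic": 12,
}

DAILY = re.compile(r"^(\d{1,2})([A-Za-z]{3})(\d{2})$")    # e.g. "31Ene05"
MONTHLY = re.compile(r"^([A-Za-z]{3})(\d{2})$")           # e.g. "Ene05"
QUARTERLY = re.compile(r"^T([1-4])(\d{2})$")              # e.g. "T113"


def full_year(two_digits: str) -> int:
    """Expand a two-digit year with an assumed pivot: 00-68 -> 20xx, 69-99 -> 19xx."""
    yy = int(two_digits)
    return 2000 + yy if yy <= 68 else 1900 + yy


def parse_bcrp_label(label: str) -> date:
    """Return the day itself for daily labels, else the first day of the period."""
    text = label.strip()
    if m := DAILY.match(text):
        day, month, year = m.groups()
        return date(full_year(year), MONTHS_ES[month.lower()], int(day))
    if m := MONTHLY.match(text):
        month, year = m.groups()
        return date(full_year(year), MONTHS_ES[month.lower()], 1)
    if m := QUARTERLY.match(text):
        quarter, year = m.groups()
        return date(full_year(year), 3 * (int(quarter) - 1) + 1, 1)
    if re.fullmatch(r"\d{4}", text):
        # Assumed annual form: a plain four-digit year.
        return date(int(text), 1, 1)
    raise ValueError(f"Unrecognized date label: {label!r}")


for label in ["Ene05", "31Ene05", "T113"]:
    print(label, "->", parse_bcrp_label(label).isoformat())
```

Returning plain date objects keeps the sketch dependency-free; the results can be passed to pandas.to_datetime if a DatetimeIndex is needed.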

Features

  • Seamless data retrieval across different time frequencies
  • Automatic conversion of Spanish date formats to standard datetime
  • Parallel processing capabilities
  • Built-in caching mechanism
  • Flexible data processing

USAGE

from pyPeruStats import BCRPDataProcessor

# Define series codes
diarios = ["PD38032DD", "PD04699XD"]
mensuales = ["RD38085BM", "RD38307BM"]
trimestrales = ["PD37940PQ", "PN38975BQ"]
anuales = [
    "PM06069MA",
    "PM06078MA",
    "PM06101MA",
    "PM06088MA",
    "PM06087MA",
    "PM06086MA",
    "PM06085MA",
    "PM06084MA",
    "PM06083MA",
    "PM06082MA",
    "PM06081MA",
    "PM06070MA",
]

# Combine all frequencies
all_freq = diarios + mensuales + trimestrales + anuales

# Initialize processor
processor = BCRPDataProcessor(
    all_freq, 
    start_date="2002-01-02", 
    end_date="2023-01-01", 
    parallel=True
)

# Process data
data = processor.process_data(save_sqlite=True)

# Access DataFrames by frequency
anuales_df = data.get("A")
trimestrales_df = data.get("Q")
mensuales_df = data.get("M")
diarios_df = data.get("D")
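
In the example, the last letter of each series code matches the key used to read its DataFrame back (D, M, Q, A). Assuming that pattern holds in general (it is an inference from this example, not documented behavior), codes can be cleaned and grouped before they are passed to the processor. The helper name group_by_frequency is hypothetical.

```python
from collections import defaultdict


def group_by_frequency(codes):
    """Group series codes by their final letter after trimming stray whitespace."""
    groups = defaultdict(list)
    for raw in codes:
        code = raw.strip()  # pasted codes sometimes carry tabs or spaces
        if code:
            groups[code[-1].upper()].append(code)
    return dict(groups)


print(group_by_frequency(["PD38032DD", "RD38085BM", "PD37940PQ", "\tPM06088MA"]))
```

Trimming whitespace first matters because a code with a leading tab or space would otherwise be sent to the source as a different, invalid identifier.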

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache 2.0

Contact

fr.jhonk@gmail.com

TODO

  • BCRP
    • Download statistical data from BCRP
    • Implement advanced data search functionality
    • Create autoplot functionality (inspired by ggplot)
    • Set up GitHub repository and backup mechanism
    • Add comprehensive documentation
    • Create example notebooks
