Skip to main content

A python tool to process the Censored QCEW for PR

Project description

QCEW Data Processing Tool

This tool is part of a collaboration between the University of Puerto Rico, Mayaguez, and Puerto Rico's Planning Board. Its main objective is to convert raw QCEW data into a DuckDB database with a geographic (geom) column that stores the coordinates of businesses.

Overview

The script takes raw Quarterly Census Employment and Wages (QCEW) data from the data/raw directory, processes it, and stores it in a DuckDB database. The raw data should be organized in subfolders by year (e.g., data/raw/2002, data/raw/2003), with each year folder containing the data for each quarter. The resulting database will be available in the data directory with a .ddb extension.

This tool also incorporates geospatial data, using latitude and longitude values to create a point geometry (geom) for each business, allowing for geographic queries.

Requirements

To run this tool, you will need the following Python packages:

  • ibis
  • duckdb
  • polars
  • geopandas
  • tqdm
  • requests
  • json
  • logging
  • pandas

You can install the necessary dependencies using:

pip install -r requirements.txt

Or utilize the uv to ensure compatibility

uv sync

File Structure

The data should be organized in the following structure:

data/
├── raw/
│   ├── 2002/
│   ├── 2003/
│   └── ...
├── processed/
├── external/
│   └── decode.json
└── data.ddb
  • data/raw/: This directory contains the raw QCEW data, organized by year and quarter.
  • data/processed/: This directory is for storing processed data.
  • data/external/: This directory contains external files, including decode.json, which is required for decoding the raw data files.
  • data.ddb: The output DuckDB database containing the processed data.

How It Works

  1. Initialization: The script checks for necessary directories (raw, processed, external) and creates them if they don't exist. It also downloads external files, such as decode.json, if not already present. This file holds the Census codification of the data.

  2. Data Processing:

    • The tool reads raw data files, cleans them, and extracts relevant fields based on predefined column widths defined in decode.json.
    • The cleaned data includes geographic coordinates (latitude and longitude), which are then transformed into a geom column of type Point.
    • This processed data is inserted into a DuckDB database.
  3. Group and Aggregate Data:

    • The data is grouped by NAICS code (4-digit), year, and quarter, aggregating information such as total wages and total employment.
    • Additional calculations are performed for contributions to the social security, Medicare, and other funds.
  4. Joining with External Data:

    • The tool also allows for joining the QCEW data with external data (e.g., hactable) based on NAICS codes, facilitating further analysis.

Key Functions

  • make_qcew_dataset: Processes all the raw QCEW data and inserts it into the DuckDB database.
  • clean_txt: Cleans and formats the raw text data, extracting relevant fields and generating geographic information.
  • group_by_naics_code: Groups data by NAICS code and aggregates the total wages and employment.
  • unique_naics_code: Joins the grouped QCEW data with external data based on the NAICS code.
  • pull_file: Downloads external files from a given URL (e.g., decode.json).

Usage

  1. Organize your raw QCEW data by year and quarter in the data/raw/ folder.
  2. Ensure that decode.json is in the data/external/ folder.
  3. Run the script to process the data:
python main.py

Logging

The script logs key events and warnings to a file called data_process.log. This includes information about successfully processed files, warnings for empty files, and other runtime details.

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

Contributing

Contributions to this tool are welcome. Please fork the repository and submit a pull request with any improvements or bug fixes.

If you have any questions or need further assistance, feel free to reach out!

Cite

@software{ouslan2026jpqcew,
    author       = {Ouslan, Alejandro},
    title        = {JP-QCEW},
    month        = jan,
    year         = 2026,
    publisher    = {Zenodo},
    version      = {3.0.1},
    doi          = {10.5281/zenodo.18121581},
    url          = {https://doi.org/10.5281/zenodo.18121581}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jp_qcew-3.2.0.tar.gz (152.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

jp_qcew-3.2.0-py3-none-any.whl (22.3 kB view details)

Uploaded Python 3

File details

Details for the file jp_qcew-3.2.0.tar.gz.

File metadata

  • Download URL: jp_qcew-3.2.0.tar.gz
  • Upload date:
  • Size: 152.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.10.4 {"installer":{"name":"uv","version":"0.10.4","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Rocky Linux","version":"9.7","id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for jp_qcew-3.2.0.tar.gz
Algorithm Hash digest
SHA256 0ec435355318c4e88fb601215830b886ec345fa815494bc8a26b18559872be6f
MD5 e5b627dec7ce446b8b8946b87b51ea3d
BLAKE2b-256 3dbf2095b550342a773c4d8cf192f7fedf834677588b5fb3a7bf4fe39fb73588

See more details on using hashes here.

File details

Details for the file jp_qcew-3.2.0-py3-none-any.whl.

File metadata

  • Download URL: jp_qcew-3.2.0-py3-none-any.whl
  • Upload date:
  • Size: 22.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.10.4 {"installer":{"name":"uv","version":"0.10.4","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Rocky Linux","version":"9.7","id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for jp_qcew-3.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 272d91998a758c0afdef0889b03fa66ba54851ff741b105116aaa5490bd7700b
MD5 57dffa3b4a8bc0347aaea7ade797119d
BLAKE2b-256 c14627e3199e5743509fbb0af77edc2dad470f6259c3bda887584e12db947640

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page