Skip to main content

A python tool to process the Censored QCEW for PR

Project description

QCEW Data Processing Tool

This tool is part of a collaboration between the University of Puerto Rico, Mayaguez, and Puerto Rico's Planning Board. Its main objective is to convert raw QCEW data into a DuckDB database with a geographic (geom) column that stores the coordinates of businesses.

Overview

The script takes raw Quarterly Census Employment and Wages (QCEW) data from the data/raw directory, processes it, and stores it in a DuckDB database. The raw data should be organized in subfolders by year (e.g., data/raw/2002, data/raw/2003), with each year folder containing the data for each quarter. The resulting database will be available in the data directory with a .ddb extension.

This tool also incorporates geospatial data, using latitude and longitude values to create a point geometry (geom) for each business, allowing for geographic queries.

Requirements

To run this tool, you will need the following Python packages:

  • ibis
  • duckdb
  • polars
  • geopandas
  • tqdm
  • requests
  • json
  • logging
  • pandas

You can install the necessary dependencies using:

pip install -r requirements.txt

Or utilize the uv to ensure compatibility

uv sync

File Structure

The data should be organized in the following structure:

data/
├── raw/
│   ├── 2002/
│   ├── 2003/
│   └── ...
├── processed/
├── external/
│   └── decode.json
└── data.ddb
  • data/raw/: This directory contains the raw QCEW data, organized by year and quarter.
  • data/processed/: This directory is for storing processed data.
  • data/external/: This directory contains external files, including decode.json, which is required for decoding the raw data files.
  • data.ddb: The output DuckDB database containing the processed data.

How It Works

  1. Initialization: The script checks for necessary directories (raw, processed, external) and creates them if they don't exist. It also downloads external files, such as decode.json, if not already present. This file holds the Census codification of the data.

  2. Data Processing:

    • The tool reads raw data files, cleans them, and extracts relevant fields based on predefined column widths defined in decode.json.
    • The cleaned data includes geographic coordinates (latitude and longitude), which are then transformed into a geom column of type Point.
    • This processed data is inserted into a DuckDB database.
  3. Group and Aggregate Data:

    • The data is grouped by NAICS code (4-digit), year, and quarter, aggregating information such as total wages and total employment.
    • Additional calculations are performed for contributions to the social security, Medicare, and other funds.
  4. Joining with External Data:

    • The tool also allows for joining the QCEW data with external data (e.g., hactable) based on NAICS codes, facilitating further analysis.

Key Functions

  • make_qcew_dataset: Processes all the raw QCEW data and inserts it into the DuckDB database.
  • clean_txt: Cleans and formats the raw text data, extracting relevant fields and generating geographic information.
  • group_by_naics_code: Groups data by NAICS code and aggregates the total wages and employment.
  • unique_naics_code: Joins the grouped QCEW data with external data based on the NAICS code.
  • pull_file: Downloads external files from a given URL (e.g., decode.json).

Usage

  1. Organize your raw QCEW data by year and quarter in the data/raw/ folder.
  2. Ensure that decode.json is in the data/external/ folder.
  3. Run the script to process the data:
python main.py

Logging

The script logs key events and warnings to a file called data_process.log. This includes information about successfully processed files, warnings for empty files, and other runtime details.

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

Contributing

Contributions to this tool are welcome. Please fork the repository and submit a pull request with any improvements or bug fixes.

If you have any questions or need further assistance, feel free to reach out!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jp_qcew-3.0.1.tar.gz (88.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

jp_qcew-3.0.1-py3-none-any.whl (22.2 kB view details)

Uploaded Python 3

File details

Details for the file jp_qcew-3.0.1.tar.gz.

File metadata

  • Download URL: jp_qcew-3.0.1.tar.gz
  • Upload date:
  • Size: 88.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Rocky Linux","version":"9.6","id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for jp_qcew-3.0.1.tar.gz
Algorithm Hash digest
SHA256 e75cc36bf3180fd87f0f825fb2105bbf6fdcbf1be9940dd978740f4afd97124a
MD5 96c3736e962e41c5c087162ec18409e9
BLAKE2b-256 7225476698a4e5daf9e8f834ad3fca64df988efb5a2bb5c2c452d152a79f01a3

See more details on using hashes here.

File details

Details for the file jp_qcew-3.0.1-py3-none-any.whl.

File metadata

  • Download URL: jp_qcew-3.0.1-py3-none-any.whl
  • Upload date:
  • Size: 22.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Rocky Linux","version":"9.6","id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for jp_qcew-3.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6db0bd01f9c92381501ef8d7beef039796164bed1cb403b367730a38e5b04305
MD5 d20a6bf933e246a58be1f89170ab95a4
BLAKE2b-256 36556dba076b984451a2457c2bb9701710ad191f32ed9038c6851a9dcb347d46

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page