A Python tool to process the Censored QCEW for Puerto Rico

QCEW Data Processing Tool

Important: Project development has moved to Codeberg.

This tool is part of a collaboration between the University of Puerto Rico, Mayaguez, and Puerto Rico's Planning Board. Its main objective is to extract, clean, and process raw Quarterly Census of Employment and Wages (QCEW) data into structured formats optimized for high-performance economic and geographic analytics.

Overview

The pipeline reads raw fixed-width QCEW data from the data/qcew/ directory, parses it according to an internal JSON layout schema, and caches intermediate tables as structured Parquet files partitioned by year.

The pipeline filters out records from the year 2002 or earlier, casts critical data metrics (employment indices, total/taxable wages), handles missing geospatial parameters safely, and leverages Polars and DuckDB to yield a single, integrated dataset for downstream workflows.


Requirements

To run this tool, you will need the following Python packages:

  • duckdb
  • polars
  • geopandas
  • pandas
  • tqdm
  • logging (part of the Python standard library; no installation required)

You can install the dependencies via pip:

pip install -r requirements.txt

Or use uv to lock and sync your environment:

uv sync

File Structure

The workspace expects files to be organized in the following directory layout:

data/
├── qcew/
│   ├── 2003/
│   │   ├── qcew_file_q1.txt
│   │   └── ...
│   ├── 2004/
│   └── ...
└── processed/
    └── qcew/
        ├── 2003/
        │   ├── data-1.parquet
        │   └── ...
        └── 2004/

  • data/qcew/: Contains raw text data in subfolders partitioned by year. Note: folders for the year 2002 or earlier are automatically skipped by the pipeline.
  • data/processed/qcew/{year}/: Automatically generated storage location containing clean, structured data chunks saved as individual compressed .parquet files.
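The directory rules above can be sketched with the standard library alone. This is only an illustration of the layout and the year cutoff, not the package's own code: the helper names find_year_dirs and output_dir_for are hypothetical.

```python
from pathlib import Path
import tempfile

MIN_YEAR = 2003  # folders for 2002 or earlier are skipped, per the note above


def find_year_dirs(raw_root: Path, min_year: int = MIN_YEAR) -> list[Path]:
    """Return year-named subfolders of raw_root with year >= min_year, sorted."""
    year_dirs = []
    for child in raw_root.iterdir():
        # Keep only directories whose name is a plain integer year.
        if child.is_dir() and child.name.isdigit() and int(child.name) >= min_year:
            year_dirs.append(child)
    return sorted(year_dirs, key=lambda p: int(p.name))


def output_dir_for(year_dir: Path, processed_root: Path) -> Path:
    """Map data/qcew/<year> to data/processed/qcew/<year>."""
    return processed_root / year_dir.name


# Demo on a throwaway directory tree that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    raw_root = Path(tmp) / "data" / "qcew"
    for year in ("2001", "2002", "2003", "2004"):
        (raw_root / year).mkdir(parents=True)
    found_years = [p.name for p in find_year_dirs(raw_root)]
    target = output_dir_for(raw_root / "2003", Path(tmp) / "data" / "processed" / "qcew")
    target_parts = target.parts[-4:]

print(found_years)
print(target_parts)
```

Sorting by the integer year rather than by name keeps the order correct even if a folder name is not zero-padded.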

How It Works

1. Initialization

The class (CleanQCEW) records your storage path, opens an isolated in-memory DuckDB connection, configures runtime logging, and loads the package-embedded decode.json layout through the standard library's resource-loading facilities.

2. Fast Fixed-Width Parsing

Raw text files are read directly into Polars string columns using a multi-threaded reader with a null byte as the delimiter, which has less overhead than standard Python line-by-line reading. Using the configuration from decode.json, each field is then sliced out at its defined position, stripped of padding spaces, and named.
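To make the slicing step concrete, here is a plain-Python sketch of layout-driven fixed-width parsing. It is not the tool's Polars implementation, and the layout format (field name mapped to a start offset and a length) is an assumption made for illustration; the real decode.json may be organized differently. The field names and widths are likewise hypothetical.

```python
import json

# Hypothetical layout: field name -> [start offset, length]. The decode.json
# shipped with the package may use a different structure.
LAYOUT_JSON = """
{
  "year":        [0, 4],
  "qtr":         [4, 1],
  "name":        [5, 10],
  "total_wages": [15, 8]
}
"""


def parse_fixed_width(line: str, layout: dict[str, list[int]]) -> dict[str, str]:
    """Slice one fixed-width record into named, whitespace-stripped fields."""
    record = {}
    for name, (start, length) in layout.items():
        record[name] = line[start:start + length].strip()
    return record


layout = json.loads(LAYOUT_JSON)

# Build a sample record by padding each field to its declared width.
sample_line = "2003" + "1" + "ACME CO".ljust(10) + "12500".rjust(8)
parsed = parse_fixed_width(sample_line, layout)
print(parsed)
```

Every field is still a string at this point; type conversion is a separate step, which keeps the slicing logic independent of what each column means.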

3. Data Transformation & Alignment

  • Columns representing geographical points (latitude, longitude), indices (year, qtr), and monetary metrics (total_wages, taxable_wages, monthly employment statistics) are cast to optimized types (Float64 / Int64) safely without throwing schema exceptions.
  • Metadata attributes (file_year, file_qtr) are appended natively before individual files are written out to target Parquet archives on disk.
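The two points above can be illustrated with a small plain-Python sketch: values that cannot be converted become None instead of raising, and the file-level metadata is appended to each record. The tool itself works on Polars columns (where a non-strict cast, cast(..., strict=False), behaves similarly); the helper names safe_cast and transform are hypothetical, and the choice of int versus float for each column here is an assumption.

```python
from typing import Any, Callable, Optional

# Assumed target type per column; the package's actual mapping may differ.
CASTS: dict[str, Callable[[str], Any]] = {
    "latitude": float,
    "longitude": float,
    "year": int,
    "qtr": int,
    "total_wages": int,
    "taxable_wages": int,
}


def safe_cast(value: Optional[str], caster: Callable[[str], Any]) -> Any:
    """Convert value with caster; return None for blank or unparsable input."""
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None


def transform(record: dict[str, str], file_year: int, file_qtr: int) -> dict[str, Any]:
    """Cast known columns safely and append file-level metadata."""
    out: dict[str, Any] = {}
    for name, value in record.items():
        caster = CASTS.get(name)
        # Columns without a registered caster are passed through unchanged.
        out[name] = safe_cast(value, caster) if caster else value
    out["file_year"] = file_year
    out["file_qtr"] = file_qtr
    return out


row = transform(
    {"latitude": "18.2011", "longitude": "", "year": "2003", "qtr": "1",
     "total_wages": "12x00", "name": "ACME CO"},
    file_year=2003,
    file_qtr=1,
)
print(row)
```

The empty longitude and the malformed wage value both come back as None, so one bad record does not abort processing of the whole file.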

4. Aggregation and Return

The tool uses its DuckDB connection to query the full directory tree of Parquet files across all years in parallel, then converts the result directly into an in-memory pl.DataFrame.


Key Functions

  • __init__: Sets up pipeline directory mappings, spawns the central connection instance, and loads internal schema rules.
  • make_qcew_dataset: Scans the input directories, runs the validation checks, manages chunk-saving states, and returns the final unified dataset.
  • clean_txt: Performs raw text ingestion and extracts relevant structural fields based on layout specification boundaries.

Usage

  1. Organize your raw data folders inside your local storage folder (default: data/qcew/).
  2. Run your pipeline orchestration module:
python main.py

Logging

Progress messages, warnings, and status updates are written automatically to a file called data_process.log, with a timestamp on each entry:

20-May-26 07:45:12 - INFO - File data/qcew/2003/raw_data.txt 1 has been inserted into the database.
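The sample entry above is consistent with a standard logging setup along the following lines. This is a sketch inferred from the entry's shape, not a copy of the package's configuration; the format strings and the build_logger helper are assumptions.

```python
import io
import logging

# Format inferred from the sample entry: "20-May-26 07:45:12 - INFO - ..."
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%d-%b-%y %H:%M:%S"


def build_logger(name: str, stream=None, log_path: str = "data_process.log") -> logging.Logger:
    """Create an INFO-level logger writing to a stream, or to log_path if no stream is given."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if stream is not None:
        handler = logging.StreamHandler(stream)
    else:
        handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    logger.addHandler(handler)
    return logger


# Demo: log to an in-memory buffer so that no file is created.
buffer = io.StringIO()
demo_logger = build_logger("qcew_demo", stream=buffer)
demo_logger.info("File %s has been inserted into the database.", "data/qcew/2003/raw_data.txt")
log_line = buffer.getvalue().strip()
print(log_line)
```

Omitting the stream argument would send the same formatted entries to data_process.log instead.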

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

Contributing

Contributions to this tool are welcome. Please fork the repository and submit a pull request with any improvements or bug fixes on Codeberg.


Cite

@software{ouslan2026jpqcew,
    author       = {Ouslan, Alejandro},
    title        = {JP-QCEW},
    month        = jan,
    year         = 2026,
    publisher    = {Zenodo},
    version      = {3.0.1},
    doi          = {10.5281/zenodo.18121581},
    url          = {https://doi.org/10.5281/zenodo.18121581}
}
