A Python tool to process the censored QCEW for Puerto Rico (PR)

QCEW Data Processing Tool

Important: Project development has moved to Codeberg.

This tool is part of a collaboration between the University of Puerto Rico, Mayaguez, and Puerto Rico's Planning Board. Its main objective is to extract, clean, and process raw Quarterly Census of Employment and Wages (QCEW) data into structured formats optimized for high-performance economic and geographic analytics.

Overview

The pipeline reads raw fixed-width QCEW data from the data/qcew/ directory, parses it according to an internal JSON layout schema, and caches the intermediate tables as Parquet files partitioned by year.

The pipeline skips data from 2002 or earlier, casts the key metrics (monthly employment statistics, total and taxable wages) to numeric types, tolerates missing geospatial values, and uses Polars and DuckDB to produce a single integrated dataset for downstream workflows.

Requirements

To run this tool, you will need the following Python packages:

duckdb
polars
geopandas
pandas
tqdm

The tool also uses logging, which is part of the Python standard library and needs no installation.

You can install the dependencies via pip:

pip install -r requirements.txt

Or use uv to lock and sync your environment:

uv sync

File Structure

The workspace expects files to be organized in the following directory layout:

data/
├── qcew/
│   ├── 2003/
│   │   ├── qcew_file_q1.txt
│   │   └── ...
│   ├── 2004/
│   └── ...
└── processed/
    └── qcew/
        ├── 2003/
        │   ├── data-1.parquet
        │   └── ...
        └── 2004/

data/qcew/: Contains the raw text files in subfolders, one per year. Note: folders for 2002 or earlier are skipped automatically.
data/processed/qcew/{year}/: Automatically generated output location containing the cleaned data, saved as individual compressed .parquet files.
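
The layout above implies a simple mapping from raw year folders to output folders. A minimal sketch of that discovery step, using plain pathlib rather than the tool's actual code; the function names iter_year_dirs and processed_dir and the min_year parameter are hypothetical:

```python
from pathlib import Path
from typing import Iterator


def iter_year_dirs(raw_root: str, min_year: int = 2003) -> Iterator[Path]:
    """Yield year-named subfolders of raw_root, skipping years before min_year."""
    for path in sorted(Path(raw_root).iterdir()):
        # Only folders whose name is a plain year number are considered.
        if path.is_dir() and path.name.isdigit() and int(path.name) >= min_year:
            yield path


def processed_dir(processed_root: str, year_dir: Path) -> Path:
    """Map a raw year folder to its output folder, e.g. data/processed/qcew/2003."""
    return Path(processed_root) / year_dir.name
```

Calling iter_year_dirs("data/qcew") on the tree shown above would yield the 2003 and 2004 folders and ignore any folder named 2002 or earlier.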

How It Works

1. Initialization

The CleanQCEW class records your storage path, opens an isolated in-memory DuckDB connection, configures runtime logging, and loads the decode.json layout embedded in the package through the standard library's package-resource utilities.
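
As an illustration of the schema-loading part of this step only, the sketch below reads a JSON file bundled inside a package with importlib.resources. The helper name load_layout is hypothetical, and the package name you pass in is an assumption about where decode.json lives; the real class may load it differently.

```python
import json
from importlib import resources
from typing import Any


def load_layout(package: str, resource: str = "decode.json") -> Any:
    """Return the parsed JSON layout bundled inside the given package."""
    # resources.files() locates the installed package directory, so the
    # lookup works regardless of the current working directory.
    text = resources.files(package).joinpath(resource).read_text(encoding="utf-8")
    return json.loads(text)
```

Reading the file through the package rather than a relative path keeps the layout available after installation from a wheel.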

2. Fast Fixed-Width Parsing

Raw text files are read directly into Polars string columns using a multi-threaded read with a null-byte delimiter, which reduces overhead compared with reading lines in plain Python. Using the layout in decode.json, each field is sliced out, stripped of padding spaces, and named.
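
To make the slicing step concrete, here is a minimal sketch in plain Python rather than Polars, so it shows only the slice, strip, and name logic and not the fast reader. The LAYOUT dictionary (field name mapped to start offset and width) and the function names are hypothetical; the real decode.json may be structured differently.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: field name -> (start offset, width).
LAYOUT: Dict[str, Tuple[int, int]] = {
    "year": (0, 4),
    "qtr": (4, 1),
    "name": (5, 10),
}


def parse_line(line: str, layout: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Slice one fixed-width record into named, whitespace-stripped fields."""
    return {
        field: line[start:start + width].strip()
        for field, (start, width) in layout.items()
    }


def parse_lines(lines: Iterable[str], layout: Dict[str, Tuple[int, int]]) -> List[Dict[str, str]]:
    """Parse every non-blank line, dropping the trailing newline first."""
    return [parse_line(raw.rstrip("\n"), layout) for raw in lines if raw.strip()]
```

All fields come out as strings at this stage; converting them to numeric types is a separate step.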

3. Data Transformation & Alignment

Columns representing geographical points (latitude, longitude), indices (year, qtr), and monetary metrics (total_wages, taxable_wages, monthly employment statistics) are cast to numeric types (Float64 / Int64) in a way that does not raise schema exceptions on malformed values.
Metadata columns (file_year, file_qtr) are appended before each file is written out to its Parquet file on disk.

4. Aggregation and Return

The tool uses its DuckDB connection to query the full tree of Parquet files across all years in parallel, converting the query result directly into an in-memory pl.DataFrame.

Key Functions

__init__: Sets up the pipeline's directory paths, opens the DuckDB connection, and loads the internal layout schema.
make_qcew_dataset: Scans the input directories, runs the validation checks, tracks which chunks have been saved, and returns the final unified dataset.
clean_txt: Reads a raw text file and extracts the fields defined by the layout specification.

Usage

Organize your raw data folders inside your local storage folder (default: data/qcew/).
Run your pipeline orchestration module:

python main.py

Logging

Progress messages, warnings, and status updates are written automatically to a file called data_process.log, with a timestamp on each entry:

20-May-26 07:45:12 - INFO - File data/qcew/2003/raw_data.txt 1 has been inserted into the database.
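
For reference, a logging configuration that produces lines in the shape of the sample above could look like the following sketch. The format strings are inferred from that sample line, and the function name configure_logging is hypothetical; the tool's own setup may differ.

```python
import logging

# Inferred from the sample entry: "20-May-26 07:45:12 - INFO - message".
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%d-%b-%y %H:%M:%S"


def configure_logging(log_file: str = "data_process.log") -> None:
    """Send INFO-level and higher records to log_file with a dated prefix."""
    logging.basicConfig(
        filename=log_file,
        level=logging.INFO,
        format=LOG_FORMAT,
        datefmt=DATE_FORMAT,
        force=True,  # replace any handlers set up by earlier configuration
    )
```

After calling configure_logging(), a call such as logging.info("...") appends one formatted line to the file.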

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

Contributing

Contributions to this tool are welcome. Please fork the repository and submit a pull request with any improvements or bug fixes on Codeberg.

Cite

@software{ouslan2026jpqcew,
    author       = {Ouslan, Alejandro},
    title        = {JP-QCEW},
    month        = jan,
    year         = 2026,
    publisher    = {Zenodo},
    version      = {3.0.1},
    doi          = {10.5281/zenodo.18121581},
    url          = {https://doi.org/10.5281/zenodo.18121581}
}

Release history

3.2.2: May 20, 2026 (this version)
3.2.1: Mar 13, 2026
3.2.0: Feb 22, 2026
3.0.1: Dec 26, 2025

