A Python tool to process the censored QCEW data for Puerto Rico
QCEW Data Processing Tool
Important: Project development has moved to Codeberg.
This tool is part of a collaboration between the University of Puerto Rico, Mayaguez, and Puerto Rico's Planning Board. Its main objective is to extract, clean, and process raw Quarterly Census of Employment and Wages (QCEW) data into structured formats optimized for high-performance economic and geographic analytics.
Overview
The pipeline reads raw fixed-width QCEW data from the data/qcew/ directory, parses it according to an internal JSON layout schema, and caches the intermediate tables as Parquet files partitioned by year.
The pipeline filters out records from 2002 or earlier, casts key metrics (employment figures, total and taxable wages) to numeric types, handles missing geospatial values safely, and uses Polars and DuckDB to produce a single integrated dataset for downstream workflows.
Requirements
To run this tool, you will need the following Python packages:
- duckdb
- polars
- geopandas
- pandas
- tqdm
- logging (part of the Python standard library, so no installation is needed)
You can install the dependencies via pip:
pip install -r requirements.txt
Alternatively, use uv to lock and sync your environment:
uv sync
File Structure
The workspace expects files to be organized in the following directory layout:
data/
├── qcew/
│   ├── 2003/
│   │   ├── qcew_file_q1.txt
│   │   └── ...
│   ├── 2004/
│   └── ...
└── processed/
    └── qcew/
        ├── 2003/
        │   ├── data-1.parquet
        │   └── ...
        └── 2004/
- data/qcew/: Contains raw text data in subfolders partitioned by year. Note: folders for 2002 or earlier are skipped automatically during processing.
- data/processed/qcew/{year}/: Automatically generated storage location for the clean, structured data chunks, each saved as an individual compressed .parquet file.
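The folder-skipping rule above can be sketched with the standard library alone. This is a minimal illustration, not the package's own code: find_raw_files and MIN_YEAR are hypothetical names, and the sketch assumes raw files end in .txt as in the tree above.

```python
from pathlib import Path

# Assumption: year folders for 2002 or earlier are skipped, so 2003 is the first year kept.
MIN_YEAR = 2003


def find_raw_files(root: str = "data/qcew") -> list[tuple[int, Path]]:
    """Return (year, file) pairs for every raw .txt file in a year folder >= MIN_YEAR."""
    found = []
    for year_dir in sorted(Path(root).iterdir()):
        # Ignore anything that is not a folder named after a year.
        if not year_dir.is_dir() or not year_dir.name.isdigit():
            continue
        year = int(year_dir.name)
        if year < MIN_YEAR:
            continue
        for raw_file in sorted(year_dir.glob("*.txt")):
            found.append((year, raw_file))
    return found
```

Calling find_raw_files("data/qcew") on the layout shown above would list the 2003 and 2004 files and leave out any 2002 folder.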
How It Works
1. Initialization
The class (CleanQCEW) stores the path to your storage directory, opens an isolated in-memory DuckDB connection, configures runtime logging, and loads the decode.json layout file that ships inside the package through the standard library's resource-loading machinery.
2. Fast Fixed-Width Parsing
Raw text files are streamed directly into Polars string columns using multi-threaded reading with a null-byte delimiter, which reduces overhead compared with standard Python line-by-line reading. Using the layout in decode.json, each field is sliced at its configured position, stripped of padding spaces, and named.
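The slicing step can be illustrated in plain Python. This sketch is not the Polars-based implementation described above, and the layout format (field name mapped to a start offset and a length) is an assumption about decode.json rather than its documented structure; parse_fixed_width and the sample record are hypothetical.

```python
import json

# Hypothetical layout in the spirit of decode.json: field name -> [start, length].
LAYOUT_JSON = '{"year": [0, 4], "qtr": [4, 1], "name": [5, 10], "total_wages": [15, 8]}'


def parse_fixed_width(lines, layout):
    """Slice each line at the layout offsets and strip the padding spaces."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rows.append(
            {
                name: line[start:start + length].strip()
                for name, (start, length) in layout.items()
            }
        )
    return rows


layout = json.loads(LAYOUT_JSON)
# Build one sample record; ljust pads the name field to its fixed width of 10.
sample = ["2003" + "1" + "ACME CO".ljust(10) + "00012500" + "\n"]
records = parse_fixed_width(sample, layout)
```

Every value comes back as a string at this stage; numeric casting is a separate step.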
3. Data Transformation & Alignment
- Columns representing geographic coordinates (latitude, longitude), time indices (year, qtr), and wage and employment metrics (total_wages, taxable_wages, monthly employment statistics) are cast to optimized types (Float64/Int64) without raising schema exceptions.
- Metadata attributes (file_year, file_qtr) are appended before each file is written to its target Parquet file on disk.
4. Aggregation and Return
The tool uses its DuckDB connection to query the Parquet files across all year folders in parallel and converts the query result directly into an in-memory pl.DataFrame.
Key Functions
- __init__: Sets up pipeline directory mappings, spawns the central connection instance, and loads internal schema rules.
- make_qcew_dataset: Scans the input directories, runs the validation checks, manages chunk-saving states, and returns the final unified dataset.
- clean_txt: Performs raw text ingestion and extracts the relevant fields based on the layout specification boundaries.
Usage
- Organize your raw data folders inside your local storage folder (default: data/qcew/).
- Run your pipeline orchestration module:
python main.py
Logging
Progress messages, warnings, and status updates are written automatically to a file named data_process.log. Each entry is prefixed with a timestamp and log level:
20-May-26 07:45:12 - INFO - File data/qcew/2003/raw_data.txt 1 has been inserted into the database.
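A logging setup that produces entries shaped like the one above can be sketched as follows. The format strings are inferred from that sample entry and are an assumption, not the tool's confirmed configuration; configure_logging and the logger name are hypothetical.

```python
import logging

# Assumed formats, inferred from the sample entry "20-May-26 07:45:12 - INFO - ...".
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%d-%b-%y %H:%M:%S"


def configure_logging(log_file: str = "data_process.log") -> logging.Logger:
    """Return a logger that appends timestamped entries to log_file."""
    logger = logging.getLogger("qcew_example")
    logger.setLevel(logging.INFO)
    # Attach the file handler only once, even if this function is called repeatedly.
    if not logger.handlers:
        handler = logging.FileHandler(log_file)
        handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
        logger.addHandler(handler)
    return logger
```

With this date format, timestamps have one-second resolution, as in the sample entry.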
License
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.
Contributing
Contributions to this tool are welcome. Please fork the repository and submit a pull request with any improvements or bug fixes on Codeberg.
Cite
@software{ouslan2026jpqcew,
  author    = {Ouslan, Alejandro},
  title     = {JP-QCEW},
  month     = jan,
  year      = 2026,
  publisher = {Zenodo},
  version   = {3.0.1},
  doi       = {10.5281/zenodo.18121581},
  url       = {https://doi.org/10.5281/zenodo.18121581}
}