A tool for cleaning and processing CPCB air quality data
AirPy Tool
A Python package for cleaning and processing CPCB (Central Pollution Control Board) air quality data for official government and research use.
Features
- Flexible Input: Process single files or entire directories
- Multiple Formats: Supports CSV and Excel (XLSX/XLS) files
- Auto-detection: Automatically detects filename format and extracts metadata
- Data Cleaning: Removes outliers and consecutive repeats, and corrects unit inconsistencies
- Unit Standardization: Converts nitrogen compounds (NO, NO2, NOx) to µg/m³
- Debug-friendly: Verbose mode for troubleshooting
Installation
From PyPI
pip install airpy-tool
From GitHub
pip install git+https://github.com/chandankr014/airpy-tool.git
Local Development
git clone https://github.com/chandankr014/airpy-tool.git
cd airpy-tool
pip install -e .
Quick Start
Command Line (CLI)
# Process a single file
airpy --input data/raw/site_5112_2024.csv --output data/clean/
# Process all files in a folder
airpy --input data/raw/ --output data/clean/
# With verbose output for debugging
airpy --input data/raw/ --output data/clean/ --verbose
# Process specific pollutants only
airpy --input data/raw/ --output data/clean/ --pollutants PM25 PM10
# Filter by city
airpy --input data/raw/ --output data/clean/ --city Delhi
# Overwrite existing files
airpy --input data/raw/ --output data/clean/ --overwrite
Python API
from airpy.core.processor import process_data
# Process a single file
df = process_data(
    input_path="data/raw/site_5112_2024.csv",
    output_path="data/clean/"
)
# Process all files in a folder
process_data(
    input_path="data/raw/",
    output_path="data/clean/"
)
# With all options
process_data(
    input_path="data/raw/",
    output_path="data/clean/",
    city="Delhi",                        # Filter by city
    pollutants=["PM25", "PM10", "NO2"],  # Specific pollutants
    verbose=True,                        # Debug output
    overwrite=True                       # Replace existing files
)
CLI Arguments Reference
| Argument | Short | Description |
|---|---|---|
| --input | -i | Path to input file or directory (required) |
| --output | -o | Path to output file or directory (required) |
| --city | | Filter processing to a specific city |
| --live | | Process live data format filenames |
| --pollutants | | List of pollutants to process |
| --siteid-position | | Custom site ID position [start, end] |
| --overwrite | | Overwrite existing output files |
| --verbose | -v | Enable verbose/debug output |
| --version | | Show version number |
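When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags listed in the table above. The function name `build_airpy_command` is hypothetical and not part of AirPy, and passing `--siteid-position` as two separate values is inferred from the troubleshooting example (`--siteid-position 1 2`).

```python
def build_airpy_command(input_path, output_path, city=None, pollutants=None,
                        siteid_position=None, live=False, overwrite=False,
                        verbose=False):
    """Build an argument list for the `airpy` CLI (hypothetical helper)."""
    cmd = ["airpy", "--input", str(input_path), "--output", str(output_path)]
    if city:
        cmd += ["--city", city]
    if pollutants:
        # One flag followed by every pollutant name, e.g. --pollutants PM25 PM10
        cmd += ["--pollutants", *pollutants]
    if siteid_position is not None:
        start, end = siteid_position
        cmd += ["--siteid-position", str(start), str(end)]
    if live:
        cmd.append("--live")
    if overwrite:
        cmd.append("--overwrite")
    if verbose:
        cmd.append("--verbose")
    return cmd


command = build_airpy_command(
    "data/raw/", "data/clean/", city="Delhi", pollutants=["PM25", "PM10"]
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed.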
Supported File Formats
AirPy automatically detects these filename formats:
| Format | Example |
|---|---|
| Site format | site_5112_2024.csv |
| Numeric format | 5112_2024.csv |
| 15min format | 15Min_2020_site_5111_station_name.csv |
| Raw data format | Raw_data_15Min_2020_site_5111_name.csv |
| Live format | site_5111202012251200000.xlsx |
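To illustrate how metadata could be read from names like those in the table, here is a minimal sketch based on regular expressions. It is not AirPy's own detection logic: the patterns, the function name `parse_filename`, and the returned keys are assumptions derived only from the example filenames above. The live format is left out because its layout cannot be determined from a single example.

```python
import re
from pathlib import Path

# Hypothetical patterns inferred from the example filenames in the table.
_PATTERNS = [
    ("site", re.compile(r"^site_(?P<site>\d+)_(?P<year>\d{4})$")),
    ("numeric", re.compile(r"^(?P<site>\d+)_(?P<year>\d{4})$")),
    ("raw", re.compile(r"^Raw_data_15Min_(?P<year>\d{4})_site_(?P<site>\d+)_.+$")),
    ("15min", re.compile(r"^15Min_(?P<year>\d{4})_site_(?P<site>\d+)_.+$")),
]


def parse_filename(filename):
    """Return format label, site ID, and year, or None if nothing matches."""
    stem = Path(filename).stem  # drop the directory and the extension
    for label, pattern in _PATTERNS:
        match = pattern.match(stem)
        if match:
            return {
                "format": label,
                "site_id": match.group("site"),
                "year": int(match.group("year")),
            }
    return None


for name in ["site_5112_2024.csv", "15Min_2020_site_5111_station_name.csv"]:
    print(name, parse_filename(name))
```

A name that matches none of the patterns returns `None`, which is the situation where the `--siteid-position` option becomes relevant.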
Output Columns
After processing, the cleaned data includes:
Standard Cleaned Columns
- PM25_clean: PM2.5 concentrations (µg/m³)
- PM10_clean: PM10 concentrations (µg/m³)
- Ozone_clean: Ozone concentrations (µg/m³)
Unit-Corrected Nitrogen Compounds
- NO_CPCB: Nitric oxide (µg/m³)
- NO2_CPCB: Nitrogen dioxide (µg/m³)
- NOx_CPCB: Total nitrogen oxides (µg/m³)
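For orientation, the arithmetic behind a gas unit conversion can be sketched as follows. This README does not state which source units or reference conditions AirPy uses, so the sketch assumes input in ppb and an ideal gas at 25 °C and 1 atm (molar volume 24.45 L/mol). The function name `ppb_to_ugm3` is hypothetical. NOx is omitted because its conversion depends on the NO/NO2 mix and on the reporting convention.

```python
# Assumed reference conditions: 25 °C and 1 atm, ideal-gas molar volume.
MOLAR_VOLUME_L_PER_MOL = 24.45

# Approximate molar masses in g/mol.
MOLAR_MASS_G_PER_MOL = {"NO": 30.01, "NO2": 46.01}


def ppb_to_ugm3(ppb, gas):
    """Convert a mixing ratio in ppb to a mass concentration in µg/m³."""
    return ppb * MOLAR_MASS_G_PER_MOL[gas] / MOLAR_VOLUME_L_PER_MOL


print(round(ppb_to_ugm3(10.0, "NO2"), 2))
print(round(ppb_to_ugm3(10.0, "NO"), 2))
```

Under these assumptions 1 ppb of NO2 corresponds to roughly 1.88 µg/m³ and 1 ppb of NO to roughly 1.23 µg/m³. Different reference temperatures change the factor by a few percent.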
Data Cleaning Process
- Data Formatting: Standardizes column names and timestamps
- Consecutive Repeat Detection: Removes stuck sensor readings
- Outlier Detection: Uses IQR and MAD methods
- Unit Correction: Standardizes NO/NO2/NOx to µg/m³
- Gap Interpolation: Fills small gaps in data
For detailed documentation, see Documentation.md.
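The ideas behind three of the steps listed above (consecutive repeats, outliers, small gaps) can be sketched in plain Python. This is an illustration under stated assumptions, not AirPy's implementation: the function names, the run length of 3, the modified z-score cutoff of 3.5, and the maximum gap of 2 samples are all hypothetical choices, and only the MAD method is shown for outliers.

```python
from statistics import median


def mask_stuck_runs(values, min_run=3):
    """Replace runs of at least min_run identical readings with None."""
    cleaned = list(values)
    start, n = 0, len(values)
    while start < n:
        end = start
        while (end + 1 < n and values[start] is not None
               and values[end + 1] == values[start]):
            end += 1
        if end - start + 1 >= min_run:
            for i in range(start, end + 1):
                cleaned[i] = None
        start = end + 1
    return cleaned


def mask_mad_outliers(values, threshold=3.5):
    """Replace readings whose modified z-score exceeds threshold with None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    med = median(present)
    mad = median([abs(v - med) for v in present])
    if mad == 0:
        return list(values)  # no spread, so nothing can be flagged
    return [
        None if v is not None and 0.6745 * abs(v - med) / mad > threshold else v
        for v in values
    ]


def fill_small_gaps(values, max_gap=2):
    """Linearly interpolate gaps of at most max_gap missing readings."""
    filled = list(values)
    i, n = 0, len(values)
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        j = i
        while j < n and values[j] is None:
            j += 1
        gap = j - i
        # Fill only gaps that have a valid reading on both sides.
        if i > 0 and j < n and gap <= max_gap:
            left, right = values[i - 1], values[j]
            for k in range(i, j):
                filled[k] = left + (right - left) * (k - i + 1) / (gap + 1)
        i = j
    return filled


readings = [10.0, 12.0, 12.0, 12.0, 12.0, 11.0, None, 13.0, 500.0, 12.0]
print(fill_small_gaps(mask_mad_outliers(mask_stuck_runs(readings))))
```

In the sample series, the run of four identical readings is longer than the maximum gap, so it stays missing after masking, while the single missing reading and the masked spike are interpolated from their neighbours.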
CPCB Data Access
Download CPCB state and city-wise air quality data: CPCB Data Repository
Troubleshooting
Common Issues
No files found
# Check if your files have supported extensions (.csv, .xlsx, .xls, .txt)
# Use verbose mode to see what's happening
airpy --input data/raw/ --output data/clean/ --verbose
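To see which files in a folder carry one of the extensions listed above, a small standalone check can help. The sketch below does not call AirPy. The function name `list_candidate_files` is hypothetical, and because this README does not say whether AirPy scans subfolders, the sketch only looks at the top level of the folder.

```python
from pathlib import Path

# Extensions taken from the troubleshooting note above.
SUPPORTED_EXTENSIONS = {".csv", ".xlsx", ".xls", ".txt"}


def list_candidate_files(folder):
    """Return top-level files in folder that have a supported extension."""
    root = Path(folder)
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    if Path("data/raw").is_dir():
        for path in list_candidate_files("data/raw"):
            print(path.name)
```

An empty result points to a wrong path or to unsupported extensions rather than to a problem with the data itself.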
Metadata extraction fails
# Use custom site ID position if your filename format is non-standard
airpy --input data/raw/ --output data/clean/ --siteid-position 1 2
Missing pollutant data
# Check which pollutants exist in your data
# Process specific available pollutants only
airpy --input data/raw/ --output data/clean/ --pollutants PM25 PM10
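One way to check which pollutants a CSV file contains is to read its header row before processing. The sketch below uses only the standard library and assumes that the first row of the file holds the column names, which may not be true for every raw CPCB export. The function name `read_csv_header` is hypothetical.

```python
import csv


def read_csv_header(path, encoding="utf-8-sig"):
    """Return the first row of a CSV file as a list of column names."""
    # utf-8-sig also strips a byte-order mark if the file starts with one.
    with open(path, newline="", encoding=encoding) as handle:
        return next(csv.reader(handle), [])
```

Calling `read_csv_header("data/raw/site_5112_2024.csv")` returns the column names, which can then be compared with the names passed to `--pollutants`.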
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this tool in your research, please cite:
AirPy - CPCB Air Quality Data Processing Tool
https://github.com/chandankr014/airpy-tool