A tool for cleaning and processing CPCB air quality data

AirPy Tool

A Python package for cleaning and processing CPCB (Central Pollution Control Board) air quality data for official government and research use.

Requires Python 3.8+. License: MIT.

Features

  • Flexible Input: Process single files or entire directories
  • Multiple Formats: Supports CSV and Excel (XLSX/XLS) files
  • Auto-detection: Detects the filename format and extracts metadata from it
  • Data Cleaning: Removes outliers and consecutive repeats, and corrects unit inconsistencies
  • Unit Standardization: Converts NO, NO2, and NOx to µg/m³
  • Debug-friendly: Verbose mode for troubleshooting

Installation

From PyPI

pip install airpy-tool

From GitHub

pip install git+https://github.com/chandankr014/airpy-tool.git

Local Development

git clone https://github.com/chandankr014/airpy-tool.git
cd airpy-tool
pip install -e .

Quick Start

Command Line (CLI)

# Process a single file
airpy --input data/raw/site_5112_2024.csv --output data/clean/

# Process all files in a folder
airpy --input data/raw/ --output data/clean/

# With verbose output for debugging
airpy --input data/raw/ --output data/clean/ --verbose

# Process specific pollutants only
airpy --input data/raw/ --output data/clean/ --pollutants PM25 PM10

# Filter by city
airpy --input data/raw/ --output data/clean/ --city Delhi

# Overwrite existing files
airpy --input data/raw/ --output data/clean/ --overwrite

Python API

from airpy.core.processor import process_data

# Process a single file
df = process_data(
    input_path="data/raw/site_5112_2024.csv",
    output_path="data/clean/"
)

# Process all files in a folder
process_data(
    input_path="data/raw/",
    output_path="data/clean/"
)

# With all options
process_data(
    input_path="data/raw/",
    output_path="data/clean/",
    city="Delhi",                          # Filter by city
    pollutants=["PM25", "PM10", "NO2"],     # Specific pollutants
    verbose=True,                           # Debug output
    overwrite=True                          # Replace existing files
)

CLI Arguments Reference

Argument            Short   Description
--input             -i      Path to input file or directory (required)
--output            -o      Path to output file or directory (required)
--city                      Filter processing to a specific city
--live                      Process live data format filenames
--pollutants                List of pollutants to process
--siteid-position           Custom site ID position [start, end]
--overwrite                 Overwrite existing output files
--verbose           -v      Enable verbose/debug output
--version                   Show version number

Supported File Formats

AirPy automatically detects these filename formats:

Format            Example
Site format       site_5112_2024.csv
Numeric format    5112_2024.csv
15min format      15Min_2020_site_5111_station_name.csv
Raw data format   Raw_data_15Min_2020_site_5111_name.csv
Live format       site_5111202012251200000.xlsx
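
To illustrate how metadata could be pulled from filenames like these, here is a minimal sketch using regular expressions. It is not AirPy's actual detection code: the function name `parse_filename` and the patterns are assumptions inferred only from the example filenames in the table, and the live format is deliberately left unhandled because its field layout is not documented here.

```python
import re
from pathlib import Path

# Hypothetical patterns inferred from the example filenames only.
_PATTERNS = [
    # 15Min_2020_site_5111_station_name.csv and Raw_data_15Min_2020_site_5111_name.csv
    re.compile(r"^(?:Raw_data_)?15Min_(?P<year>\d{4})_site_(?P<site>\d+)_"),
    # site_5112_2024.csv
    re.compile(r"^site_(?P<site>\d+)_(?P<year>\d{4})$"),
    # 5112_2024.csv
    re.compile(r"^(?P<site>\d+)_(?P<year>\d{4})$"),
]


def parse_filename(name):
    """Return {'site_id', 'year'} for a recognised filename, else None."""
    stem = Path(name).stem  # drop the extension before matching
    for pattern in _PATTERNS:
        match = pattern.match(stem)
        if match:
            return {"site_id": match.group("site"), "year": int(match.group("year"))}
    return None  # unrecognised (including the live format in this sketch)


if __name__ == "__main__":
    for example in ["site_5112_2024.csv", "5112_2024.csv",
                    "15Min_2020_site_5111_station_name.csv"]:
        print(example, parse_filename(example))
```

When a filename does not fit any detected format, the `--siteid-position` option described above is the documented way to point AirPy at the site ID.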

Output Columns

After processing, the cleaned data includes:

Standard Cleaned Columns

  • PM25_clean - PM2.5 concentrations (µg/m³)
  • PM10_clean - PM10 concentrations (µg/m³)
  • Ozone_clean - Ozone concentrations (µg/m³)

Unit-Corrected Nitrogen Compounds

  • NO_CPCB - Nitric oxide (µg/m³)
  • NO2_CPCB - Nitrogen dioxide (µg/m³)
  • NOx_CPCB - Total nitrogen oxides (µg/m³)
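
For readers who want to sanity-check the µg/m³ values, the sketch below shows the standard ideal-gas conversion from a mixing ratio in ppb to a mass concentration. It assumes the unit inconsistency is ppb versus µg/m³ and uses 25 °C and 1 atm as reference conditions; both are assumptions, and the function name `ppb_to_ugm3` is hypothetical. AirPy's own correction logic may differ.

```python
# Universal gas constant in J/(mol*K).
R = 8.314462618

# Molar masses in g/mol.
MOLAR_MASS = {"NO": 30.006, "NO2": 46.0055}


def ppb_to_ugm3(ppb, species, temp_c=25.0, pressure_pa=101325.0):
    """Convert a mixing ratio in ppb to µg/m³ using the ideal gas law.

    The reference temperature and pressure are assumptions; change them
    to match the conditions your data are reported at.
    """
    # Moles of air per cubic metre at the given conditions: n/V = P / (R*T).
    mol_air_per_m3 = pressure_pa / (R * (temp_c + 273.15))
    # ppb (1e-9 mol/mol) * mol/m³ * g/mol * 1e6 µg/g  ->  overall factor 1e-3.
    return ppb * MOLAR_MASS[species] * mol_air_per_m3 * 1e-3


if __name__ == "__main__":
    print(round(ppb_to_ugm3(1.0, "NO2"), 2))  # about 1.88 µg/m³ per ppb at 25 °C
    print(round(ppb_to_ugm3(1.0, "NO"), 2))   # about 1.23 µg/m³ per ppb at 25 °C
```

NOx is omitted from the sketch because converting it requires choosing a reference molar mass, and this page does not say which convention AirPy uses.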

Data Cleaning Process

  1. Data Formatting: Standardizes column names and timestamps
  2. Consecutive Repeat Detection: Removes stuck sensor readings
  3. Outlier Detection: Uses IQR and MAD methods
  4. Unit Correction: Standardizes NO/NO2/NOx to µg/m³
  5. Gap Interpolation: Fills small gaps in data
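
Steps 2 and 3 can be illustrated with a small standard-library sketch. This is not AirPy's implementation: the function names, the run length of 4, the IQR multiplier of 1.5, and the modified z-score cutoff of 3.5 are all assumed values, and MAD is taken here to mean median absolute deviation. See Documentation.md for the thresholds the package actually applies.

```python
from itertools import groupby
from statistics import median, quantiles


def mask_repeats(values, min_run=4):
    """Replace runs of at least min_run identical readings with None."""
    cleaned = []
    for value, group in groupby(values):
        run = list(group)
        stuck = value is not None and len(run) >= min_run
        cleaned.extend([None] * len(run) if stuck else run)
    return cleaned


def iqr_flags(values, k=1.5):
    """True where a reading lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v is not None and not (low <= v <= high) for v in values]


def mad_flags(values, threshold=3.5):
    """True where the MAD-based modified z-score exceeds threshold."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:  # no spread around the median, so nothing can be scored
        return [False] * len(values)
    return [v is not None and abs(0.6745 * (v - med) / mad) > threshold
            for v in values]


if __name__ == "__main__":
    readings = [10, 11, 12, 11, 10, 12, 11, 500]
    print(iqr_flags(readings))
    print(mad_flags(readings))
    print(mask_repeats([5, 5, 5, 5, 7, 7, 8]))
```

Missing readings are represented as None rather than NaN so that equality comparisons inside `groupby` behave predictably.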

For detailed documentation, see Documentation.md.

CPCB Data Access

Download CPCB state and city-wise air quality data: CPCB Data Repository

Troubleshooting

Common Issues

No files found

# Check if your files have supported extensions (.csv, .xlsx, .xls, .txt)
# Use verbose mode to see what's happening
airpy --input data/raw/ --output data/clean/ --verbose
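
Before rerunning AirPy, it can help to list which files in the input folder carry one of the extensions named above. The following is a rough standalone check, not part of AirPy; the helper name `list_candidate_files` is hypothetical and the extension set is copied from the comment above.

```python
from pathlib import Path

# Extensions taken from the troubleshooting note above; adjust as needed.
SUPPORTED = {".csv", ".xlsx", ".xls", ".txt"}


def list_candidate_files(folder):
    """Return (matching, skipped) lists of files in folder, sorted by name."""
    matching, skipped = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories in this sketch
        # Compare case-insensitively so that .CSV and .csv are treated alike.
        target = matching if path.suffix.lower() in SUPPORTED else skipped
        target.append(path)
    return matching, skipped


if __name__ == "__main__":
    import sys
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    ok, other = list_candidate_files(folder)
    print("candidate files:", [p.name for p in ok])
    print("skipped files:", [p.name for p in other])
```

An empty list of candidate files suggests the problem is the path or the extensions rather than the filename format.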

Metadata extraction fails

# Use custom site ID position if your filename format is non-standard
airpy --input data/raw/ --output data/clean/ --siteid-position 1 2

Missing pollutant data

# Check which pollutants exist in your data
# Process specific available pollutants only
airpy --input data/raw/ --output data/clean/ --pollutants PM25 PM10
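
One way to check which pollutants exist in a file is to read its header row. The sketch below assumes the raw file is a CSV whose first row holds the column names and that those names match the pollutant names passed to `--pollutants`; neither assumption is confirmed by this page, and the helper name `available_pollutants` is hypothetical.

```python
import csv


def available_pollutants(csv_path, wanted):
    """Split `wanted` into names found and not found in the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        header = next(csv.reader(fh), [])  # empty file -> empty header
    # Normalise case and surrounding whitespace before comparing.
    columns = {name.strip().lower() for name in header}
    found = [p for p in wanted if p.lower() in columns]
    missing = [p for p in wanted if p.lower() not in columns]
    return found, missing


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        found, missing = available_pollutants(sys.argv[1], ["PM25", "PM10", "NO2"])
        print("found:", found)
        print("missing:", missing)
```

The names in `found` can then be passed to `--pollutants`.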

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research, please cite:

AirPy - CPCB Air Quality Data Processing Tool
https://github.com/chandankr014/airpy-tool
