Basic code to obtain probability distribution functions for EoL using TRI data

TRI4PLADS (FOCAPD SI)

License: MIT

Overview

This repository contains the code to generate discrete distributions based on TRI data, as part of the FOCAPD 2024 Special Issue invitation.
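
To illustrate the core idea, a discrete end-of-life (EoL) distribution can be obtained by normalizing reported quantities per pathway. The category names and values below are hypothetical, not actual TRI output:

```python
# Sketch: build a discrete probability distribution over EoL pathways
# by normalizing reported quantities. Category names and values are
# made up for illustration, not actual TRI output.
eol_quantities = {
    "recycling": 120.0,    # reported quantity per pathway (illustrative)
    "incineration": 45.0,
    "landfill": 35.0,
}

total = sum(eol_quantities.values())
eol_pmf = {pathway: qty / total for pathway, qty in eol_quantities.items()}

print(eol_pmf["recycling"])  # 0.6
```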

Project tree

.
├── ancillary
│   ├── cd_is_to_naics.csv
│   ├── tri_file_1a_columns.txt
│   ├── tri_file_1b_columns.txt
│   ├── tri_file_3a_columns.txt
│   └── tri_file_3c_columns.txt
├── conf
│   └── main.yaml
├── tests
├── data
│   ├── processed
│   │   └── tri_eol_additives.sqlite
│   └── raw
│       ├── US_1a_2022.txt
│       ├── US_1b_2022.txt
│       ├── US_3a_2022.txt
│       └── US_3c_2022.txt
└── src
    ├── __init__.py
    ├── data_processing
    │   ├── __init__.py
    │   ├── create_sqlite_db.py
    │   ├── data_models.py
    │   ├── frs_api_queries.py
    │   ├── base.py
    │   ├── main.py
    │   ├── naics_api_queries.py
    │   ├── cdr
    │   │   ├── __init__.py
    │   │   ├── cleaner.py
    │   │   ├── load.py
    │   │   └── orchestator.py
    │   └── tri
    │       ├── __init__.py
    │       ├── load
    │       │   ├── __init__.py
    │       │   └── load.py
    │       ├── orchestator.py
    │       ├── transform
    │       │   ├── __init__.py
    │       │   ├── base.py
    │       │   ├── file_1a.py
    │       │   ├── file_1b.py
    │       │   ├── file_3a.py
    │       │   └── file_3c.py
    │       └── utils.py
    └── generate_analysis
        ├── __init__.py
        ├── main.py
        ├── db_queries.py
        └── interactive_cli.py

Entity relational diagram (ERD)


Requirements

  1. Python >=3.12, <3.13
  2. Poetry

Poetry

New Dependencies

When adding or updating dependencies, run poetry add or poetry update and commit the changes.

pull

When pulling the latest changes, run the following command to ensure that your local environment matches the project's dependencies.

poetry install

Run Commands

To execute commands inside the project's environment, use poetry run as follows:

poetry run python src/main.py

Additionally, you can activate the virtual environment by running the following command:

poetry shell

Pre-commit

Changes

If there is any change in .pre-commit-config.yaml, run the following command:

poetry run pre-commit autoupdate

Pull

Each time you pull changes, run the following command to ensure your local environment is up-to-date:

poetry run pre-commit install

Manually Run Hooks

To manually run all pre-commit hooks on all files in the repository, use the following command:

poetry run pre-commit run --all-files

Note: this is not required when you commit changes.

If one or more hooks such as black or isort fail while you run the command above or commit your changes, stage their modifications with git add. After that, you can run the commit again.
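
For reference, a minimal .pre-commit-config.yaml with black and isort hooks might look like the fragment below. The pinned rev values are illustrative; check the project's actual file for the versions in use:

```yaml
# Illustrative pre-commit configuration; rev values are examples only.
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
```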

Installing pyright language server for IDE typecheck highlighting

Detailed instructions: pyright

PyCharm

VSCode: search for Pylance on marketplace

Find the path to the pyright executable:

which pyright-langserver

Insert that path into the plugin configuration in your IDE as the path to the executable.

Documentation Style

The project follows the Google docstring style for documenting the code. The pre-commit hooks are configured to check this style.

Data Source and Processing

Census Bureau Data:

Get your API key at: link

Once you get your API key, include a .env file in the project root with the following:

CENSUS_DATA_API_KEY=<YOUR-CENSUS-DATA-API-KEY>

Replace <YOUR-CENSUS-DATA-API-KEY> with your actual API key.
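
One way to read the key at runtime is sketched below; it assumes the variable has already been loaded into the environment (e.g., via python-dotenv's load_dotenv() or by exporting it in your shell):

```python
import os


def get_census_api_key() -> str:
    """Return the Census API key from the environment.

    Assumes the .env file has already been loaded (for example with
    python-dotenv's load_dotenv()) or the variable is otherwise exported.
    """
    api_key = os.environ.get("CENSUS_DATA_API_KEY")
    if not api_key:
        raise RuntimeError("CENSUS_DATA_API_KEY is not set; check your .env file")
    return api_key
```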

For more information regarding the API data: link

U.S. EPA's Envirofacts

API documentation: link

Running the Data Processing Pipeline

This repository includes a data processing pipeline for handling TRI (Toxics Release Inventory) data, specifically focusing on plastic additives. The pipeline can be executed by specifying the year of data you want to process.

Running the Script

To run the data processing pipeline, navigate to the repository's main directory and execute the following command, replacing <year> with the desired year (e.g., 2022) and <bool> with True/False:

python src/data_processing/main.py --year <year> --is_drop_nan_percentage <bool>

See the help menu:

python src/data_processing/main.py --help
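
A sketch of how such a command-line interface can be parsed with argparse is shown below. The actual main.py may differ; the str2bool helper is illustrative, since argparse has no built-in True/False string parsing:

```python
import argparse


def str2bool(value: str) -> bool:
    """Interpret 'True'/'False'-style strings as booleans (illustrative helper)."""
    if value.lower() in {"true", "1", "yes"}:
        return True
    if value.lower() in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser(description="TRI data processing pipeline")
parser.add_argument("--year", type=int, required=True,
                    help="TRI reporting year, e.g. 2022")
parser.add_argument("--is_drop_nan_percentage", type=str2bool, default=False,
                    help="Drop records with missing percentage values")

# Parse a sample invocation instead of sys.argv for demonstration.
args = parser.parse_args(["--year", "2022", "--is_drop_nan_percentage", "True"])
print(args.year, args.is_drop_nan_percentage)  # 2022 True
```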

Changes to the database

If you generate changes to the database schema, create migrations by running:

alembic revision --autogenerate -m "<description-string>"

Then apply the migrations by running:

alembic upgrade head

TODO

TRI data retrieval

The TRI data is retrieved statically, not dynamically. Given the file sizes and for scalability, feel free to automate this process. Suggestions:

  1. Implement TRI data retrieval from EPA's Envirofacts API.
  2. Implement a web-scraping strategy like the one in the EoL4Chem repository.
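
For the first suggestion, Envirofacts exposes a REST interface where the table, column, filter value, row range, and output format are encoded in the URL path. The sketch below only constructs such a URL; the table and column names (TRI_FACILITY, STATE_ABBR) are illustrative and should be verified against EPA's Envirofacts API documentation before use:

```python
# Sketch: build an Envirofacts REST query URL. Table/column names below
# are illustrative; verify them against the Envirofacts API documentation.
BASE_URL = "https://data.epa.gov/efservice"


def build_envirofacts_url(table: str, column: str, value: str,
                          fmt: str = "JSON", rows: str = "0:99") -> str:
    """Assemble a path-style Envirofacts query URL (no request is sent)."""
    return f"{BASE_URL}/{table}/{column}/{value}/rows/{rows}/{fmt}"


url = build_envirofacts_url("TRI_FACILITY", "STATE_ABBR", "TX")
print(url)  # https://data.epa.gov/efservice/TRI_FACILITY/STATE_ABBR/TX/rows/0:99/JSON
```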

Feel free to further modularize the project tree for scalability and maintainability.

SQL database engine

If you modify the database engine (e.g., to PostgreSQL) or its name, include this information in the config file instead of hard-coding it, since that is less error-prone.
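
For example, the database URL could be assembled from values in conf/main.yaml rather than hard-coded. The config keys below (engine, name, host, port) are hypothetical; in practice they would be read from the YAML file:

```python
# Sketch: build a database URL from config values instead of hard-coding it.
# The dict stands in for a parsed conf/main.yaml; the keys are hypothetical.
config = {
    "db": {
        "engine": "postgresql",
        "name": "tri_eol_additives",
        "host": "localhost",
        "port": 5432,
    }
}

db = config["db"]
if db["engine"] == "sqlite":
    db_url = f"sqlite:///data/processed/{db['name']}.sqlite"
else:
    db_url = f"{db['engine']}://{db['host']}:{db['port']}/{db['name']}"

print(db_url)  # postgresql://localhost:5432/tri_eol_additives
```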

Feel free to use asynchronous queries to reduce the processing time.
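
One way to do this with the standard library is to off-load blocking sqlite3 queries to a thread pool and run them concurrently with asyncio, as sketched below; a native async driver such as aiosqlite (or asyncpg for PostgreSQL) would avoid the executor indirection. The table and data are illustrative:

```python
import asyncio
import sqlite3


async def fetch(conn: sqlite3.Connection, query: str):
    """Run a blocking query in the default thread pool and await the rows."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: conn.execute(query).fetchall())


async def main() -> list:
    # In-memory demo database; the real project uses tri_eol_additives.sqlite.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE releases (pathway TEXT, qty REAL)")
    conn.executemany("INSERT INTO releases VALUES (?, ?)",
                     [("recycling", 120.0), ("landfill", 35.0)])
    # Both queries are awaited concurrently.
    results = await asyncio.gather(
        fetch(conn, "SELECT COUNT(*) FROM releases"),
        fetch(conn, "SELECT SUM(qty) FROM releases"),
    )
    conn.close()
    return results


results = asyncio.run(main())
print(results)  # [[(2,)], [(155.0,)]]
```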

Testing

Feel free to add unit or integration tests for QA; as a suggestion, include them as a hook in the pre-commit file. Only smoke testing was used during the development of this project, and there is no coverage yet.
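
For instance, a minimal unit test for a hypothetical distribution-normalization helper could look like this; under pytest, a function named test_normalize would be collected and run automatically:

```python
# Sketch: a minimal unit test for a hypothetical normalization helper.
def normalize(quantities: dict) -> dict:
    """Normalize quantities into a discrete probability distribution."""
    total = sum(quantities.values())
    return {k: v / total for k, v in quantities.items()}


def test_normalize():
    pmf = normalize({"recycling": 3.0, "landfill": 1.0})
    assert abs(sum(pmf.values()) - 1.0) < 1e-9
    assert pmf["recycling"] == 0.75


test_normalize()  # pytest would discover and run this automatically
```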

Data orchestrator

Feel free to use a data orchestrator such as Airflow or Prefect. This becomes more important if you increase the data volume.

Note

The project structure follows a modular approach to facilitate expansion and maintainability. It also follows the single-responsibility principle and separation of concerns. Keep these principles in mind as part of good practices and clean code.

PyPI

The project was released as a Python package on PyPI.

Disclaimer

The views expressed in this article are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. Any mention of trade names, products, or services does not imply an endorsement by the U.S. Government or the U.S. Environmental Protection Agency. The U.S. Environmental Protection Agency does not endorse any commercial products, service, or enterprises.
