
My installable package for income prediction

Project description

d100_d400_income_predict

Overview

This repository provides a reproducible Docker environment pre-configured with everything needed to run the GLM and LGBM models for predicting high income based on the 1994 US Census dataset.

The main analysis can be found at: src/notebooks/final_report.ipynb

There are other sub-analysis files:

  • src/tests/benchmark_pandas_polars.py — a script that benchmarks Polars against Pandas when loading and cleaning this specific dataset.
  • src/notebooks/eda_cleaning.ipynb — exploratory data analysis, with many more charts and detail on how and why certain decisions were made in building the models.

Installation

There are two ways to install and run:

  • Directly from PyPI as a package (easiest)
  • Docker container (most robust, recommended for development)

Install and Run - Method 1, PyPI

1. Install the package

pip install income_predict_d100_d400

2. Run the Pipeline

python -m income_predict_d100_d400.pipeline

or create your own file and import income_predict_d100_d400:

import pandas as pd

from income_predict_d100_d400 import (
    TARGET,
    load_data,
    load_training_outputs,
    run_cleaning_pipeline,
    run_evaluation,
    run_split,
    run_training,
)

print("Starting Pipeline...")

file_path = load_data()
df_raw = pd.read_parquet(file_path)
run_cleaning_pipeline(df_raw)
run_split()
run_training()

results = load_training_outputs()

run_evaluation(
    results["test"],
    TARGET,
    results["glm_model"],
    results["lgbm_model"],
    results["train_features"],
)

print("Pipeline finished.")

Install and Run - Method 2, Docker

1. Download and install Docker Desktop (if you don't have it already)

2. Clone the Repository

git clone https://github.com/caitpj/d100_d400_income_predict.git
cd d100_d400_income_predict

3. Build the Docker Image

From the root of d100_d400_income_predict:

docker build -t conda-uciml .

4. Run the Model Pipeline

This runs the model pipeline in the Docker container: downloading the data, cleaning, training, tuning, and saving key data files and visualisations. It should take a minute or so to run.

docker run --rm --shm-size=2g \
-v "$(git rev-parse --show-toplevel):/app" \
-e PYTHONUNBUFFERED=1 \
-e OMP_NUM_THREADS=1 \
conda-uciml python src/income_predict_d100_d400/pipeline.py

5. Run Notebooks

docker run --rm -it \
-v "$(git rev-parse --show-toplevel):/app" \
-p 8888:8888 conda-uciml \
jupyter notebook --ip=0.0.0.0 \
--port=8888 --no-browser --allow-root

From the output of the command above, find the URL and paste it into a browser. It should start with: http://127.0.0.1:8888/?token=...

Extra Steps for Development

If you want to contribute or modify the code, you can run a development shell inside the Docker container. This ensures you are using the exact same environment as the production build.

  1. Enter the Development Shell. This command mounts your local working directory into the container, so any changes you make to the code in your local editor are instantly visible inside the container.
docker run --rm -it \
  -v "$(git rev-parse --show-toplevel):/app" \
  conda-uciml \
  /bin/bash
  2. Activate the Environment. Once inside the container, activate the project environment: conda activate d100_d400_env

  3. Run Tests & Checks. Since you are developing inside the container, run the quality checks manually before committing your code:

    • Run Unit Tests: pytest
    • Run Pre-commit Checks (Linting/Formatting): pre-commit run --all-files
    • Run Benchmark Tests, e.g.: python src/benchmarks/benchmark_csv_parquet.py

AI Use

Some code was AI generated, notably:

  • Visualisations
  • Refactor from Pandas to Polars
  • Pretty terminal outputs
  • Full docstrings

In other areas, AI was used to help with debugging, notably:

  • Docker related issues
  • Performance issues with hyperparameter tuning

All AI-generated code has been reviewed and is understood by the author.

Download files

Download the file for your platform.

Source Distribution

income_predict_d100_d400-1.0.0.tar.gz (34.3 kB view details)

Uploaded Source

Built Distribution


income_predict_d100_d400-1.0.0-py3-none-any.whl (39.1 kB view details)

Uploaded Python 3

File details

Details for the file income_predict_d100_d400-1.0.0.tar.gz.

File metadata

  • Download URL: income_predict_d100_d400-1.0.0.tar.gz
  • Size: 34.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for income_predict_d100_d400-1.0.0.tar.gz
Algorithm Hash digest
SHA256 9f665b15ba79ac47d71a469c829170435883626ea967bd285ae91e6120f8fec0
MD5 6a2a314ddc0b0bc7ec6d294956f7c5b4
BLAKE2b-256 dd8f260b1b95d85ca696c5143fe8aef37c4b679074eb2a536628fa472bfacbcb

See more details on using hashes here.

File details

Details for the file income_predict_d100_d400-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for income_predict_d100_d400-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 95cb3b18f24a3be1e6a1be922990eb4e8a2a939538295535a3fa08ac3242c4fb
MD5 8645a3a5590d710491e3e39c4c398e93
BLAKE2b-256 1a36157b5c25f224300f9a73b92c4ec5fc579b18a67cbf0190084b2c0b047a4d

