
My installable package for income prediction

Project description

d100_d400_income_predict

Overview

This repository provides a reproducible Docker environment pre-configured with everything needed to run the GLM and LGBM models for predicting high income based on the 1994 US census dataset.

The main analysis can be found at: src/notebooks/final_report.ipynb

There are two other sub-analysis files:

  • src/tests/benchmark_pandas_polars.py — a script that highlights the performance differences between Polars and Pandas when loading and cleaning this specific dataset.
  • src/notebooks/eda_cleaning.ipynb — exploratory data analysis, with many more charts and detail on how and why certain decisions were made in building the models.
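The benchmark script times the same load-and-clean work in each library. A minimal sketch of that approach (with hypothetical miniature data and column names, not the script's actual code) might look like:

```python
import io
import time

import pandas as pd

# Hypothetical miniature of the census CSV; the real benchmark uses the full dataset.
CSV_TEXT = "age,workclass\n39, State-gov\n50, Self-emp-not-inc\n38, Private\n"

def time_pandas_load_clean(csv_text: str) -> float:
    """Time loading the CSV and stripping stray whitespace with pandas."""
    start = time.perf_counter()
    df = pd.read_csv(io.StringIO(csv_text))
    # 1994 census fields carry leading spaces, so cleaning strips them
    df["workclass"] = df["workclass"].str.strip()
    return time.perf_counter() - start

# The Polars branch would do the same work with pl.read_csv(...) and a
# string-strip expression, and the two wall-clock timings are then compared.
print(f"pandas load+clean took {time_pandas_load_clean(CSV_TEXT):.6f}s")
```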

Installation

There are two ways to install and run:

  • Directly from PyPI as a package (easiest)
  • Docker container (most robust, recommended for development)

Install and Run - Method 1, PyPI

1. Install the package

pip install income_predict_d100_d400

2. Run the Pipeline

python -m income_predict_d100_d400.training_pipeline

or create your own file and import income_predict_d100_d400:

import pandas as pd

from income_predict_d100_d400.data import load_data
from income_predict_d100_d400.cleaning import run_cleaning_pipeline
from income_predict_d100_d400.evaluation import run_evaluation
from income_predict_d100_d400.model_training import (
    TARGET,
    load_training_outputs,
    run_split,
    run_training,
)

print("Starting Pipeline...")

file_path = load_data()
df_raw = pd.read_parquet(file_path)
run_cleaning_pipeline(df_raw)
run_split()
run_training()

results = load_training_outputs()

run_evaluation(
    results["test"],
    TARGET,
    results["glm_model"],
    results["lgbm_model"],
    results["train_features"],
)

print("Pipeline finished.")

Install and Run - Method 2, Docker

1. Download and install Docker Desktop (if you don't have it already)

2. Clone the Repository

git clone https://github.com/caitpj/d100_d400_income_predict.git
cd d100_d400_income_predict

3. Build the Docker Image

From the root of d100_d400_income_predict:

docker build -t conda-uciml .

4. Run the Model Pipeline

This runs the pipeline in the Docker container, including downloading the data, cleaning, training, tuning, and saving key visualisations. It should take a minute or so to run.

docker run --rm --shm-size=2g \
-e PYTHONWARNINGS=ignore \
-e PYTHONUNBUFFERED=1 \
-e OMP_NUM_THREADS=1 \
conda-uciml python src/income_predict_d100_d400/training_pipeline.py

5. Run Notebooks

docker run --rm -it \
-p 8888:8888 \
-v "$(pwd):/app" \
conda-uciml \
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root

From the output of the above command, copy the URL into a browser. It should start with: http://127.0.0.1:8888/?token=...

Extra Steps for Development

There are a few more steps needed if you want to develop this repo on your local machine.

To ensure code quality, I use pre-commit hooks that run locally on your machine before every commit, as well as pytest for unit tests. These require a local Conda environment on your host machine (not in Docker).

1. Download and install Miniconda (if you don't have it already)

  • Conda must be installed on your machine (it is used for local development and the git hooks).

2. Install required packages based on environment.yml

conda env update --file environment.yml --prune

3. Initialize Conda (you will need to restart your terminal)

conda init zsh

4. Activate the environment

conda activate d100_d300_env

5. Install the git hooks

pre-commit install

Now, every time you run git commit, your local machine will first automatically check that the commit meets the rules stated in .pre-commit-config.yaml. You can also run pytest against the src/tests directory.

AI Use

Some code was AI generated, notably:

  • Visualisations
  • Pandas vs Polars benchmark test
  • Pretty terminal outputs
  • Full docstrings

In other areas, AI was used to help with debugging, notably:

  • Docker related issues
  • Performance issues with hyperparameter tuning

All AI-generated code has been reviewed and is understood by the author.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

income_predict_d100_d400-0.1.2.tar.gz (30.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

income_predict_d100_d400-0.1.2-py3-none-any.whl (37.9 kB view details)

Uploaded Python 3

File details

Details for the file income_predict_d100_d400-0.1.2.tar.gz.

File metadata

  • Download URL: income_predict_d100_d400-0.1.2.tar.gz
  • Size: 30.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for income_predict_d100_d400-0.1.2.tar.gz

  • SHA256: 3fa628fd443467ef921fb1e9cdfe4c4930c8fca4aeb8196773f40f0e6b9f3601
  • MD5: 2ace0b18b92c6a1d58b1a8e38f57a4fa
  • BLAKE2b-256: 4a3ea0d20e588fce948f441485453ae5c17aa89fd853be131394f187f6e66ebe

See more details on using hashes here.
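One common use of these hashes is to verify a downloaded file before installing it. A minimal sketch with Python's standard hashlib (sha256_of is an illustrative helper, not part of this package; the filename and digest are the ones listed above):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "3fa628fd443467ef921fb1e9cdfe4c4930c8fca4aeb8196773f40f0e6b9f3601"
# After downloading the sdist, compare:
# assert sha256_of("income_predict_d100_d400-0.1.2.tar.gz") == EXPECTED
```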

File details

Details for the file income_predict_d100_d400-0.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for income_predict_d100_d400-0.1.2-py3-none-any.whl

  • SHA256: 4f7f52d4ae7664bb5824a9367da613e29ceb5b5bdac3b163d133f934ee98ed88
  • MD5: 29797deeb3d02ce1d593b0f2a8299d0d
  • BLAKE2b-256: fcd556cd446c33307fde75986c055b7a6d2f43aa681fa24d1047ecf38f2653b9

See more details on using hashes here.
