My installable package for income prediction

Project description

d100_d400_income_predict

Overview

This repository provides a reproducible Docker environment pre-configured with everything needed to run the GLM and LGBM models for predicting high income based on the 1994 US census dataset.

The main analysis can be found at: src/notebooks/final_report.ipynb

There are also two supporting analysis files:

  • src/tests/benchmark_pandas_polars.py: a script that highlights the performance differences between Polars and Pandas when loading and cleaning the dataset.
  • src/notebooks/eda_cleaning.ipynb: exploratory data analysis, with more charts and detail on how and why certain decisions were made in building the models.
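
As an illustration of the kind of comparison the benchmark script makes, here is a minimal timing harness. It is a sketch only and is not taken from benchmark_pandas_polars.py: the names `time_call` and `compare` are hypothetical, and the workloads are stand-ins so that the example runs without Pandas or Polars installed.

```python
import statistics
import time
from typing import Callable, Dict


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time (seconds) of calling fn() `repeats` times."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare(candidates: Dict[str, Callable[[], object]], repeats: int = 5) -> Dict[str, float]:
    """Time each named callable and return {name: median seconds}."""
    return {name: time_call(fn, repeats) for name, fn in candidates.items()}


if __name__ == "__main__":
    # Stand-in workloads (hypothetical); swap in real loaders to compare libraries.
    results = compare(
        {
            "list_comprehension": lambda: [i * 2 for i in range(100_000)],
            "map_call": lambda: list(map(lambda i: i * 2, range(100_000))),
        }
    )
    # Print the fastest candidate first.
    for name, seconds in sorted(results.items(), key=lambda item: item[1]):
        print(f"{name}: {seconds * 1000:.2f} ms")
```

To compare the two libraries on a real file, the stand-in lambdas could be replaced with something like `lambda: pd.read_parquet(path)` and `lambda: pl.read_parquet(path)`, where `path` is whatever dataset file you have locally.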

Installation

There are two ways to install and run:

  • Directly from PyPI as a package (easiest)
  • Docker container (most robust, recommended for development)

Install and Run - Method 1, PyPI

1. Install the package

pip install income_predict_d100_d400
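
To confirm the install before running the pipeline, a small check like the following may help. It is a sketch that uses only the standard library; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the name passed to pip install.
    name = "income_predict_d100_d400"
    version = installed_version(name)
    if version is None:
        print(f"{name} is not installed in this environment")
    else:
        print(f"{name} {version} is installed")
```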

2. Run the Pipeline

python -m income_predict_d100_d400.training_pipeline

or create your own file and import income_predict_d100_d400:

import pandas as pd  # needed for pd.read_parquet below

from income_predict_d100_d400.data import run_data_fetch_pipeline
from income_predict_d100_d400.cleaning import run_cleaning_pipeline
from income_predict_d100_d400.evaluation import run_evaluation
from income_predict_d100_d400.model_training import (
    TARGET,
    load_training_outputs,
    run_split,
    run_training,
)

print("Starting Pipeline...")

file_path = run_data_fetch_pipeline()
df_raw = pd.read_parquet(file_path)
run_cleaning_pipeline(df_raw)
run_split()
run_training()

results = load_training_outputs()

run_evaluation(
    results["test"],
    TARGET,
    results["glm_model"],
    results["lgbm_model"],
    results["train_X"],
)

print("Pipeline finished.")

Install and Run - Method 2, Docker

1. Download and install Docker Desktop (if you don't have it already)

2. Clone the Repository

```
git clone https://github.com/caitpj/d100_d400_income_predict.git
cd d100_d400_income_predict
```

3. Build the Docker Image

`docker build -t conda-uciml .`

4. Run the Model Pipeline

This runs the model pipeline in the Docker container: downloading the data, cleaning, training, tuning, and saving key visualisations. It should take a minute or so to run.

```
docker run --rm --shm-size=2g \
-e PYTHONWARNINGS=ignore \
-e PYTHONUNBUFFERED=1 \
-e OMP_NUM_THREADS=1 \
conda-uciml python src/income_predict/training_pipeline.py
```

5. Run the final_report.ipynb Notebook

```
docker run --rm -it \
-p 8888:8888 \
-v "$(pwd):/app" \
conda-uciml \
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```

In the output of the command above, find the URL and paste it into a browser. It should start with: http://127.0.0.1:8888/?token=...
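
If the container output is long, the URL can also be picked out programmatically. The sketch below assumes the log contains a line with a `http://127.0.0.1:<port>/...?token=...` URL as described above; `find_notebook_url` is a hypothetical helper and the sample log text is made up for illustration.

```python
import re
from typing import Optional

# Matches a local Jupyter URL that carries a token query parameter.
TOKEN_URL = re.compile(r"https?://127\.0\.0\.1:\d+/\S*\?token=\w+")


def find_notebook_url(log_text: str) -> Optional[str]:
    """Return the first tokenised notebook URL found in log_text, or None."""
    match = TOKEN_URL.search(log_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Made-up log excerpt; real output will differ in wording and token value.
    sample_log = """
    Server is starting
        http://127.0.0.1:8888/?token=abc123
    Use Control-C to stop this server
    """
    print(find_notebook_url(sample_log))
```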

Development

There are a few more steps needed if you want to develop this repo on your local machine.

To ensure code quality, I use pre-commit hooks that run locally on your machine before every commit. This requires a local Conda environment on your host machine (not in Docker).

1. Download and install Miniconda (if you don't have it already)

  • Conda must be installed on your machine (for local development and git hooks).

2. Create or update the Conda environment (this installs pre-commit, black, mypy, etc. based on environment.yml)

`conda env update --file environment.yml --prune`

3. Initialize Conda for your shell (zsh shown here; you will need to restart your terminal afterwards)

`conda init zsh`

4. Activate the environment

`conda activate d100_d300_env`

5. Install the git hooks

`pre-commit install`

Now, every time you run git commit, your local machine will first automatically check that the commit meets the rules stated in .pre-commit-config.yaml.

AI Use

Some code was AI generated, notably:

  • Visualisations
  • Pandas vs Polars benchmark test

In other areas, AI was used to help with debugging, notably:

  • Docker related issues
  • Performance issues during hyperparameter tuning

Download files

Source Distribution

income_predict_d100_d400-0.1.1.tar.gz (27.0 kB)

Uploaded Source

Built Distribution

income_predict_d100_d400-0.1.1-py3-none-any.whl (32.7 kB)

Uploaded Python 3

File details

Details for the file income_predict_d100_d400-0.1.1.tar.gz.

File metadata

  • Download URL: income_predict_d100_d400-0.1.1.tar.gz
  • Size: 27.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for income_predict_d100_d400-0.1.1.tar.gz
Algorithm Hash digest
SHA256 cc32ea4fe30a7b771c0dd12b9addc2aec6c0777017d49459e017b421ae6055f0
MD5 e45ac71e01d0c825600362d923d80a25
BLAKE2b-256 90f5782d994db24d128f70fc7f8ef689d7b6a986fd4bde32d00b7d3f2f2187a8
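
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` and `matches_published` are hypothetical helper names, and the file path assumes the archive was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumption: the source distribution was downloaded into the working directory.
    archive = Path("income_predict_d100_d400-0.1.1.tar.gz")
    published = "cc32ea4fe30a7b771c0dd12b9addc2aec6c0777017d49459e017b421ae6055f0"
    if archive.exists():
        print("digest matches" if matches_published(archive, published) else "digest MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```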

File details

Details for the file income_predict_d100_d400-0.1.1-py3-none-any.whl.

File hashes

Hashes for income_predict_d100_d400-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 cbc02c5fc5f12633a586cd11ef3248cf35936cc340844168ac2d1e09000e3849
MD5 75c3676921b7325f6188caf165c850f5
BLAKE2b-256 d58b1f9b1e5b62bb416b48dd136414c394211a7942c7d9238f78d327b30a0a1c
