My installable package for income prediction
d100_d400_income_predict
Overview
This repository provides a reproducible Docker environment, pre-configured with everything needed to run the GLM and LGBM models that predict high income based on the 1994 US census dataset.
The main analysis can be found at: src/notebooks/final_report.ipynb
There are also two sub-analysis files:
- src/tests/benchmark_pandas_polars.py: script that highlights the performance differences between Polars and Pandas when loading and cleaning the dataset.
- src/notebooks/eda_cleaning.ipynb: exploratory data analysis, with many more charts and details on how and why certain decisions were made in building the models.
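The benchmark script itself is not reproduced here. As a rough illustration of how a load-and-clean comparison like this can be timed, the sketch below runs each candidate several times and keeps the fastest run. The names `best_time`, `compare`, `loader_a` and `loader_b` are hypothetical and are not taken from the repository; the lambdas are placeholder workloads standing in for real Pandas and Polars functions.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (seconds) over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(loaders: dict[str, Callable[[], object]], repeats: int = 5) -> dict[str, float]:
    """Time each named loader and return {name: best time in seconds}."""
    return {name: best_time(fn, repeats) for name, fn in loaders.items()}


if __name__ == "__main__":
    # Placeholder workloads (assumption): swap in the real load-and-clean functions.
    results = compare({
        "loader_a": lambda: sum(range(100_000)),
        "loader_b": lambda: sum(range(200_000)),
    })
    for name, seconds in results.items():
        print(f"{name}: {seconds:.6f} s")
```

Taking the minimum over several runs reduces noise from other processes; whether the repository's script does the same is not stated in this README.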
Running the model
1. Download and install Docker Desktop (if you don't have it already)
- link: Docker Desktop
2. Clone the Repository
```
git clone https://github.com/caitpj/d100_d400_income_predict.git
cd d100_d400_income_predict
```
3. Build the Docker Image
`docker build -t conda-uciml .`
4. Run the Model Pipeline
This runs the model in the Docker container, including downloading the data, cleaning, training, tuning, and saving key visualisations. It should take a minute or so to run.
```
docker run --rm --shm-size=2g \
-e PYTHONWARNINGS=ignore \
-e PYTHONUNBUFFERED=1 \
-e OMP_NUM_THREADS=1 \
conda-uciml python src/income_predict/training_pipeline.py
```
5. Run the final_report.ipynb Notebook
```
docker run --rm -it \
-p 8888:8888 \
-v "$(pwd):/app" \
conda-uciml \
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```
In the output of the command above, find the URL that starts with http://127.0.0.1:8888/?token=... and paste it into a browser.
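If the container output is long, the tokenized URL can also be picked out programmatically. The sketch below is a minimal example that assumes the log text has been captured as a string and that the token consists of lowercase hexadecimal characters. The function name `find_token_url` and the sample log are made up for illustration.

```python
import re
from typing import Optional

# Matches a local Jupyter URL on port 8888 that carries a login token.
TOKEN_URL = re.compile(r"http://127\.0\.0\.1:8888/\S*\?token=[0-9a-f]+")


def find_token_url(log_text: str) -> Optional[str]:
    """Return the first tokenized 127.0.0.1:8888 URL in the log, or None."""
    match = TOKEN_URL.search(log_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Made-up log excerpt (assumption); the token below is not a real one.
    sample = (
        "To access the server, copy and paste one of these URLs:\n"
        "    http://127.0.0.1:8888/?token=0123abcd\n"
    )
    print(find_token_url(sample))
```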
Development
There are a few more steps needed if you want to develop this repo on your local machine.
To ensure code quality, I use pre-commit hooks that run locally on your machine before every commit. This requires a local Conda environment on your host machine (not in Docker).
1. Download and install Miniconda (if you don't have it already)
- link: Conda (must be installed on your machine for local development and git hooks).
2. Create or update the Conda environment. This installs pre-commit, black, mypy, etc. based on environment.yml
`conda env update --file environment.yml --prune`
3. Initialize Conda for your shell (zsh is shown here; you will need to restart the terminal afterwards)
`conda init zsh`
4. Activate the environment
`conda activate d100_d300_env`
5. Install the git hooks
`pre-commit install`
Now, every time you run git commit, your local machine will first automatically check that the commit meets the rules stated in .pre-commit-config.yaml.
AI Use
Some code was AI generated, notably:
- Visualisations
- Pandas vs Polars benchmark test
In other areas, AI was used to help with debugging, notably:
- Docker related issues
Download files
File details
Details for the file income_predict_d100_d400-0.1.0.tar.gz.
File metadata
- Download URL: income_predict_d100_d400-0.1.0.tar.gz
- Upload date:
- Size: 26.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25e3d6a42387c596af619bd06797529eb224fe412cdc0c1ee56b4790b77e2d5c |
| MD5 | fcb470aaf19024d5cbf788eaccb696df |
| BLAKE2b-256 | 7a8b7f625ef5b6b5608885cc342d89aef260dfe8cbce78f30f5e38705811e148 |
File details
Details for the file income_predict_d100_d400-0.1.0-py3-none-any.whl.
File metadata
- Download URL: income_predict_d100_d400-0.1.0-py3-none-any.whl
- Upload date:
- Size: 32.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38712b191ffe926c503e2eeee6f4cfd0a34dc284f2330b9d1a6ee0a5322c6866 |
| MD5 | 72ec5d9ea98c55d39f4f225af39f6502 |
| BLAKE2b-256 | 2d5fad7303e52175a273d33811441d49591f04390a34aaa3a8ade3d463536880 |
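The published digests can be used to check a downloaded file before installing it. The sketch below computes a file's SHA256 with Python's standard `hashlib` module and compares it with an expected value. The helper names `sha256_of_file` and `matches_published_hash` are illustrative, and the file path is assumed to point to wherever you saved the download.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read until handle.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the wheel was saved to the current directory, `matches_published_hash("income_predict_d100_d400-0.1.0-py3-none-any.whl", "38712b191ffe926c503e2eeee6f4cfd0a34dc284f2330b9d1a6ee0a5322c6866")` should return `True` for an intact download.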