A pip-installable pipeline which loads 2009 FM data into an XGBoost model instance
Project description
Welcome to Risknet! This is a pip-installable package that lets you download and run an XGBoost pipeline.
Folder/File Layout
- src/risknet
  - config: holds conf.yaml
    - conf.yaml: helps with setup
    - handlers.py
  - data: empty folder where user can store FM dataset
  - jobs
    - cloud_etl.py: helps save files to cloud
  - proc: contains preprocessing steps like feature encoding, label prep, and train-test splits
    - encoder.py: feature engineering/encoding categoricals
    - label_prep.py: defines default, progress on loan
    - reducer.py: reduces features based on importance, also train/test/val splits
  - run: contains files for running pipeline
    - main.py: currently empty. WIP define pipeline as a function and call here
    - model.py: defines the model class + functions
    - pipeline.py: calls functions to execute the pipeline
  - sys: contains files to set up system environment and logging
    - log.py: sets up logger
    - managers.py: sets up the dask manager
    - system.py: defines creating and removing files via the sys package
  - main.py: logs start, stop time for running the program (including downloading packages from setup.cfg)
  - tests: store tests here
    - test_stub.py: currently only asserts True == True. No tests added yet.
Running The Code
Currently, this code is hosted on TestPyPI, a test instance of the Python Package Index. You can see package documentation here.
As described in the website above, you can access the code by running pip install -i https://test.pypi.org/simple/ risknet on your local computer. Ensure pip is updated.
- Note: this code might not work on your computer, as many of the package's dependencies, such as pandas, are not available for download via TestPyPI.
To run a specific part of the code, use risknet.utils.{feature_name} as is standard practice when accessing Python packages.
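The dependency note in this section has a commonly used workaround: ask pip to look on TestPyPI first and fall back to the main index for anything it cannot find there. The sketch below only builds that command; the helper name build_install_command is hypothetical, and whether this resolves every dependency of risknet is an assumption that has not been verified here.

```python
import sys

# Index URLs: TestPyPI hosts risknet, the main index hosts most dependencies.
TEST_INDEX = "https://test.pypi.org/simple/"
MAIN_INDEX = "https://pypi.org/simple/"


def build_install_command(package="risknet"):
    """Return a pip command that searches TestPyPI first, then the main index."""
    return [
        sys.executable, "-m", "pip", "install",
        "--index-url", TEST_INDEX,
        "--extra-index-url", MAIN_INDEX,
        package,
    ]


# To actually install, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(build_install_command(), check=True)
print(" ".join(build_install_command()))
```

Using sys.executable keeps the install tied to the interpreter that runs the script, which helps when several environments are present.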
Reproducibility Information:
Accessing Data
You may want to access our base dataset for reference purposes. Here's how you can do that:
In this study we will use the Freddie Mac Single-Family Loan Dataset to run code. Specifically, we will use the 2009_Q1 dataset.
- Go to this link here. This will redirect you to the Freddie Mac site.
- Scroll down until you reach a table that says "Loan-Level Dataset Files". Download the standard quarterly dataset.
- Submit necessary education credentials including name, email, and reason for accessing the files. There should not be a payment step. The site will email you a username and password.
- Reload the page and log in with your new email/password credentials.
- Download the Quarter 1 data from the year 2009. You will receive a .zip file in your Downloads folder.
- Unzip the file. You will see multiple files, including a "date_time" file and a "data" file.
- Save these files into the src/data folder in a local copy of this repository.
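The last two steps above can also be scripted. This is a minimal sketch using the standard library; the function name extract_quarter and the archive name in the usage comment are assumptions for illustration, so substitute the name of the .zip you actually downloaded.

```python
import zipfile
from pathlib import Path


def extract_quarter(zip_path, data_dir):
    """Extract every file in the archive into data_dir and list what is there."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    # Return the file names now present so they can be checked by eye.
    return sorted(p.name for p in data_dir.iterdir())


# Example call (hypothetical archive name, adjust to your download):
#   extract_quarter(Path.home() / "Downloads" / "2009_Q1.zip", "src/data")
```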
Dependencies
You can find a list of this package's dependencies in setup.cfg.
In summary, the downloads needed for this code are:
- numpy==1.26.1
- pandas==2.1.2
- dask[complete]==2023.10.1
- xgboost==2.0.1
- PyYAML
- types-PyYAML
- pyarrow
- fastparquet
- pytest
- pytest-cov
- mypy
- flake8
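To see whether an environment already matches the pinned versions listed above, something like the following sketch may help. It covers only the four packages that carry an exact pin in the list; the helper name find_mismatches is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the dependency list above.
PINNED = {
    "numpy": "1.26.1",
    "pandas": "2.1.2",
    "dask": "2023.10.1",
    "xgboost": "2.0.1",
}


def find_mismatches(pins):
    """Map each package whose installed version differs from its pin to the
    installed version, or to None when the package is not installed."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# An empty dict means every pinned package is present at the pinned version.
print(find_mismatches(PINNED))
```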
Steps to Update Version on TestPyPi
To update the version:
- Reset the code from the previous version (if necessary):
  - Run rm -rf dist build to remove the build folder.
  - Manually remove the "egg-info" folder. This will change src to src/risknet.
- Update setup.cfg's version number depending on whether the change is major, minor, or a bug fix.
- Rerun python3 -m build (you should get a new dist folder + egg folder in \src).
  - THIS SHOULD CREATE A NEW binary file where the version is UPDATED.
  - Make sure you're in the same directory as your setup.cfg when you run this command.
- Rerun python3 -m twine upload --repository testpypi dist/*
  - Username: __token__
  - Password: [testpypi password starting with pypi]
  - If you did NOT update the version # before running build, then you will get an error.
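The version-bump step can be scripted so it is not forgotten before building. This is a sketch under two assumptions: the version is stored as version = X.Y.Z under a [metadata] section of setup.cfg, and it has exactly three numeric parts. The function names are hypothetical. Note that configparser drops comments when it rewrites a file.

```python
import configparser


def bump_version(version, change):
    """Return the next version string for a 'major', 'minor', or 'bug' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "bug":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


def bump_setup_cfg(path, change):
    """Rewrite the version in setup.cfg and return the new version string."""
    # interpolation=None so '%' characters elsewhere in the file are left alone.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(path)
    new_version = bump_version(cfg["metadata"]["version"], change)
    cfg["metadata"]["version"] = new_version
    with open(path, "w") as handle:
        cfg.write(handle)  # comments in the original file are not preserved
    return new_version
```

Calling bump_setup_cfg("setup.cfg", "bug") before python3 -m build would avoid the upload error that occurs when the version number was not changed.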
When Uploading to PyPi:
Repeat steps above with these important differences:
- Use python3 -m twine upload dist/* to upload to PyPi. You do not need to specify --repository testpypi when uploading to PyPi.
- Login username will be the same. However, remember to use PyPi's login password/API token, not TestPyPi's token for the password.
Package Version History Documentation:
0.0.1: Ran into problems with installing pytest-cov
0.0.2: Got error:
ERROR: Could not find a version that satisfies the requirement dask[complete] (from risknet) (from versions: none)
ERROR: No matching distribution found for dask[complete]
0.0.3: Tried moving dask into [options.extras_require] not install_requires. Got error for flake8
0.0.4: Commented out more packages. Got error: error: package directory 'lib3/yaml' does not exist and metadata-generation-failed.
0.0.5: Moved types-YAML into options, still get 'lib3/yaml does not exist' error :/
0.0.6: we're removing YAML as a test. Got error:
ERROR: Could not find a version that satisfies the requirement typing (from risknet) (from versions: none)
ERROR: No matching distribution found for typing
0.0.10: reverting to version 0.0.6, checking if it works --success! But still error No matching distribution found for typing
0.0.11: what happens when we remove typing (since it's part of stdlib in python >3.5)?
0.0.12: commented out all packages added by EC
0.0.13: Got new error:
ERROR: Cannot install pandas==1.3.4 and risknet==0.0.12 because these package versions have conflicting dependencies.
The conflict is caused by:
risknet 0.0.12 depends on numpy
pandas 1.3.4 depends on numpy>=1.17.3; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10"
Solution: will try setting python > 3.10
SUCCESS! Version 0.0.13 can download from emily's (base) environment. Caveats:
- Probably can only work because it has all dependencies already installed in the env (it threw an error when I tried to run it in risknet_test)
- Can only import risknet.utils.label_prep, since risknet.utils.encoder, etc. have local imports to different .py files which Python can't read (??). But running >>> import risknet.utils.label_prep as label_prep, then >>> label_prep.label_proc(fm_root, data) works!!
0.0.14: change setup.py to if __name__ == "__main__": setup().
0.0.16: try compiling on base environment (python 3.12, pip 23.2)
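The 0.0.13 caveat about modules whose local imports Python cannot read matches a common packaging symptom: a module that imports a sibling file by bare name works when run from its own folder but fails once installed as part of a package. Whether that is the cause in risknet is not confirmed. The sketch below builds a throwaway package (demo_pkg, helpers, and encoder_like are all hypothetical names) to show a package-relative import resolving correctly.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root):
    """Write a tiny package whose module imports a sibling file relatively."""
    pkg = Path(root) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "helpers.py").write_text("def double(x):\n    return 2 * x\n")
    # 'from .helpers import ...' resolves against the package itself, so it
    # does not depend on the current working directory.
    (pkg / "encoder_like.py").write_text(
        "from .helpers import double\n\n"
        "def encode(x):\n"
        "    return double(x) + 1\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(tmp)
    sys.path.insert(0, tmp)  # make the temporary package importable
    try:
        module = importlib.import_module("demo_pkg.encoder_like")
        result = module.encode(3)
    finally:
        sys.path.remove(tmp)

print(result)
```

If the sibling import were written as a bare import helpers, importing demo_pkg.encoder_like from outside the package folder would raise ModuleNotFoundError.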
Download files
File details
Details for the file risknet-1.0.17.tar.gz.
File metadata
- Download URL: risknet-1.0.17.tar.gz
- Upload date:
- Size: 22.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.12.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 16629f86885a98b39e2d1e17c2dd271b7013bbc947d3fcced89560d3a7f3045a
MD5 | 93bcce2721cc083bc658c4cdd7bd328d
BLAKE2b-256 | 8ad36fce826b2f4ed0419083f7ba186729e76683a873ec64e48e39eca10983dc
File details
Details for the file risknet-1.0.17-py2.py3-none-any.whl.
File metadata
- Download URL: risknet-1.0.17-py2.py3-none-any.whl
- Upload date:
- Size: 22.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.12.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4044ae2daccc17273f273069a3e188ac7ff48a4df58f1bd4557bbf136ab0bc84
MD5 | 5916a7be2592968f6e1aa759abb626ed
BLAKE2b-256 | 2b602ccdfe9eb523716203744ba9431aadd5a425ec2b1221d3edc44514e595aa
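The SHA256 digests listed for each file can be used to confirm that a download is intact. A minimal sketch with the standard library follows; the function names sha256_of and matches_published are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == published_hex.strip().lower()


# Example call (path is wherever the file was saved):
#   matches_published("risknet-1.0.17.tar.gz",
#       "16629f86885a98b39e2d1e17c2dd271b7013bbc947d3fcced89560d3a7f3045a")
```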