Composable components for basic ML tasks
thds.mllegos Library
Why this library exists
At Trilliant Health we operate in one primary domain - medical claims data - so we have a lot of shared concepts across ML projects. Hence, it would be nice to gather shared utilities related to doing ML on that data in one place, and not have to reinvent many wheels. This library is meant to be that place.
Modularity and composability
As the name would suggest, mllegos is meant to be a home for small, modular, composable components. You
should be able to glue them together into new conglomerations that no one has thought of before - like
legos! If you're adding something, make sure it's small and does one thing well. Ask yourself the
following about any new lego blocks you want to add:
- Does my lego block do one thing well?
- Does it compose well with other lego blocks in the library?
- Does it work on a simple set of inputs of well-known type, not requiring the user to jump through hoops to get data into the right shape?
- Is it free of esoteric dependencies that would make it hard to use in a different context?
All the legos in the library strive to fulfill these criteria. When the lego blocks are small and do one thing well, they're easier to test, maintain, understand, and use in new and unexpected settings.
A sampling of the current legos
Eval
sklegos.eval contains the following legos:
- cls_report contains helper functions for working with the dict output of sklearn.metrics.classification_report
  - to_pandas turns it into a pandas DataFrame
  - multiclass_performance_viz creates an interactive scatterplot of performance metrics (precision/recall/f1-score/support) for all classes
- viz.basic contains some slightly lower-level wrappers for making more custom scatterplots and bar charts with pyecharts
- see notebooks/demo_sklearn_classification_report_tools.ipynb for working examples of all of the above
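For readers who have not seen the dict that sklearn.metrics.classification_report returns with output_dict=True, here is a minimal stdlib-only sketch of the kind of flattening a helper like to_pandas has to do. It is not the library's implementation: report_to_rows and the sample values are hypothetical, and the sample dict is hand-written in the shape sklearn produces.

```python
from typing import Dict, List, Union

Report = Dict[str, Union[float, Dict[str, float]]]
METRICS = ("precision", "recall", "f1-score", "support")


def report_to_rows(report: Report) -> List[Dict[str, object]]:
    """Flatten a classification_report-style dict into one row per label."""
    rows = []
    for label, metrics in report.items():
        if not isinstance(metrics, dict):
            # "accuracy" is a bare float rather than a per-metric dict; skip it.
            continue
        rows.append({"label": label, **{m: metrics[m] for m in METRICS}})
    return rows


# Hand-written sample in the shape produced with output_dict=True.
sample_report: Report = {
    "cat": {"precision": 0.5, "recall": 1.0, "f1-score": 2 / 3, "support": 1},
    "dog": {"precision": 1.0, "recall": 0.5, "f1-score": 2 / 3, "support": 2},
    "accuracy": 2 / 3,
    "macro avg": {"precision": 0.75, "recall": 0.75, "f1-score": 2 / 3, "support": 3},
}

rows = report_to_rows(sample_report)
for row in rows:
    print(row)
```

A list of row dicts like this can be passed straight to pandas.DataFrame if pandas is available. The detail worth noticing is the accuracy entry, which is a scalar and needs special handling.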
Feature extraction
sklegos.feature_extraction contains the following legos:
- OptimizedPrefixEncoding is useful when dealing with coding systems which use string structure to encode hierarchy membership from left to right (e.g. ICD-10 diagnoses/procedures, NUCC taxonomies). It dynamically estimates code vocabularies by aggregating evidence from rare codes to shared prefixes with better support.
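The idea of rolling rare codes up to better-supported prefixes can be sketched as below. This is an illustrative stdlib-only version under simple assumptions (a fixed min_support threshold, one character dropped per roll-up step, codes written without dots), not the algorithm OptimizedPrefixEncoding actually uses. The function names and toy codes are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def estimate_prefix_vocabulary(codes: Iterable[str], min_support: int) -> Set[str]:
    """Keep well-supported codes; pass the counts of rare codes to their prefixes."""
    pending = Counter(codes)
    vocabulary: Set[str] = set()
    max_len = max(map(len, pending), default=0)
    # Longest strings first, so evidence flows from specific codes to prefixes.
    for length in range(max_len, 0, -1):
        for code in [c for c in pending if len(c) == length]:
            count = pending.pop(code)
            if count >= min_support:
                vocabulary.add(code)
            elif length > 1:
                pending[code[:-1]] += count  # roll up to the parent prefix
    return vocabulary


def encode(code: str, vocabulary: Set[str]) -> str:
    """Map a code to its longest prefix in the vocabulary ('' if there is none)."""
    for end in range(len(code), 0, -1):
        if code[:end] in vocabulary:
            return code[:end]
    return ""


toy_codes = ["E119"] * 5 + ["E1165"] + ["E1169"] * 2 + ["I10"] * 4 + ["Z99"]
vocab = estimate_prefix_vocabulary(toy_codes, min_support=3)
print(sorted(vocab))
print(encode("E1165", vocab), repr(encode("Z99", vocab)))
```

In this toy run the two rare codes under E116 pool their counts and the shared prefix enters the vocabulary, while the singleton Z99 never gathers enough support at any prefix length and falls out.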
Feature selection
sklegos.feature_selection contains the following legos:
- DynamicFeatureSelector is useful when either:
  - you have a large number of features which correlate with your target, but you can't afford to fit a model with all of them
  - you have no features or very few features which correlate with your target

  This lego will first try your feature selector of choice. If that returns too few features, it selects some of the remaining features; if it returns too many, it further filters the selected features. Both fallbacks use a criterion of your choice.
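The fallback logic described above can be sketched in plain Python. This is a hedged illustration, not the DynamicFeatureSelector API: select_with_fallback, top_by_score, the min_features/max_features bounds, and the use of a single score dict as the fallback criterion are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence


def top_by_score(names: Sequence[str], score: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring names (the fallback criterion in this sketch)."""
    return sorted(names, key=lambda name: score[name], reverse=True)[:k]


def select_with_fallback(
    features: Sequence[str],
    primary: Callable[[Sequence[str]], List[str]],
    score: Dict[str, float],
    min_features: int,
    max_features: int,
) -> List[str]:
    """Try the primary selector, then top up or trim to stay within bounds."""
    selected = list(primary(features))
    if len(selected) < min_features:
        remaining = [f for f in features if f not in selected]
        selected += top_by_score(remaining, score, min_features - len(selected))
    elif len(selected) > max_features:
        selected = top_by_score(selected, score, max_features)
    return selected


# Made-up relevance scores for four hypothetical features.
score = {"age": 0.9, "sex": 0.1, "payer": 0.5, "specialty": 0.3}
features = list(score)


def strict(names: Sequence[str]) -> List[str]:
    return [n for n in names if score[n] > 0.8]  # selects too few


def lenient(names: Sequence[str]) -> List[str]:
    return [n for n in names if score[n] > 0.0]  # selects too many


topped_up = select_with_fallback(features, strict, score, 2, 3)
trimmed = select_with_fallback(features, lenient, score, 2, 3)
print(topped_up, trimmed)
```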
Imputation
sklegos.imputation contains the following legos:
- ConditionalImputer is a simple imputation scheme for filling in missing values of a continuous feature using point estimates from their conditional distributions given other discrete features. Useful when:
  - your model is incompatible with missing values
  - the discrete conditioning features are fine-grained enough to provide some specificity about the continuous feature
  - but with good enough support to enable a robust estimate.

  Example: imputing missing patient age in claims using a combination of payer type (medicare, medicaid, commercial, etc), sex, and provider specialty.
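A minimal sketch of conditional imputation, assuming the point estimate is a group mean and that groups with too little support fall back to the overall mean. ConditionalImputer itself may use different estimates and fallback rules; the function names, the min_support parameter, and the toy ages below are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def fit_conditional_means(
    keys: Sequence[Hashable], values: Sequence[Optional[float]], min_support: int = 2
) -> Tuple[Dict[Hashable, float], float]:
    """Estimate a mean per conditioning key, plus a global fallback mean."""
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for key, value in zip(keys, values):
        if value is not None:
            groups[key].append(value)
    # Assumes at least one observed value; mean() raises on empty input.
    fallback = mean(v for vs in groups.values() for v in vs)
    estimates = {k: mean(vs) for k, vs in groups.items() if len(vs) >= min_support}
    return estimates, fallback


def impute(keys, values, estimates, fallback) -> List[float]:
    """Fill each missing value with its group estimate, else the fallback."""
    return [
        estimates.get(key, fallback) if value is None else value
        for key, value in zip(keys, values)
    ]


# Toy data: (payer type, sex) as conditioning features, age as the target.
keys = [("medicare", "F")] * 3 + [("commercial", "M")] * 2 + [("medicaid", "F")]
ages = [70.0, 74.0, None, 40.0, None, None]

estimates, fallback = fit_conditional_means(keys, ages)
filled = impute(keys, ages, estimates, fallback)
print(filled)
```

The missing medicare age is filled from its own group, while the commercial and medicaid rows have too little support and fall back to the overall mean. That is the specificity-versus-support trade-off the bullets above describe.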
Modeling
sklegos.modeling contains the following legos:
- DiscreteFeatureSplit is a meta-estimator that fits a separate instance of an estimator of a fixed architecture on each subset of the training data that corresponds to a unique value of a discrete feature. Useful when:
  - the meaning and utility of many other features vary significantly depending on the value of the discrete feature
  - the discrete feature has enough support in many values to enable robust estimation of the model on each subset.

  Example: given a classification task to determine if an ambiguous DRG code is MS or AP, fit a separate classifier for each distinct DRG code. Criterion 1 is satisfied often since specific diagnoses or patient demographics are highly discriminative for specific DRGs but not globally. E.g. age and sex are not highly discriminative globally but are very useful for distinguishing AP 775 (alcohol abuse/dependence) from MS 775 (vaginal delivery).
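The split-and-fit pattern can be sketched without any framework. The classes below are hypothetical stand-ins, not the DiscreteFeatureSplit implementation: a trivial majority-vote classifier plays the role of the wrapped estimator, rows are plain tuples, and the data is made up.

```python
import copy
from collections import Counter, defaultdict
from typing import Any, Dict, List, Sequence, Tuple


class MajorityClassifier:
    """Toy estimator: always predicts the most common training label."""

    def fit(self, X: Sequence[tuple], y: Sequence[str]) -> "MajorityClassifier":
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: Sequence[tuple]) -> List[str]:
        return [self.label_ for _ in X]


class SplitByDiscreteFeature:
    """Fit one copy of `estimator` per unique value of column `split_index`."""

    def __init__(self, estimator: Any, split_index: int):
        self.estimator = estimator
        self.split_index = split_index

    def fit(self, X: Sequence[tuple], y: Sequence[str]) -> "SplitByDiscreteFeature":
        groups: Dict[Any, Tuple[list, list]] = defaultdict(lambda: ([], []))
        for row, label in zip(X, y):
            rows, labels = groups[row[self.split_index]]
            rows.append(row)
            labels.append(label)
        self.estimators_ = {
            value: copy.deepcopy(self.estimator).fit(rows, labels)
            for value, (rows, labels) in groups.items()
        }
        return self

    def predict(self, X: Sequence[tuple]) -> List[str]:
        # A value unseen during fit raises KeyError in this sketch.
        return [self.estimators_[row[self.split_index]].predict([row])[0] for row in X]


# Made-up toy rows: (DRG code, age); labels say which DRG system applies.
X = [("775", 29), ("775", 31), ("775", 45), ("123", 60), ("123", 65)]
y = ["MS", "MS", "AP", "AP", "AP"]

global_pred = MajorityClassifier().fit(X, y).predict([("775", 30)])
split_pred = SplitByDiscreteFeature(MajorityClassifier(), 0).fit(X, y).predict(
    [("775", 30), ("123", 50)]
)
print(global_pred, split_pred)
```

In an sklearn setting one would typically copy the prototype with sklearn.base.clone rather than copy.deepcopy. The sketch uses deepcopy only to stay dependency-free.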
- TreeStructuredLabelsClassifier is a meta-estimator that fits a separate instance of a classifier for each internal node of an arbitrary tree structure defined on labels (not features). The classifier should estimate conditional probabilities of the class given the features (i.e. it should support a predict_proba method). The meta-classifier estimates class probabilities by applying the chain rule of probability to the sub-classifier probabilities along the tree structure. Counterintuitively, this can result in more accurate predictions and also a large reduction of computational cost, depending on the size of the training set and the cardinality of the label set. A SubtreeExecutor callback enables pluggable parallelism: subtree fits can be dispatched to remote nodes, threads, or run sequentially (the default). A presentation of the idea and some initial research results on toy problems can be found here. Useful when:
  - the labels targeted by the classification task have a known hierarchical semantic structure (e.g. ICD-10 diagnoses/procedures, NUCC taxonomies)
  - the cardinality of the labels and the size of the dataset are such that the computational cost of training a classifier for each label is prohibitive.

  Ideas for future improvement: It would be nice to be able to learn a tree structure when there is no standard or explicit one available for the target classes. The point of the meta-classifier is to simplify the problem by decomposing it - why not use the confusion matrix from an initial global classifier to hierarchically group classes into groups which are highly confusable when tossed in with all the others, but might be more distinguishable when considered in isolation?
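The chain-rule step used by the tree-structured classifier can be illustrated on its own. In the sketch below the per-node conditional probabilities are hard-coded for a single sample; in practice they would come from each sub-classifier's predict_proba. The function name, the dict-based tree representation, and the numbers are assumptions for illustration, not the library's data structures.

```python
from typing import Dict, List

Tree = Dict[str, List[str]]              # internal node -> its children
NodeProbs = Dict[str, Dict[str, float]]  # internal node -> P(child | node, x)


def leaf_probabilities(tree: Tree, node_probs: NodeProbs, root: str) -> Dict[str, float]:
    """Multiply conditional probabilities along each root-to-leaf path."""
    result: Dict[str, float] = {}
    stack = [(root, 1.0)]
    while stack:
        node, prob = stack.pop()
        children = tree.get(node)
        if not children:
            result[node] = prob  # a leaf: prob is P(leaf | x)
            continue
        for child in children:
            stack.append((child, prob * node_probs[node][child]))
    return result


# Toy two-level label tree over four made-up leaf labels.
tree: Tree = {"root": ["E", "I"], "E": ["E10", "E11"], "I": ["I10", "I11"]}

# Hypothetical sub-classifier outputs for one sample x.
node_probs: NodeProbs = {
    "root": {"E": 0.8, "I": 0.2},
    "E": {"E10": 0.25, "E11": 0.75},
    "I": {"I10": 0.5, "I11": 0.5},
}

probs = leaf_probabilities(tree, node_probs, "root")
print(probs)
```

Because each internal node's conditional probabilities sum to one, the leaf probabilities do too. Each sub-classifier also only has to discriminate among its own children, which is where the potential cost savings come from.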
Structure of the library
thds.mllegos was started with the following initial submodules:
- feature_extraction is meant to contain generalized feature extraction utils.
  - code in here should be agnostic to any particular 3rd party library or framework. The idea is that you could plug it in to some code that does use such a framework, but you're leaving that interface as an extra layer of abstraction.
- sklegos is meant to contain any sklearn-compatible interfaces for e.g. feature selection, ensembling, feature extraction, imputation, etc.
  - Right now it contains feature_extraction, imputation, and modeling submodules, which are meant to contain utilities for those tasks
  - Most things in here should inherit from something in sklearn.base (e.g. BaseEstimator), and be compatible with the sklearn API
- util, as the name suggests, is meant to be a grab bag.
  - The main requirements for adding things there are that
    - most things in there should be pure python with few if any 3rd party library dependencies
    - new additions should be in clearly named submodules that encapsulate small modular bits of functionality (well, really that goes for the whole library 😄)
  - Right now for instance, there are some type aliases, high-level functional tools, and a couple data structures for working with trees and heaps
Note that, while most of the things I have implemented so far are sklearn-compatible, that's by no
means a requirement for additions to the library! Hence why I explicitly created an sklegos submodule
to put those things in, rather than just including them as root submodules. In fact, I didn't even want
sklearn to be a required transitive dependency of the library, hence my addition of an
sklearn extra
in the package spec - you have to explicitly opt in to bring on sklearn as a transitive dependency for
your project. I imagine in the future we could have similar submodules for other ML frameworks, e.g. a
torch-legos submodule (or firebricks? Naming things is fun!).
Feel free to add more submodules as needed (guidelines here)! If there's something you'd like to see added to the library, but you don't have time to get to it right away, feel free to add it to the wish list.
Contributing
If you have a small piece of code that you think would be useful to others, feel free to add it! Just follow the guidelines below.
We maintain a standing wish-list of features we'd like to see added to the library in this issue. If you're adding something, check to see if it's on that list so you can cross it off!
Tests
Most utils in the current library have good test coverage at this point. ML utils can be a bit delicate and nuanced, since they have to handle lots of strange cases that occur out in the wild world of real data. If you add anything marginally complicated, it should have unit tests that exercise various edge cases.
Documentation
Public-facing functions and classes should have docstrings. The main purpose of the component should be documented, and each of its inputs and outputs should be documented individually. Any quirks of behavior, gotchas, caveats, and limitations should be documented as well. All ML utils have a valid domain of applicability, and this domain is usually strictly narrower than the set of all inputs on which they would run without errors. For example, a KNN classifier might struggle with high-dimensional data, or a linear model might struggle with highly correlated features. Make sure your users know about these limitations!
Maintenance and compatibility
We often need to serialize models, and often we use pickle to do so. pickle is great because it will
serialize almost anything, but it can be a bit of a minefield; it's easy to accidentally make some
production artifact unserializable by changing some existing class def. Tips to keep in mind when making
changes, to avoid breaking serialization of existing models:
- Resist the urge to move class defs around in the codebase
- If you must (sometimes a re-org is a really good thing), and you suspect there are artifacts out there using the classes you're moving, keep the original python files around and replace the class defs with imports from the new location, with a comment explaining why you're doing this weirdness.
- Resist the urge to significantly change the functionality of existing classes, especially if they're used in production artifacts. You could be changing the behavior of existing models inadvertently! Usually, if something weird happens, we'll find out through metrics, but if you have a substantial feature to add, it's better to create a new-and-improved version of the class and deprecate the old one.
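To see why moving a class def breaks existing pickles, and why leaving an import at the old location fixes it, here is a small self-contained demonstration. The module names legacy_models and new_models and the Scaler class are hypothetical, and the modules are created in memory only so the example runs without any files on disk.

```python
import pickle
import sys
import types


def register_module(name: str) -> types.ModuleType:
    """Create an empty in-memory module so the demo needs no files on disk."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    return module


# Before the re-org: the class lives in "legacy_models".
old_module = register_module("legacy_models")
Scaler = type("Scaler", (), {"__module__": "legacy_models"})
old_module.Scaler = Scaler

model = Scaler()
model.factor = 2.5
# The pickle stores the import path "legacy_models.Scaler", not the class code.
artifact = pickle.dumps(model)

# The re-org: the class moves to "new_models" and vanishes from the old module.
new_module = register_module("new_models")
Scaler.__module__ = "new_models"
new_module.Scaler = Scaler
del old_module.Scaler

try:
    pickle.loads(artifact)
    broke = False
except AttributeError:
    broke = True  # the stored path no longer resolves

# The fix: keep a shim at the old location. In a real file this would be
#   from new_models import Scaler  # kept so existing pickles still load
old_module.Scaler = new_module.Scaler
restored = pickle.loads(artifact)
print(broke, restored.factor, type(restored).__module__)
```

The restored object is an instance of the class at its new location, so old artifacts pick up the moved definition transparently as long as the shim stays in place.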
File details
Details for the file thds_mllegos-1.0.20260501190816-py3-none-any.whl.
File metadata
- Download URL: thds_mllegos-1.0.20260501190816-py3-none-any.whl
- Upload date:
- Size: 43.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eb4f6be52dc5c05e4a463b3b6dba6b8f4bd387cbcabbfd1b62ec8d08a759c075 |
| MD5 | bf3c749c42b13b67d6a19a1b66a6e980 |
| BLAKE2b-256 | 1228122a02d352f4e5809d40ecd89f30718d848c9958cfe2024729f4a5f7cf94 |