
Composable components for basic ML tasks

Project description

thds.mllegos Library

Composable components for basic ML tasks

Why this library exists

At Trilliant Health we operate in one primary domain - medical claims data - so our ML projects share a lot of concepts. It makes sense to gather the shared utilities for doing ML on that data in one place rather than reinvent the same wheels in every project. This library is meant to be that place.

Modularity and composability

As the name would suggest, mllegos is meant to be a home for small, modular, composable components. You should be able to glue them together into new conglomerations that no one has thought of before - like legos! If you're adding something, make sure it's small and does one thing well. Ask yourself the following about any new lego blocks you want to add:

  • Does my lego block do one thing well?
  • Does it compose well with other lego blocks in the library?
  • Does it work on a simple set of inputs of well-known type, not requiring the user to jump through hoops to get data into the right shape?
  • Does it avoid esoteric dependencies that would make it hard to use in a different context?

All the legos in the library strive to fulfill these criteria. When the lego blocks are small and do one thing well, they're easier to test, maintain, understand, and use in new and unexpected settings.

A sampling of the current legos

Eval

sklegos.eval contains the following legos:

  • cls_report contains helper functions for working with the dict output of sklearn.metrics.classification_report
    • to_pandas turns it into a pandas DataFrame
    • multiclass_performance_viz creates an interactive scatterplot of performance metrics (precision/recall/f1-score/support) for all classes
  • viz.basic contains some slightly lower-level wrappers for making more custom scatterplots and bar charts with pyecharts
  • See notebooks/demo_sklearn_classification_report_tools.ipynb for working examples of all of the above
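
For orientation, here is a hedged sketch of how these eval helpers might be wired up; the cls_report import path and exact signatures are assumptions based on the description above, so treat the demo notebook linked in the last bullet as the authoritative example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Assumed import path -- check the package for the exact location.
# from thds.mllegos.sklegos.eval import cls_report

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
report = classification_report(y, clf.predict(X), output_dict=True)

# df = cls_report.to_pandas(report)               # assumed: one row per class, one column per metric
# cls_report.multiclass_performance_viz(report)   # assumed: interactive precision/recall/f1/support plot
```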

Feature extraction

sklegos.feature_extraction contains the following legos:

  • OptimizedPrefixEncoding is useful when dealing with coding systems which use string structure to encode hierarchy membership from left to right (e.g. ICD-10 diagnoses/procedures, NUCC taxonomies). It dynamically estimates code vocabularies by aggregating evidence from rare codes to shared prefixes with better support.
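
To make the prefix-backoff idea concrete, here is a minimal, illustrative sketch of rolling rare codes up to better-supported prefixes. The function name, threshold parameter, and backoff rule are assumptions for illustration only; the real OptimizedPrefixEncoding lives in sklegos.feature_extraction and may build its vocabulary differently.

```python
from collections import Counter

def prefix_backoff(codes, min_support=50):
    """Map each code to its longest prefix whose aggregate count meets min_support.

    Hypothetical sketch of the idea, not the OptimizedPrefixEncoding API.
    """
    counts = Counter(codes)

    # Aggregate evidence from every code onto all of its prefixes.
    prefix_counts = Counter()
    for code, n in counts.items():
        for k in range(1, len(code) + 1):
            prefix_counts[code[:k]] += n

    mapping = {}
    for code in counts:
        # Walk from the full code toward shorter prefixes until support is sufficient.
        for k in range(len(code), 0, -1):
            if prefix_counts[code[:k]] >= min_support:
                mapping[code] = code[:k]
                break
        else:
            mapping[code] = code[:1]  # nothing met the threshold; fall back to the coarsest prefix
    return mapping
```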

Feature selection

sklegos.feature_selection contains the following legos:

  • DynamicFeatureSelector is useful when either:

    • you have a large number of features which correlate with your target, but you can't afford to fit a model with all of them
    • you have no features or very few features which correlate with your target

    This lego will first try your feature selector of choice, and if that returns too few or too many features, will either select some of the remaining features, or further filter the selected features, according to a criterion of your choice.
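
As a rough illustration of that behavior (not the DynamicFeatureSelector API), the sketch below uses SelectFdr as the primary selector and falls back to a univariate F-score ranking when too few or too many features survive; the function name, parameters, and fallback criterion are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, f_classif

def dynamic_select(X, y, min_features=10, max_features=200):
    """Hypothetical sketch: primary selector with a score-based fallback."""
    primary = SelectFdr(f_classif, alpha=0.05).fit(X, y)
    selected = np.flatnonzero(primary.get_support())
    scores = primary.scores_

    if len(selected) < min_features:
        # Too few survivors: top up with the best-scoring features overall.
        selected = np.argsort(scores)[::-1][:min_features]
    elif len(selected) > max_features:
        # Too many survivors: keep only the best-scoring among those selected.
        selected = selected[np.argsort(scores[selected])[::-1][:max_features]]
    return np.sort(selected)
```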

Imputation

sklegos.imputation contains the following legos:

  • ConditionalImputer is a simple imputation scheme for filling in missing values of a continuous feature using point estimates from their conditional distributions given other discrete features. Useful when:

    1. your model is incompatible with missing values
    2. the discrete conditioning features are fine-grained enough to provide some specificity about the continuous feature
    3. but those features still have enough support to enable a robust estimate.

    Example: imputing missing patient age in claims using a combination of payer type (medicare, medicaid, commercial, etc), sex, and provider specialty.
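
As a rough sketch of the idea (not the ConditionalImputer API), the function below fills missing values of a continuous column with group medians conditioned on discrete columns, falling back to the global median when a group has too little support; the function name, parameters, and column names in the usage comment are hypothetical.

```python
import pandas as pd

def conditional_impute(df, target, by, min_support=30):
    """Hypothetical sketch: fill NaNs in `target` with medians conditioned on `by` columns."""
    global_median = df[target].median()
    stats = df.groupby(by)[target].agg(group_median="median", support="count")

    # Only trust a group's median if the group has enough observations.
    stats["fill_value"] = stats["group_median"].where(stats["support"] >= min_support, global_median)

    fill = df.join(stats["fill_value"], on=by)["fill_value"]
    out = df.copy()
    out[target] = out[target].fillna(fill).fillna(global_median)
    return out

# e.g. claims = conditional_impute(claims, target="patient_age", by=["payer_type", "sex", "specialty"])
```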

Modeling

sklegos.modeling contains the following legos:

  • DiscreteFeatureSplit is a meta-estimator that fits a separate instance of an estimator of a fixed architecture on each subset of the training data that corresponds to a unique value of a discrete feature. Useful when:

    1. the meaning and utility of many other features vary significantly depending on the value of the discrete feature
    2. the discrete feature has enough support in many values to enable robust estimation of the model on each subset.

    Example: given a classification task to determine if an ambiguous DRG code is MS or AP, fit a separate classifier for each distinct DRG code (a minimal sketch of this per-value split appears after this list). Criterion 1 is often satisfied, since specific diagnoses or patient demographics are highly discriminative for specific DRGs but not globally. E.g. age and sex are not highly discriminative globally but are very useful for distinguishing AP 775 (alcohol abuse/dependence) from MS 775 (vaginal delivery).

  • TreeStructuredLabelsClassifier is a meta-estimator that fits a separate instance of a classifier for each internal node of an arbitrary tree structure defined on labels (not features). The classifier should estimate conditional probabilities of the class given the features (i.e. it should support a predict_proba method). The meta-classifier estimates class probabilities by applying the chain rule of probability to the sub-classifier probabilities along the tree structure (the probability arithmetic is sketched after this list). Counterintuitively, this can result in more accurate predictions and also a large reduction in computational cost, depending on the size of the training set and the cardinality of the label set. A SubtreeExecutor callback enables pluggable parallelism: subtree fits can be dispatched to remote nodes, threads, or run sequentially (the default). A presentation of the idea and some initial research results on toy problems can be found here. Useful when:

    1. the labels targeted by the classification task have a known hierarchical semantic structure (e.g. ICD-10 diagnoses/procedures, NUCC taxonomies)
    2. the cardinality of the labels and the size of the dataset are such that the computational cost of training a classifier for each label is prohibitive.

    Ideas for future improvement: It would be nice to be able to learn a tree structure when there is no standard or explicit one available for the target classes. The point of the meta-classifier is to simplify the problem by decomposing it - why not use the confusion matrix from an initial global classifier to hierarchically group classes that are highly confusable when tossed in with all the others, but might be more distinguishable when considered in isolation?
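
To make the per-value split concrete, here is a minimal sketch of fitting one clone of a base estimator per unique value of a discrete column. It is illustrative only; the real DiscreteFeatureSplit is an sklearn meta-estimator and its constructor and method signatures may differ.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_per_group(X, y, group, base=None):
    """Hypothetical sketch: fit one model per unique value of the discrete `group` array."""
    base = base if base is not None else LogisticRegression(max_iter=1000)
    models = {}
    for value in np.unique(group):
        mask = group == value
        models[value] = clone(base).fit(X[mask], y[mask])
    return models

def predict_per_group(models, X, group):
    # Rows whose group value was not seen at fit time are left as None.
    preds = np.empty(len(X), dtype=object)
    for value, model in models.items():
        mask = group == value
        if mask.any():
            preds[mask] = model.predict(X[mask])
    return preds
```

And here is the chain-rule arithmetic for a two-level label tree, written out explicitly. The real TreeStructuredLabelsClassifier handles arbitrary tree depth and pluggable SubtreeExecutor parallelism; the function and variable names below are hypothetical and show only how per-node probabilities combine.

```python
def tree_predict_proba(group_clf, leaf_clfs, x):
    """P(leaf | x) = P(group | x) * P(leaf | group, x) for a two-level label tree."""
    probs = {}
    for group, p_group in zip(group_clf.classes_, group_clf.predict_proba([x])[0]):
        leaf_clf = leaf_clfs[group]  # sub-classifier trained only on samples within this group
        for leaf, p_leaf in zip(leaf_clf.classes_, leaf_clf.predict_proba([x])[0]):
            probs[leaf] = p_group * p_leaf
    return probs  # values sum to 1 (up to numerical error) across all leaves
```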

Structure of the library

thds.mllegos was started with the following initial submodules:

  • feature_extraction is meant to contain generalized feature extraction utils.
    • code in here should be agnostic to any particular 3rd party library or framework. The idea is that you could plug it in to some code that does use such a framework, but you're leaving that interface as an extra layer of abstraction.
  • sklegos is meant to contain any sklearn-compatible interfaces for e.g. feature selection, ensembling, feature extraction, imputation, etc.
    • Right now it contains eval, feature_extraction, feature_selection, imputation, and modeling submodules, which are meant to contain utilities for those tasks
    • Most things in here should inherit from something in sklearn.base (e.g. BaseEstimator), and be compatible with the sklearn API (a minimal skeleton is sketched after this list)
  • util, as the name suggests, is meant to be a grab bag.
    • The main requirements for adding things there are that
      1. most things in there should be pure python with few if any 3rd party library dependencies
      2. new additions should be in clearly named submodules that encapsulate small modular bits of functionality (well, really that goes for the whole library 😄)
    • Right now for instance, there are some type aliases, high-level functional tools, and a couple data structures for working with trees and heaps
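
As a point of reference for contributions to sklegos, here is a minimal skeleton of an sklearn-compatible transformer. The class name and parameter are hypothetical; the conventions it follows (constructor parameters stored verbatim, learned attributes suffixed with an underscore, fit returning self) are the standard sklearn ones.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class MyLego(BaseEstimator, TransformerMixin):
    """Hypothetical skeleton of an sklearn-compatible lego."""

    def __init__(self, some_param=1.0):
        self.some_param = some_param  # store params verbatim so get_params/set_params work

    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]  # learned state gets a trailing underscore
        return self

    def transform(self, X):
        return X  # apply the learned transformation here
```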

Note that, while most of the things I have implemented so far are sklearn-compatible, that's by no means a requirement for additions to the library! That's why I explicitly created an sklegos submodule to hold those things, rather than just including them as root submodules. In fact, I didn't even want sklearn to be a required transitive dependency of the library, hence the sklearn extra in the package spec - you have to explicitly opt in to bring sklearn on as a transitive dependency of your project. I imagine in the future we could have similar submodules for other ML frameworks, e.g. a torch-legos submodule (or firebricks? Naming things is fun!).

Feel free to add more submodules as needed (guidelines here)! If there's something you'd like to see added to the library, but you don't have time to get to it right away, feel free to add it to the wish list.

Contributing

If you have a small piece of code that you think would be useful to others, feel free to add it! Just follow the guidelines below.

We maintain a standing wish-list of features we'd like to see added to the library in this issue. If you're adding something, check to see if it's on that list so you can cross it off!

Tests

Most utils in the current library have good test coverage at this point. ML utils can be a bit delicate and nuanced, since they have to handle lots of strange cases that occur out in the wild world of real data. If you add anything marginally complicated, it should have unit tests that exercise various edge cases.

Documentation

Public-facing functions and classes should have docstrings. The main purpose of the component should be documented, and each of its inputs and outputs should be documented individually. Any quirks of behavior, gotchas, caveats, and limitations should be documented as well. All ML utils have a valid domain of applicability, and this domain is usually strictly narrower than the set of all possible inputs on which the code would run without errors. For example, a KNN classifier might struggle with high-dimensional data, or a linear model might struggle with highly correlated features. Make sure your users know about these limitations!

Maintenance and compatibility

We often need to serialize models, and often we use pickle to do so. pickle is great because it will serialize almost anything, but it can be a bit of a minefield; it's easy to accidentally make an existing production artifact impossible to deserialize by changing an existing class def. Tips to keep in mind when making changes, to avoid breaking deserialization of existing models:

  • Resist the urge to move class defs around in the codebase
  • If you must (sometimes a re-org is a really good thing), and you suspect there are artifacts out there using the classes you're moving, keep the original python files around and replace the class defs with imports from the new location, with a comment explaining why you're doing this weirdness (see the sketch after this list).
  • Resist the urge to significantly change the functionality of existing classes, especially if they're used in production artifacts. You could be changing the behavior of existing models inadvertently! Usually, if something weird happens, we'll find out through metrics, but if you have a substantial feature to add, it's better to create a new-and-improved version of the class and deprecate the old one.
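
As a sketch of the second tip above (with hypothetical module and class names), the old module becomes a thin shim that re-exports the moved class, so existing pickles can still resolve the original import path:

```python
# old_module.py
# MyEstimator moved to new_module; keep this shim so that pickles saved against
# old_module.MyEstimator continue to unpickle. (Hypothetical names for illustration.)
from new_module import MyEstimator  # noqa: F401
```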



Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution


thds_mllegos-1.0.20260501190816-py3-none-any.whl (43.8 kB)

Uploaded Python 3

File details

Details for the file thds_mllegos-1.0.20260501190816-py3-none-any.whl.

File metadata

File hashes

Hashes for thds_mllegos-1.0.20260501190816-py3-none-any.whl
  • SHA256: eb4f6be52dc5c05e4a463b3b6dba6b8f4bd387cbcabbfd1b62ec8d08a759c075
  • MD5: bf3c749c42b13b67d6a19a1b66a6e980
  • BLAKE2b-256: 1228122a02d352f4e5809d40ecd89f30718d848c9958cfe2024729f4a5f7cf94

