Design-based Supervised Learning (DSL) framework in Python
DSL-Kit: Design-based Supervised Learning (Python)
Repository Overview
This repository hosts a Python implementation of the Design-based Supervised Learning (DSL) framework. Special thanks to Chandler L'Hommedieu for his help with the ideation and implementation of the Python version.
The primary goal of the Python implementation is to mirror the statistical methodology of the established R package, originally developed by Naoki Egami, and to produce results comparable to it.
DSL combines supervised machine learning with methods from survey statistics and econometrics to estimate regression models when outcome labels are available only for a subset of the data (partially labeled data) and each observation's probability of being labeled is known.
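To make the idea concrete, here is a minimal sketch of the design-based correction described in the original DSL papers. It is not the code used inside dsl_kit. It assumes predicted labels `y_pred` exist for every row, expert labels `y_true` exist only where `labeled == 1`, and `sample_prob` is the known probability of being labeled. The function name `design_adjusted_outcome` and the simulated data are made up for illustration.

```python
import numpy as np


def design_adjusted_outcome(y_pred, y_true, labeled, sample_prob):
    """Return y_pred - (labeled / sample_prob) * (y_pred - y_true).

    Hypothetical helper for illustration only. Unlabeled rows may hold NaN
    in y_true because their labels are never used.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    labeled = np.asarray(labeled, dtype=float)
    sample_prob = np.asarray(sample_prob, dtype=float)
    # Replace unobserved labels with 0 so NaN does not propagate; the
    # correction term is multiplied by labeled == 0 on those rows anyway.
    y_obs = np.where(labeled == 1, np.asarray(y_true, dtype=float), 0.0)
    return y_pred - (labeled / sample_prob) * (y_pred - y_obs)


# Simulated data (an assumption, not from the package or its tests).
rng = np.random.default_rng(0)
n = 100_000
y_true = rng.binomial(1, 0.3, size=n).astype(float)
y_pred = 0.35 + 0.5 * y_true          # deliberately biased predictions
sample_prob = np.full(n, 0.1)         # every row labeled with probability 0.1
labeled = rng.binomial(1, sample_prob)

# Hide the labels that an analyst would not actually observe.
y_seen = np.where(labeled == 1, y_true, np.nan)
y_tilde = design_adjusted_outcome(y_pred, y_seen, labeled, sample_prob)

print("mean of raw predictions:   ", y_pred.mean())
print("mean of adjusted outcomes: ", y_tilde.mean())
print("mean of true labels:       ", y_true.mean())
```

In this simulation the raw predictions are biased, while the adjusted outcomes average close to the true labels even though only about one row in ten is labeled. That is the property a downstream regression on the adjusted outcome relies on.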
Original R Package Documentation
For the theoretical background, detailed methodology, and original R package usage, please refer to the original package resources:
- Package Website & Vignettes: http://naokiegami.com/dsl
- Original R Package Repository: https://github.com/naoki-egami/dsl
Installation
Prerequisites
- Python 3.9+
- pip (Python package installer)
From PyPI
pip install dsl_kit
From Source
- Clone the repository:
    git clone https://github.com/Enan456/dsl-python.git
    cd dsl-python
- Create a virtual environment (recommended):
    python -m venv .venv
    source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
- Install in development mode:
    pip install -e .
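After either installation route, a quick way to confirm that the package is visible to the active interpreter is to query its metadata. The sketch below uses only the standard library. The helper name `installed_version` is hypothetical, and the distribution name `dsl_kit` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("dsl_kit")
if version is None:
    print("dsl_kit is not installed in this environment")
else:
    print(f"dsl_kit {version} is installed")
```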
Usage
The core estimation function is dsl(), imported from dsl_kit.dsl. Here's a basic example:
import pandas as pd
from patsy import dmatrices
from dsl_kit.dsl import dsl

# Prepare your data
# Your data should have:
# - outcome variable (y)
# - predictor variables (x1, x2, x3)
# - labeled: binary indicator for labeled data (1) or unlabeled data (0)
# - sample_prob: sampling probability for each observation
data = pd.read_csv("your_data.csv")  # placeholder path; load your own data here

# Define your model formula
formula = "y ~ x1 + x2 + x3"

# Prepare design matrix (X) and response (y)
y, X = dmatrices(formula, data, return_type="dataframe")

# Run DSL estimation
result = dsl(
    X=X.values,
    y=y.values.flatten(),  # Ensure y is 1D
    labeled_ind=data["labeled"].values,
    sample_prob=data["sample_prob"].values,
    model="logit",  # Use "logit" for binary outcomes, "lm" for continuous
    method="logistic",  # Use "logistic" for logit, "linear" for lm
)

# Access results
print(f"Convergence: {result.success}")
print(f"Iterations: {result.niter}")
print(f"Coefficients: {result.coefficients}")
print(f"Standard Errors: {result.standard_errors}")
For a complete example using the PanChen dataset, see the tests directory.
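If you want to experiment before working with real data, the inputs described in the example can be mocked up with synthetic values. The sketch below is an assumption-laden illustration: the data are simulated, the column names follow the usage example, and it stops short of calling dsl() so that it runs without the package installed. It builds the design matrix with NumPy rather than patsy because patsy's dmatrices drops rows with missing values by default, which would misalign X and y with the labeled and sample_prob arrays if unlabeled rows store the outcome as NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Simulated predictors (illustrative only).
data = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.binomial(1, 0.5, size=n).astype(float),
})

# Simulated binary outcome from a logistic model with made-up coefficients.
linear_index = (-0.5 + 0.8 * data["x1"] - 0.4 * data["x2"] + 0.3 * data["x3"]).to_numpy()
data["y"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-linear_index))).astype(float)

# Each row is selected for labeling with a known probability of 0.2.
data["sample_prob"] = 0.2
data["labeled"] = rng.binomial(1, data["sample_prob"].to_numpy())

# The outcome is unknown for unlabeled rows.
data.loc[data["labeled"] == 0, "y"] = np.nan

# Design matrix with an explicit intercept column; no rows are dropped.
predictors = ["x1", "x2", "x3"]
X = np.column_stack([np.ones(n), data[predictors].to_numpy(dtype=float)])
y = data["y"].to_numpy(dtype=float)
labeled_ind = data["labeled"].to_numpy()
sample_prob = data["sample_prob"].to_numpy()

# Basic alignment checks before handing the arrays to an estimator.
if not (len(X) == len(y) == len(labeled_ind) == len(sample_prob)):
    raise ValueError("X, y, labeled_ind and sample_prob must have the same length")
if not ((sample_prob > 0) & (sample_prob <= 1)).all():
    raise ValueError("sample_prob must lie in (0, 1]")

print(f"{int(labeled_ind.sum())} of {n} rows are labeled")
```

Whether dsl() accepts NaN in y for unlabeled rows is not stated here, so check the package's tests before passing arrays built this way.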
ELI5
Imagine you have a large dataset of images, but only a few of them are labeled with their contents. DSL is like having a smart algorithm that can learn from the labeled images to predict the contents of the unlabeled ones. It uses patterns and features from the known data to make educated guesses about the unknown data, helping you understand the entire dataset better. DSL is particularly useful when working with synthetic data, where you can generate additional labeled examples to improve the model's performance.
When you have synthetic data, you can create more examples that mimic the real data. DSL can then use these synthetic examples to learn more about the patterns in your data, making it even better at predicting the contents of unlabeled images. This approach is especially helpful when you have limited real data but need a robust model.
DSL can also help you find the best way to split your data for training and testing. By analyzing how well the model performs on different parts of your data, DSL can identify effective splits that improve model accuracy. Additionally, DSL can detect biases in synthetic data, ensuring that your model is fair and representative of the real-world data it will encounter.
Potential Applications
DSL can be used in various fields, such as:
- Social Sciences: Analyzing survey data where only a subset of responses are labeled.
- Machine Learning: Improving model performance when labeled data is limited.
- Econometrics: Estimating models with partially observed outcomes.
- Healthcare: Predicting patient outcomes with limited labeled data.
- Synthetic Data Generation: Creating and utilizing synthetic data to enhance model training and validation.
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
License
MIT License
Download files
File details
Details for the file dsl_kit-0.1.1.tar.gz.
File metadata
- Download URL: dsl_kit-0.1.1.tar.gz
- Upload date:
- Size: 10.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ebeb172c4df3d1fa16f98c4345fced360e60e56218e42455dc22d58ec3fb1965 |
| MD5 | bd813a4a8340b10103fc13c1b859050b |
| BLAKE2b-256 | 0c7bebc55249e77ba3c98bd297339ca6d8e4b13ac210356f0f111cfd27b99e90 |
File details
Details for the file dsl_kit-0.1.1-py3-none-any.whl.
File metadata
- Download URL: dsl_kit-0.1.1-py3-none-any.whl
- Upload date:
- Size: 8.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9101f7762a62c9d2e2976a03d391c1280b01b5226da83a343e291c4803d17c8e |
| MD5 | bb6d391f1944e70bdbf85735bae3efbd |
| BLAKE2b-256 | 5cbf1a9cb6103875fd162bbdea93fba480d57a5b56f6e8852dd79a1dd3fbd1be |
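The published digests can be used to check a downloaded file. The sketch below, using only the standard library, computes a file's SHA256 and compares it with the value listed for dsl_kit-0.1.1.tar.gz. The helper name `sha256_of_file` and the assumption that the archive sits in the current directory are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest listed for the source distribution dsl_kit-0.1.1.tar.gz.
EXPECTED_SHA256 = "ebeb172c4df3d1fa16f98c4345fced360e60e56218e42455dc22d58ec3fb1965"

archive = Path("dsl_kit-0.1.1.tar.gz")  # assumed download location
if archive.exists():
    matches = sha256_of_file(archive) == EXPECTED_SHA256
    print("digest matches" if matches else "digest MISMATCH, do not install")
else:
    print(f"{archive} not found; download it first")
```

pip can perform the same check at install time through hash-checking mode when a requirements file pins the digest with `--hash=sha256:...`.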