Design-based Supervised Learning (DSL) framework in Python

DSL-Kit: Design-based Supervised Learning (Python)

Repository Overview

This repository hosts a Python implementation of the Design-based Supervised Learning (DSL) framework. Special thanks to Chandler L'Hommedieu for his help with the ideation and implementation of the Python version.

The primary goal of the Python implementation is to closely mirror the statistical methodology of the established R package, originally developed by Naoki Egami, and to produce results comparable to it.

DSL combines supervised machine learning techniques with methods from survey statistics and econometrics to estimate regression models when outcome labels are available for only a subset of the data (partially labeled data) and each observation's probability of being selected for labeling is known.
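
To make the idea concrete, here is a minimal sketch of the design-adjusted outcome used in the DSL literature, y_tilde = y_hat - (R / pi) * (y_hat - y), where R is the labeled indicator and pi is the known labeling probability. It illustrates the concept only and is not the package's internal code. The function name `design_adjusted_outcome` and all numbers are hypothetical.

```python
import random


def design_adjusted_outcome(y, y_hat, labeled, prob):
    """Return y_hat - (labeled / prob) * (y_hat - y) for one observation.

    `y` is ignored (and may be None) when the observation is unlabeled.
    """
    if not labeled:
        return y_hat
    return y_hat - (y_hat - y) / prob


random.seed(0)
n, prob = 20_000, 0.2  # toy sample size and labeling probability (made up)

true_y, predicted, adjusted = [], [], []
for _ in range(n):
    y = 1.0 if random.random() < 0.3 else 0.0
    y_hat = 0.15 + 0.8 * y            # deliberately biased prediction
    labeled = random.random() < prob  # label each row with probability 0.2
    true_y.append(y)
    predicted.append(y_hat)
    adjusted.append(design_adjusted_outcome(y, y_hat, labeled, prob))

mean_true = sum(true_y) / n
mean_predicted = sum(predicted) / n
mean_adjusted = sum(adjusted) / n

print(f"true mean:            {mean_true:.3f}")
print(f"mean of predictions:  {mean_predicted:.3f}")
print(f"design-adjusted mean: {mean_adjusted:.3f}")
```

In this toy setup the raw predictions are biased, while the design-adjusted values average out close to the true mean because the correction term is reweighted by the known labeling probability.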

Original R Package Documentation

For the theoretical background, detailed methodology, and original R package usage, please refer to the original package resources:

Installation

Prerequisites

  • Python 3.9+
  • pip (Python package installer)

From PyPI

pip install dsl_kit

From Source

  1. Clone the repository:

    git clone https://github.com/Enan456/dsl-python.git
    cd dsl-python
    
  2. Create a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
    
  3. Install in development mode:

    pip install -e .
    

Usage

The core estimation function is dsl(), imported from dsl_kit.dsl. Here's a basic example:

import pandas as pd
from patsy import dmatrices
from dsl_kit.dsl import dsl

# Prepare your data
# `data` is assumed to be a pandas DataFrame that you have already loaded.
# It should contain:
# - the outcome variable (y)
# - the predictor variables (x1, x2, x3)
# - labeled: binary indicator, 1 for labeled rows and 0 for unlabeled rows
# - sample_prob: probability that each observation was sampled for labeling

# Define your model formula
formula = "y ~ x1 + x2 + x3"

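# Prepare design matrix (X) and response (y)
# Note: patsy drops rows with missing values by default, so check that X and y
# still have one row for every row of `data`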
y, X = dmatrices(formula, data, return_type="dataframe")

# Run DSL estimation
result = dsl(
    X=X.values,
    y=y.values.flatten(),  # Ensure y is 1D
    labeled_ind=data["labeled"].values,
    sample_prob=data["sample_prob"].values,
    model="logit",  # Use "logit" for binary outcomes, "lm" for continuous
    method="logistic"  # Use "logistic" for logit, "linear" for lm
)

# Access results
print(f"Convergence: {result.success}")
print(f"Iterations: {result.niter}")
print(f"Coefficients: {result.coefficients}")
print(f"Standard Errors: {result.standard_errors}")
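
The coefficients and standard errors printed above can be combined into approximate confidence intervals. The sketch below assumes the usual normal (Wald) approximation and that `result.coefficients` and `result.standard_errors` can be iterated as sequences of numbers. The helper `wald_intervals` is hypothetical and not part of dsl_kit, and the numbers are made up.

```python
from statistics import NormalDist


def wald_intervals(coefficients, standard_errors, level=0.95):
    """Return (lower, upper) normal-approximation intervals, one per coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    return [(b - z * se, b + z * se)
            for b, se in zip(coefficients, standard_errors)]


# Made-up values standing in for result.coefficients and result.standard_errors
coefs = [0.5, -1.2]
ses = [0.1, 0.3]

intervals = wald_intervals(coefs, ses)
for b, (lower, upper) in zip(coefs, intervals):
    print(f"{b:+.3f}  [{lower:+.3f}, {upper:+.3f}]")
```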

For a complete example using the PanChen dataset, see the tests directory.
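
Before running the estimator on your own data, it can help to sanity-check the labeling indicator and sampling probabilities described in the example's comments. The checks below are a sketch based on that description, not documented requirements of dsl_kit. The helper `check_dsl_inputs` is hypothetical, and the constant probability assumes labels were drawn by simple random sampling.

```python
def check_dsl_inputs(n_rows, labeled_ind, sample_prob):
    """Raise ValueError if the inputs are inconsistent with n_rows observations."""
    if len(labeled_ind) != n_rows or len(sample_prob) != n_rows:
        raise ValueError("labeled_ind and sample_prob need one entry per row")
    for i, (r, p) in enumerate(zip(labeled_ind, sample_prob)):
        if r not in (0, 1):
            raise ValueError(f"row {i}: labeled indicator must be 0 or 1")
        if not 0 < p <= 1:
            raise ValueError(f"row {i}: sampling probability must be in (0, 1]")


labeled = [1, 0, 0, 1, 0]

# Assumption: simple random sampling, so every row shares the same
# probability of being labeled, n_labeled / n_total.
prob = [sum(labeled) / len(labeled)] * len(labeled)

check_dsl_inputs(len(labeled), labeled, prob)
print(prob)
```

If labels were sampled with unequal probabilities, for example by stratum, each row would instead carry the probability that applied to it.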

ELI5

Imagine you have a large dataset of images, but only a few of them are labeled with their contents. DSL is like having a smart algorithm that learns from the labeled images to predict the contents of the unlabeled ones, and then uses the labeled subset to correct for prediction mistakes when you fit a model to the whole dataset. It uses patterns and features from the known data to make educated guesses about the unknown data, helping you understand the entire dataset better. DSL is particularly useful when working with synthetic data, where you can generate additional labeled examples to improve the model's performance.

When you have synthetic data, you can create more examples that mimic the real data. DSL can then use these synthetic examples to learn more about the patterns in your data, making it even better at predicting the contents of unlabeled images. This approach is especially helpful when you have limited real data but need a robust model.

DSL can also help you find the best way to split your data for training and testing. By analyzing how well the model performs on different parts of your data, DSL can identify effective splits that improve model accuracy. Additionally, DSL can detect biases in synthetic data, ensuring that your model is fair and representative of the real-world data it will encounter.

Potential Applications

DSL can be used in various fields, such as:

  • Social Sciences: Analyzing survey data where only a subset of responses is labeled.
  • Machine Learning: Improving model performance when labeled data is limited.
  • Econometrics: Estimating models with partially observed outcomes.
  • Healthcare: Predicting patient outcomes with limited labeled data.
  • Synthetic Data Generation: Creating and utilizing synthetic data to enhance model training and validation.

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

MIT License
