Design-based Supervised Learning (DSL) framework in Python

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering

DSL-Kit: Design-based Supervised Learning (Python)

Repository Overview

This repository hosts a Python implementation of the Design-based Supervised Learning (DSL) framework, developed in parallel with the original R package. Special thanks to Chandler L'Hommedieu for his help with the ideation and implementation of the Python version.

The primary goal of the Python implementation is to closely mirror the statistical methodology of the established R package, originally developed by Naoki Egami, and to produce comparable results.

DSL combines supervised machine learning techniques with methods from survey statistics and econometrics to estimate regression models when outcome labels are available only for a subset of the data sampled with known probabilities (partially labeled data).
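To make the idea concrete, here is a minimal sketch of a design-based correction for a simple mean, written with the standard library only. It is not this package's code: the helper name `design_adjusted_outcome`, the simulated classifier, and the constant labeling probability are all assumptions made for illustration. Each predicted label is adjusted by the prediction error on labeled rows, weighted by the inverse of the known labeling probability.

```python
import random

def design_adjusted_outcome(y_pred, y_true, labeled, prob):
    """Return y_pred - (R / pi) * (y_pred - y_true).

    y_true is only used when the row is labeled (R = 1).
    """
    if not labeled:
        return y_pred
    return y_pred - (y_pred - y_true) / prob

rng = random.Random(0)
n, prob = 20_000, 0.2  # population size and known labeling probability (assumed)

true_y, pred_y, labeled = [], [], []
for _ in range(n):
    y = 1 if rng.random() < 0.5 else 0
    # Hypothetical imperfect classifier: misses 40% of the positives.
    y_hat = 0 if (y == 1 and rng.random() < 0.4) else y
    true_y.append(y)
    pred_y.append(y_hat)
    labeled.append(rng.random() < prob)

true_mean = sum(true_y) / n
naive_mean = sum(pred_y) / n  # uses predictions only, so it inherits their bias
adjusted_mean = sum(
    design_adjusted_outcome(p, t, r, prob)
    for p, t, r in zip(pred_y, true_y, labeled)
) / n

print(f"true mean:     {true_mean:.3f}")
print(f"naive mean:    {naive_mean:.3f}")
print(f"adjusted mean: {adjusted_mean:.3f}")
```

In this simulation the naive mean of the predictions is pulled well below the true mean, while the adjusted mean stays close to it even though only about a fifth of the rows carry a true label. See the original R package documentation for the estimator as actually defined for regression models.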

Original R Package Documentation

For the theoretical background, detailed methodology, and original R package usage, please refer to the original package resources:

Package Website & Vignettes: http://naokiegami.com/dsl
Original R Package Repository: https://github.com/naoki-egami/dsl

Installation

Prerequisites

- Python 3.9+
- pip (Python package installer)

From PyPI

pip install dsl_kit

From Source

Clone the repository:

git clone https://github.com/Enan456/dsl-python.git
cd dsl-python

Create a virtual environment (recommended):

python -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`

Install in development mode:
```
pip install -e .
```
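After installing, a quick sanity check can confirm that the interpreter meets the stated prerequisite and that the package is importable. The sketch below uses only the standard library; `environment_report` is a hypothetical helper, and the import name `dsl_kit` is taken from the usage example in this README.

```python
import importlib.util
import sys

def environment_report(package="dsl_kit", min_python=(3, 9)):
    """Summarize whether the interpreter and package look ready to use."""
    return {
        # Prerequisite listed in this README: Python 3.9 or newer.
        "python_ok": sys.version_info >= min_python,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # find_spec returns None when a top-level package is not installed.
        "package_found": importlib.util.find_spec(package) is not None,
    }

for check, passed in environment_report().items():
    print(f"{check}: {passed}")
```

If `package_found` is `False`, the install step most likely ran against a different interpreter or virtual environment than the one running this check.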

Usage

The core estimation function is dsl_kit.dsl.dsl(). Here's a basic example:

import pandas as pd
from patsy import dmatrices
from dsl_kit.dsl import dsl

# Prepare your data
# `data` is assumed to be a pandas DataFrame you have already loaded, with:
# - outcome variable (y)
# - predictor variables (x1, x2, x3)
# - labeled: binary indicator for labeled data (1) or unlabeled data (0)
# - sample_prob: sampling probability for each observation

# Define your model formula
formula = "y ~ x1 + x2 + x3"

# Prepare design matrix (X) and response (y)
y, X = dmatrices(formula, data, return_type="dataframe")

# Run DSL estimation
result = dsl(
    X=X.values,
    y=y.values.flatten(),  # Ensure y is 1D
    labeled_ind=data["labeled"].values,
    sample_prob=data["sample_prob"].values,
    model="logit",  # Use "logit" for binary outcomes, "lm" for continuous
    method="logistic"  # Use "logistic" for logit, "linear" for lm
)

# Access results
print(f"Convergence: {result.success}")
print(f"Iterations: {result.niter}")
print(f"Coefficients: {result.coefficients}")
print(f"Standard Errors: {result.standard_errors}")

For a complete example using the PanChen dataset, see the tests directory.
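Before calling the estimator, it can help to check that the inputs match the expectations listed in the comments above. The following is a sketch, not part of the package: `check_dsl_inputs` and `VALID_PAIRS` are hypothetical names, the model/method pairing is taken from the comments in the usage example, and the requirement that every sampling probability lie in (0, 1] is an assumption.

```python
# Pairing taken from the usage example's comments: "logit" with "logistic",
# "lm" with "linear". Other combinations are treated as mistakes here.
VALID_PAIRS = {"logit": "logistic", "lm": "linear"}

def check_dsl_inputs(y, labeled_ind, sample_prob, model, method):
    """Raise ValueError if the inputs look inconsistent; return the labeled count."""
    n = len(y)
    if len(labeled_ind) != n or len(sample_prob) != n:
        raise ValueError("y, labeled_ind and sample_prob must have the same length")
    if any(r not in (0, 1) for r in labeled_ind):
        raise ValueError("labeled_ind must contain only 0 and 1")
    # Assumption: every observation needs a positive sampling probability.
    if any(not 0 < p <= 1 for p in sample_prob):
        raise ValueError("sample_prob must lie in (0, 1]")
    if VALID_PAIRS.get(model) != method:
        raise ValueError(f"model={model!r} does not pair with method={method!r}")
    return sum(labeled_ind)

n_labeled = check_dsl_inputs(
    y=[1, 0, 1, 0],
    labeled_ind=[1, 0, 1, 0],
    sample_prob=[0.5, 0.5, 0.5, 0.5],
    model="logit",
    method="logistic",
)
print(f"labeled observations: {n_labeled}")
```

The function accepts plain lists as shown, and the same checks also work on one-dimensional NumPy arrays such as `data["labeled"].values`.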

ELI5

Imagine you have a large dataset of images, but only a few of them are labeled with their contents. DSL is like having a smart algorithm that can learn from the labeled images to predict the contents of the unlabeled ones. It uses patterns and features from the known data to make educated guesses about the unknown data, helping you understand the entire dataset better. DSL is particularly useful when working with synthetic data, where you can generate additional labeled examples to improve the model's performance.

When you have synthetic data, you can create more examples that mimic the real data. DSL can then use these synthetic examples to learn more about the patterns in your data, making it even better at predicting the contents of unlabeled images. This approach is especially helpful when you have limited real data but need a robust model.

DSL can also help you find the best way to split your data for training and testing. By analyzing how well the model performs on different parts of your data, DSL can identify effective splits that improve model accuracy. Additionally, DSL can detect biases in synthetic data, ensuring that your model is fair and representative of the real-world data it will encounter.

Potential Applications

DSL can be used in various fields, such as:

- Social Sciences: Analyzing survey data where only a subset of responses is labeled.
- Machine Learning: Improving model performance when labeled data is limited.
- Econometrics: Estimating models with partially observed outcomes.
- Healthcare: Predicting patient outcomes with limited labeled data.
- Synthetic Data Generation: Creating and utilizing synthetic data to enhance model training and validation.

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

MIT License

Release history

Version 0.1.1, released Apr 6, 2025

Download files

Source Distribution

dsl_kit-0.1.1.tar.gz (10.3 kB), uploaded Apr 6, 2025

Built Distribution

dsl_kit-0.1.1-py3-none-any.whl (8.1 kB), uploaded Apr 6, 2025, Python 3

File details

Details for the file dsl_kit-0.1.1.tar.gz.

File metadata

Download URL: dsl_kit-0.1.1.tar.gz
Upload date: Apr 6, 2025
Size: 10.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for dsl_kit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`ebeb172c4df3d1fa16f98c4345fced360e60e56218e42455dc22d58ec3fb1965`
MD5	`bd813a4a8340b10103fc13c1b859050b`
BLAKE2b-256	`0c7bebc55249e77ba3c98bd297339ca6d8e4b13ac210356f0f111cfd27b99e90`
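The published digests can be used to verify a downloaded file. The sketch below checks the SHA256 value listed above with the standard library's `hashlib`; the helper names and the assumption that the archive sits in the current directory are illustrative only.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for dsl_kit-0.1.1.tar.gz.
EXPECTED_SHA256 = "ebeb172c4df3d1fa16f98c4345fced360e60e56218e42455dc22d58ec3fb1965"

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()

archive = Path("dsl_kit-0.1.1.tar.gz")  # assumed download location
if archive.exists():
    print("hash OK" if matches_published_hash(archive) else "hash MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

A mismatch means the file differs from the one that was uploaded and should not be installed.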

File details

Details for the file dsl_kit-0.1.1-py3-none-any.whl.

File metadata

Download URL: dsl_kit-0.1.1-py3-none-any.whl
Upload date: Apr 6, 2025
Size: 8.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for dsl_kit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9101f7762a62c9d2e2976a03d391c1280b01b5226da83a343e291c4803d17c8e`
MD5	`bb6d391f1944e70bdbf85735bae3efbd`
BLAKE2b-256	`5cbf1a9cb6103875fd162bbdea93fba480d57a5b56f6e8852dd79a1dd3fbd1be`
