Open-source platform for programmatic data labeling and weak supervision

These details have not been verified by PyPI

Project links

Project description

LabelForge

Open-source framework for programmatic weak supervision and data labeling

LabelForge is a research-oriented Python library for creating labeled datasets using weak supervision techniques. Inspired by academic research in programmatic labeling (Snorkel, Wrench), this tool allows researchers and practitioners to encode domain knowledge as simple labeling functions and combine them using probabilistic models to generate training labels for machine learning.

Overview
Installation
Quick Start
Core Concepts
API Reference
Examples
Research & Citations
Contributing
Community
License

Overview

Weak supervision addresses the bottleneck of manual data labeling by allowing users to write labeling functions (LFs) that programmatically assign labels based on heuristics, patterns, or external knowledge. LabelForge implements:

Labeling Functions: Simple Python functions that express domain knowledge
Label Model: Probabilistic model (EM algorithm) that learns LF accuracies and correlations
End-to-End Pipeline: From raw text to probabilistic training labels

Core Concepts

Labeling Functions (LFs): Simple functions that take an example and return a label or abstain. These encode domain expertise, heuristics, or weak signals.

Label Model: A generative model that estimates the true labels by learning the accuracy and correlation structure of the labeling functions.

Weak Supervision: The paradigm of using multiple noisy, programmatic supervision sources instead of manual labels.

labelforge run --config config.yaml

Complete automation from raw data to trained models with configurable workflows

Analytics and Monitoring

Real-time labeling function performance analysis

Installation

LabelForge requires Python 3.8+ and can be installed from source.

From Source (Recommended for Research)

git clone https://github.com/yourusername/LabelForge.git
cd LabelForge
pip install -e .

Development Installation

For contributing or extending the framework:

git clone https://github.com/yourusername/LabelForge.git
cd LabelForge
pip install -e ".[dev]"
pre-commit install  # Optional: install git hooks

Dependencies

Core: numpy, pandas, scipy, scikit-learn
CLI: click
Dev: pytest, black, flake8, mypy, pre-commit

Quick Start

Here's a minimal example showing the core workflow:

from labelforge import lf, LabelModel, apply_lfs, Example

# Create some example data
examples = [
    Example(text="Patient has type 2 diabetes"),
    Example(text="No diabetic symptoms observed"),
    Example(text="Blood glucose levels elevated"),
    Example(text="Regular checkup, no issues")
]

# Define labeling functions
@lf(name="diabetes_mention")
def has_diabetes_keyword(example):
    """Label examples mentioning diabetes directly."""
    return 1 if "diabetes" in example.text.lower() else 0

@lf(name="diabetes_indicators") 
def has_diabetes_indicators(example):
    """Label examples with diabetes-related terms."""
    indicators = ["glucose", "insulin", "diabetic"]
    text = example.text.lower()
    return 1 if any(term in text for term in indicators) else -1  # abstain if no match

# Apply labeling functions
lf_output = apply_lfs(examples)

# Train label model to combine LF outputs
label_model = LabelModel(cardinality=2)
label_model.fit(lf_output)

# Get probabilistic labels
probs = label_model.predict_proba(lf_output)
predictions = label_model.predict(lf_output)

print(f"Predictions: {predictions}")
print(f"Probabilities shape: {probs.shape}")

For more examples, see the examples/ directory and documentation.

API Reference

Core Classes

`@lf` decorator

@lf(name="my_function", tags={"type": "keyword"}, abstain_label=-1)
def my_labeling_function(example):
    """Your labeling logic here."""
    return label  # or abstain_label

`LabelModel`

# Generative model for learning from labeling functions
model = LabelModel(
    cardinality=2,        # Number of classes
    max_iter=100,         # EM iterations
    tol=1e-4,            # Convergence tolerance
    verbose=True
)
model.fit(lf_output)
probs = model.predict_proba(lf_output)

`apply_lfs`

# Apply all registered LFs to examples
lf_output = apply_lfs(examples)

# Apply specific LFs
lf_output = apply_lfs(examples, lfs=[lf1, lf2])

Command Line Interface

# View registered labeling functions
labelforge lf-list

# Analyze LF performance and conflicts  
labelforge lf-stats

# Test LFs on sample data
labelforge lf-test --dataset examples/data.json

# Run end-to-end pipeline
labelforge run --input data/ --output results/

Research & Citations

LabelForge implements concepts from several research papers in weak supervision:

Snorkel: Ratner et al. "Snorkel: Rapid Training Data Creation with Weak Supervision" (2017)
Data Programming: Ratner et al. "Data Programming: Creating Large Training Sets, Quickly" (2016)
Coral: Hancock et al. "Training Classifiers with Natural Language Explanations" (2018)

Using LabelForge in Research

If you use LabelForge in academic work, please consider citing:

@software{labelforge2025,
  title={LabelForge: Open-Source Framework for Programmatic Weak Supervision},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/LabelForge}
}

Related Work & Comparisons

Snorkel: Original weak supervision framework (Stanford)
Wrench: Benchmarking framework for weak supervision
cleanlab: Data-centric AI and label quality
skweak: Weak supervision for NLP (spaCy ecosystem)

Examples

Medical Text Classification

from labelforge import lf, load_example_data

# Load medical dataset
examples = load_example_data("medical_texts")

@lf(name="diabetes_keywords")
def diabetes_mention(example):
    keywords = ["diabetes", "diabetic", "glucose", "insulin"]
    return 1 if any(k in example.text.lower() for k in keywords) else 0

@lf(name="diabetes_medications")
def diabetes_drugs(example):
    drugs = ["metformin", "insulin", "glipizide", "glyburide"]
    return 1 if any(d in example.text.lower() for d in drugs) else 0

# See examples/medical_example.py for complete implementation

Sentiment Analysis

@lf(name="positive_sentiment")
def sentiment_positive(example):
    positive_words = ["excellent", "amazing", "love", "perfect", "great"]
    return 1 if any(word in example.text.lower() for word in positive_words) else 0

@lf(name="negative_sentiment") 
def sentiment_negative(example):
    negative_words = ["terrible", "awful", "hate", "worst", "bad"]
    return 1 if any(word in example.text.lower() for word in negative_words) else 0

Using External Models

# Example: Using pre-trained models as labeling functions
@lf(name="external_classifier")
def external_model_lf(example):
    # Your external model prediction logic
    confidence = external_model.predict_proba(example.text)[0].max()
    return 1 if confidence > 0.8 else -1  # abstain if low confidence

More examples available in the examples/ directory.

Architecture & Implementation

Core Components

Labeling Functions (lf.py)

Function decorator and registry system
Performance tracking and error handling
Support for abstention and metadata

Label Model (label_model.py)

Generative model P(L, Y) implementation
EM algorithm for parameter estimation
Handles correlations and class imbalance

Data Structures (types.py)

Example: Container for text and metadata
LFOutput: Vote matrix with analysis methods
Type hints and validation

Algorithm Details

The label model implements a generative approach:

Generative Model: P(L, Y) = P(Y) ∏ P(L_i | Y)
EM Algorithm: Alternates between:
- E-step: Compute P(Y | L) using current parameters
- M-step: Update parameters α (accuracies) and π (priors)
Parameter Learning:
- Accuracy: α_i = P(L_i = Y | Y)
- Priors: π_c = P(Y = c)

Performance & Benchmarks

Computational Complexity

LF Application: O(n × m) where n=examples, m=functions
EM Training: O(n × m × c × k) where c=classes, k=iterations
Memory: O(n × m) for vote matrix storage

Typical Performance

Dataset Size	Functions	Training Time	Memory Usage
1K examples	5 LFs	< 1s	~10MB
10K examples	10 LFs	~5s	~50MB
100K examples	20 LFs	~30s	~200MB

Performance scales linearly with dataset size and number of functions.


**Professional Support**
- 24/7 technical support
- Custom feature development
- On-site training and consultation
- SLA guarantees for uptime and performance

Contact [enterprise@labelforge.ai](mailto:enterprise@labelforge.ai) for enterprise licensing and support.

---

## Roadmap

### Current Status (v0.1.0)
- ✅ Core labeling function framework
- ✅ Probabilistic label model with EM algorithm  
- ✅ Command-line interface
- ✅ Basic analytics and visualization
- ✅ Example datasets and documentation

### Version 1.0 (Target: Q3 2025)
- 🚧 Web-based user interface
- 🚧 Advanced model diagnostics
- 🚧 Integration with popular ML frameworks
- 🚧 Comprehensive documentation and tutorials
- 🚧 Performance optimizations

### Version 1.1 (Target: Q4 2025)
- 📋 Discriminative model training pipeline
- 📋 Advanced conflict resolution algorithms
- 📋 Real-time monitoring dashboard
- 📋 Plugin system for extensibility

### Version 2.0 (Target: Q1 2026)
- 📋 LLM-enhanced labeling function generation
- 📋 Active learning integration
- 📋 Multi-modal data support (images, audio)
- 📋 Distributed computing support
- 📋 Enterprise features and support

### Future Releases
- 📋 AutoML integration
- 📋 Real-time streaming data support
- 📋 Advanced visualization and explainability
- 📋 Cloud deployment and scaling

---

## Contributing

## Contributing

LabelForge is an open-source project that welcomes contributions from researchers, students, and practitioners. We aim to build a collaborative tool for the research community.

### Ways to Contribute

- **Research**: Implement new algorithms, improve existing methods
- **Documentation**: Tutorials, examples, use case studies
- **Bug Reports**: Help us identify and fix issues
- **Feature Requests**: Suggest new functionality for the research community
- **Examples**: Contribute domain-specific examples and datasets

### Development Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/LabelForge.git
cd LabelForge

# Create development environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"

# Install pre-commit hooks (optional)
pre-commit install

# Run tests
pytest tests/ -v

# Run code quality checks
black src/ tests/
flake8 src/ tests/
mypy src/

Contribution Process

Fork the repository
Create a feature branch (git checkout -b feature/new-algorithm)
Add tests for your changes
Ensure all tests pass (pytest tests/)
Submit a pull request with clear description

See CONTRIBUTING.md for detailed guidelines.

Community

Getting Help

📖 Documentation: Browse the docs/ directory
🐛 Issues: Report bugs and request features
💬 Discussions: Share use cases and ask questions
📧 Contact: Reach out for research collaborations

Research Community

LabelForge is designed for:

Academic researchers studying weak supervision
NLP practitioners needing labeled data
Data scientists working with limited labeled datasets
Students learning about machine learning and data programming

Open Source Ecosystem

We aim to be a good citizen in the open-source ML ecosystem:

Interoperability: Works with scikit-learn, pandas, numpy
Standards: Follows Python packaging and typing standards
Testing: Comprehensive test suite with CI/CD
Documentation: Clear docs for users and contributors

License

LabelForge is released under the Apache 2.0 License, enabling both research and commercial use.

Copyright 2025 LabelForge Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

See LICENSE for the full license text.

Acknowledgments

Research Foundation
LabelForge builds upon foundational research in weak supervision:

Ratner et al. "Snorkel: Rapid Training Data Creation with Weak Supervision"
Bach et al. "Learning the Structure of Generative Models without Labeled Data"
Varma & Ré "Snuba: Automating Weak Supervision to Label Training Data"

Inspiration

Snorkel: Original weak supervision framework
Wrench: Comprehensive benchmarking platform
cleanlab: Data-centric AI principles

Contributors
Thanks to all contributors who help build and improve LabelForge. See CONTRIBUTORS.md for the full list.

Built with ❤️ for the research community

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jul 9, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

labelforge-0.1.0.tar.gz (30.8 kB view details)

Uploaded Jul 9, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

labelforge-0.1.0-py3-none-any.whl (24.7 kB view details)

Uploaded Jul 9, 2025 Python 3

File details

Details for the file labelforge-0.1.0.tar.gz.

File metadata

Download URL: labelforge-0.1.0.tar.gz
Upload date: Jul 9, 2025
Size: 30.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for labelforge-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`01cc8eea62b55c0a24b2ddbc86942a59108a1c6031cd679674e5a20ea6f18f8f`
MD5	`dc8d788f04f248a2b915b969d66b522f`
BLAKE2b-256	`4447f2f3e3b77af5056089f806cf5fa843849eabbc060ee5356f6c61e7c22cc2`

See more details on using hashes here.

File details

Details for the file labelforge-0.1.0-py3-none-any.whl.

File metadata

Download URL: labelforge-0.1.0-py3-none-any.whl
Upload date: Jul 9, 2025
Size: 24.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for labelforge-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9feaeefce4086806221788045cb2d957f49caa3023d06b69f27411263014e40e`
MD5	`bf1e8e30eba9c43c09f0786f0c1a6780`
BLAKE2b-256	`cf9f2b78e61f6c4526e8ccbbfe6f4743f9da58dc3dce0c196a5afc7a03dffdda`

See more details on using hashes here.

labelforge 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

LabelForge

LabelForge

Table of Contents

Overview

Core Concepts

Installation

From Source (Recommended for Research)

Development Installation

Dependencies

Quick Start

API Reference

Core Classes

@lf decorator

LabelModel

apply_lfs

Command Line Interface

Research & Citations

Using LabelForge in Research

Related Work & Comparisons

Examples

Medical Text Classification

Sentiment Analysis

Using External Models

Architecture & Implementation

Core Components

Algorithm Details

Performance & Benchmarks

Computational Complexity

Typical Performance

Contribution Process

Community

Getting Help

Research Community

Open Source Ecosystem

License

Acknowledgments

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`@lf` decorator

`LabelModel`

`apply_lfs`