Open-source platform for programmatic data labeling and weak supervision
LabelForge
Open-source framework for programmatic weak supervision and data labeling
LabelForge is a research-oriented Python library for creating labeled datasets using weak supervision techniques. Inspired by academic research in programmatic labeling (Snorkel, Wrench), this tool allows researchers and practitioners to encode domain knowledge as simple labeling functions and combine them using probabilistic models to generate training labels for machine learning.
Table of Contents
- Overview
- Installation
- Quick Start
- Core Concepts
- API Reference
- Examples
- Research & Citations
- Contributing
- Community
- License
Overview
Weak supervision addresses the bottleneck of manual data labeling by allowing users to write labeling functions (LFs) that programmatically assign labels based on heuristics, patterns, or external knowledge. LabelForge implements:
- Labeling Functions: Simple Python functions that express domain knowledge
- Label Model: Probabilistic model (EM algorithm) that learns LF accuracies and correlations
- End-to-End Pipeline: From raw text to probabilistic training labels
Core Concepts
Labeling Functions (LFs): Simple functions that take an example and return a label or abstain. These encode domain expertise, heuristics, or weak signals.
Label Model: A generative model that estimates the true labels by learning the accuracy and correlation structure of the labeling functions.
Weak Supervision: The paradigm of using multiple noisy, programmatic supervision sources instead of manual labels.
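To make these terms concrete, here is a minimal sketch in plain Python that does not use the LabelForge API. Two hypothetical labeling functions vote on each text, abstaining with -1 (the abstain value used elsewhere in this README), and a simple majority vote stands in for the label model. The names `ABSTAIN`, `lf_keyword`, `lf_negation`, and `majority_vote` are illustrative assumptions, not part of the library.

```python
from collections import Counter

ABSTAIN = -1  # assumed abstain value, matching the -1 used in this README


def lf_keyword(text):
    """Vote 1 when the text mentions diabetes, otherwise abstain."""
    return 1 if "diabetes" in text.lower() else ABSTAIN


def lf_negation(text):
    """Vote 0 when the text starts with a negation, otherwise abstain."""
    return 0 if text.lower().startswith("no ") else ABSTAIN


def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN).most_common(2)
    if not counts:
        return ABSTAIN
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


texts = [
    "Patient has type 2 diabetes",
    "No diabetic symptoms observed",
    "No diabetes in family history",
    "Regular checkup, no issues",
]
lfs = [lf_keyword, lf_negation]

# Vote matrix: one row per example, one column per labeling function.
vote_matrix = [[f(t) for f in lfs] for t in texts]
labels = [majority_vote(row) for row in vote_matrix]
```

The third text gets conflicting votes and the fourth gets none, so both end up as abstentions under majority vote. A label model aims to do better than this baseline by weighting each function according to its estimated accuracy.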
labelforge run --config config.yaml
Complete automation from raw data to trained models with configurable workflows
Analytics and Monitoring
- Real-time labeling function performance analysis
Installation
LabelForge requires Python 3.8+ and can be installed from source.
From Source (Recommended for Research)
git clone https://github.com/yourusername/LabelForge.git
cd LabelForge
pip install -e .
Development Installation
For contributing or extending the framework:
git clone https://github.com/yourusername/LabelForge.git
cd LabelForge
pip install -e ".[dev]"
pre-commit install # Optional: install git hooks
Dependencies
- Core: numpy, pandas, scipy, scikit-learn
- CLI: click
- Dev: pytest, black, flake8, mypy, pre-commit
Quick Start
Here's a minimal example showing the core workflow:
from labelforge import lf, LabelModel, apply_lfs, Example

# Create some example data
examples = [
    Example(text="Patient has type 2 diabetes"),
    Example(text="No diabetic symptoms observed"),
    Example(text="Blood glucose levels elevated"),
    Example(text="Regular checkup, no issues"),
]

# Define labeling functions
@lf(name="diabetes_mention")
def has_diabetes_keyword(example):
    """Label examples mentioning diabetes directly."""
    return 1 if "diabetes" in example.text.lower() else 0

@lf(name="diabetes_indicators")
def has_diabetes_indicators(example):
    """Label examples with diabetes-related terms."""
    indicators = ["glucose", "insulin", "diabetic"]
    text = example.text.lower()
    return 1 if any(term in text for term in indicators) else -1  # abstain if no match

# Apply labeling functions
lf_output = apply_lfs(examples)

# Train label model to combine LF outputs
label_model = LabelModel(cardinality=2)
label_model.fit(lf_output)

# Get probabilistic labels
probs = label_model.predict_proba(lf_output)
predictions = label_model.predict(lf_output)
print(f"Predictions: {predictions}")
print(f"Probabilities shape: {probs.shape}")
For more examples, see the examples/ directory and documentation.
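One common way to use probabilistic labels like those in the Quick Start is to keep only the examples the label model is confident about before training a downstream classifier. The sketch below is plain Python, independent of LabelForge. It assumes the probabilities are available as one row per example and one column per class; `select_confident` and the 0.8 threshold are hypothetical choices.

```python
def select_confident(probs, threshold=0.8):
    """Return (index, label, confidence) for rows whose top probability meets the threshold."""
    selected = []
    for i, row in enumerate(probs):
        label = max(range(len(row)), key=row.__getitem__)
        if row[label] >= threshold:
            selected.append((i, label, row[label]))
    return selected


# Hypothetical label-model output: one row per example, one column per class.
probs = [
    [0.05, 0.95],
    [0.90, 0.10],
    [0.55, 0.45],
    [0.50, 0.50],
]
confident = select_confident(probs, threshold=0.8)
```

Here only the first two rows survive the filter. An alternative is to keep every example and train on the probabilities as soft targets.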
API Reference
Core Classes
@lf decorator
@lf(name="my_function", tags={"type": "keyword"}, abstain_label=-1)
def my_labeling_function(example):
    """Your labeling logic here."""
    return label  # or abstain_label
LabelModel
# Generative model for learning from labeling functions
model = LabelModel(
    cardinality=2,  # Number of classes
    max_iter=100,   # EM iterations
    tol=1e-4,       # Convergence tolerance
    verbose=True,
)
model.fit(lf_output)
probs = model.predict_proba(lf_output)
apply_lfs
# Apply all registered LFs to examples
lf_output = apply_lfs(examples)
# Apply specific LFs
lf_output = apply_lfs(examples, lfs=[lf1, lf2])
Command Line Interface
# View registered labeling functions
labelforge lf-list
# Analyze LF performance and conflicts
labelforge lf-stats
# Test LFs on sample data
labelforge lf-test --dataset examples/data.json
# Run end-to-end pipeline
labelforge run --input data/ --output results/
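The exact output of `labelforge lf-stats` is not documented here. As a rough illustration of what analyzing LF performance and conflicts usually involves, the sketch below computes three standard weak-supervision statistics from a vote matrix in plain Python: coverage (how often an LF votes), overlap (how often it votes alongside another LF), and conflict (how often it disagrees with another LF). The function name `lf_summary` and the dictionary layout are assumptions.

```python
ABSTAIN = -1  # assumed abstain value


def lf_summary(vote_matrix):
    """Compute per-LF coverage, overlap, and conflict rates from a vote matrix."""
    n = len(vote_matrix)
    m = len(vote_matrix[0])
    stats = []
    for j in range(m):
        covered = overlaps = conflicts = 0
        for row in vote_matrix:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Non-abstain votes cast by the other LFs on the same example.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append({
            "coverage": covered / n,
            "overlap": overlaps / n,
            "conflict": conflicts / n,
        })
    return stats


# Rows are examples, columns are three hypothetical LFs.
votes = [
    [1, 1, -1],
    [1, 0, -1],
    [-1, 0, 0],
    [-1, -1, -1],
]
summary = lf_summary(votes)
```

An LF with high coverage but a high conflict rate is a good candidate for review, since the label model has to resolve each of those disagreements.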
Research & Citations
LabelForge implements concepts from several research papers in weak supervision:
- Snorkel: Ratner et al. "Snorkel: Rapid Training Data Creation with Weak Supervision" (2017)
- Data Programming: Ratner et al. "Data Programming: Creating Large Training Sets, Quickly" (2016)
- BabbleLabble: Hancock et al. "Training Classifiers with Natural Language Explanations" (2018)
Using LabelForge in Research
If you use LabelForge in academic work, please consider citing:
@software{labelforge2025,
title={LabelForge: Open-Source Framework for Programmatic Weak Supervision},
author={Your Name},
year={2025},
url={https://github.com/yourusername/LabelForge}
}
Related Work & Comparisons
- Snorkel: Original weak supervision framework (Stanford)
- Wrench: Benchmarking framework for weak supervision
- cleanlab: Data-centric AI and label quality
- skweak: Weak supervision for NLP (spaCy ecosystem)
Examples
Medical Text Classification
from labelforge import lf, load_example_data

# Load medical dataset
examples = load_example_data("medical_texts")

@lf(name="diabetes_keywords")
def diabetes_mention(example):
    keywords = ["diabetes", "diabetic", "glucose", "insulin"]
    return 1 if any(k in example.text.lower() for k in keywords) else 0

@lf(name="diabetes_medications")
def diabetes_drugs(example):
    drugs = ["metformin", "insulin", "glipizide", "glyburide"]
    return 1 if any(d in example.text.lower() for d in drugs) else 0

# See examples/medical_example.py for complete implementation
Sentiment Analysis
from labelforge import lf

# Assumed label convention: 1 = positive, 0 = negative, -1 = abstain
@lf(name="positive_sentiment")
def sentiment_positive(example):
    positive_words = ["excellent", "amazing", "love", "perfect", "great"]
    return 1 if any(word in example.text.lower() for word in positive_words) else -1

@lf(name="negative_sentiment")
def sentiment_negative(example):
    negative_words = ["terrible", "awful", "hate", "worst", "bad"]
    return 0 if any(word in example.text.lower() for word in negative_words) else -1
Using External Models
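# Example: Using a pre-trained model as a labeling function
# Assumes `external_model` is a scikit-learn-style classifier defined elsewhere
# whose predict_proba accepts a list of texts.
@lf(name="external_classifier")
def external_model_lf(example):
    probs = external_model.predict_proba([example.text])[0]
    label = int(probs.argmax())  # vote for the most likely class
    return label if probs[label] > 0.8 else -1  # abstain if low confidence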
More examples available in the examples/ directory.
Architecture & Implementation
Core Components
Labeling Functions (lf.py)
- Function decorator and registry system
- Performance tracking and error handling
- Support for abstention and metadata
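The decorator-and-registry pattern listed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the contents of lf.py: the names `labeling_function`, `REGISTRY`, and `apply_all` are hypothetical, and turning exceptions into abstentions is just one possible error-handling policy.

```python
import functools

ABSTAIN = -1   # assumed abstain value
REGISTRY = {}  # hypothetical global registry: LF name -> wrapped function


def labeling_function(name, tags=None):
    """Register a function under `name` and make it abstain instead of raising."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(example):
            try:
                return func(example)
            except Exception:
                # Sketch of error handling: a failing LF abstains.
                return ABSTAIN
        wrapper.lf_name = name
        wrapper.tags = dict(tags or {})
        REGISTRY[name] = wrapper
        return wrapper
    return decorator


@labeling_function(name="mentions_insulin", tags={"type": "keyword"})
def mentions_insulin(text):
    return 1 if "insulin" in text.lower() else ABSTAIN


@labeling_function(name="fragile")
def fragile(text):
    return 1 if text.split()[5] == "x" else 0  # raises IndexError on short texts


def apply_all(texts):
    """Apply every registered LF to every text and return the vote matrix."""
    return [[f(t) for f in REGISTRY.values()] for t in texts]


matrix = apply_all(["Started insulin therapy", "Routine visit"])
```

Because registration happens at decoration time, importing a module that defines labeling functions is enough to make them available to a later apply step.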
Label Model (label_model.py)
- Generative model P(L, Y) implementation
- EM algorithm for parameter estimation
- Handles correlations and class imbalance
Data Structures (types.py)
- Example: Container for text and metadata
- LFOutput: Vote matrix with analysis methods
- Type hints and validation
Algorithm Details
The label model implements a generative approach:
- Generative Model: P(L, Y) = P(Y) ∏_i P(L_i | Y)
- EM Algorithm: Alternates between:
  - E-step: Compute P(Y | L) using current parameters
  - M-step: Update parameters α (accuracies) and π (priors)
- Parameter Learning:
  - Accuracy: α_i = P(L_i = Y | Y)
  - Priors: π_c = P(Y = c)
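The EM procedure outlined above can be written out in plain Python. The sketch below is not LabelForge's implementation. It assumes one accuracy α_i per labeling function, spreads the error mass (1 - α_i) evenly over the wrong classes, treats abstentions (-1) as missing votes, and starts every accuracy at 0.7 so that the class labels are identifiable. The function name `fit_label_model` and the `init_acc` parameter are hypothetical; `max_iter` and `tol` mirror the `LabelModel` arguments shown in the API Reference.

```python
ABSTAIN = -1  # assumed abstain value


def fit_label_model(votes, cardinality=2, max_iter=100, tol=1e-4, init_acc=0.7):
    """EM for P(L, Y) = P(Y) * prod_i P(L_i | Y) with one accuracy per LF.

    Abstentions are treated as missing votes. Returns (alpha, pi, posteriors),
    where the posteriors come from the final E-step.
    """
    n, m = len(votes), len(votes[0])
    alpha = [init_acc] * m                   # alpha_i = P(L_i = Y | Y)
    pi = [1.0 / cardinality] * cardinality   # pi_c = P(Y = c)
    post = []
    for _ in range(max_iter):
        # E-step: posterior P(Y = c | L) for each example.
        post = []
        for row in votes:
            weights = []
            for c in range(cardinality):
                w = pi[c]
                for i, vote in enumerate(row):
                    if vote == ABSTAIN:
                        continue
                    w *= alpha[i] if vote == c else (1 - alpha[i]) / (cardinality - 1)
                weights.append(w)
            total = sum(weights)
            post.append([w / total for w in weights])
        # M-step: re-estimate priors and accuracies from the posteriors.
        new_pi = [sum(p[c] for p in post) / n for c in range(cardinality)]
        new_alpha = []
        for i in range(m):
            voted = [(row[i], p) for row, p in zip(votes, post) if row[i] != ABSTAIN]
            if voted:
                acc = sum(p[v] for v, p in voted) / len(voted)
            else:
                acc = alpha[i]
            # Keep accuracies away from 0 and 1 so no weight becomes exactly zero.
            new_alpha.append(min(max(acc, 1e-3), 1 - 1e-3))
        delta = max(abs(a - b) for a, b in zip(alpha + pi, new_alpha + new_pi))
        alpha, pi = new_alpha, new_pi
        if delta < tol:
            break
    return alpha, pi, post


# Three hypothetical LFs: the first two always agree, the third is close to random.
votes = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, -1],
    [0, 0, -1],
]
alpha, pi, post = fit_label_model(votes)
predictions = [max(range(len(p)), key=p.__getitem__) for p in post]
```

On this toy input the two agreeing functions receive high estimated accuracies and the third settles near 0.5, so the predictions follow the first two. Note that this independence model has no term for correlations between labeling functions. Two functions that always agree are counted as independent evidence.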
Performance & Benchmarks
Computational Complexity
- LF Application: O(n × m) where n=examples, m=functions
- EM Training: O(n × m × c × k) where c=classes, k=iterations
- Memory: O(n × m) for vote matrix storage
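As a back-of-the-envelope check on these formulas, the sketch below plugs in concrete sizes. The helper names are hypothetical, and the storage widths (1 byte per vote for int8, 8 bytes for int64) are assumptions about how a dense vote matrix might be stored, not a statement about LabelForge internals.

```python
def vote_matrix_bytes(n_examples, n_lfs, bytes_per_vote=1):
    """Size of a dense n x m vote matrix; 1 byte per vote assumes int8 storage."""
    return n_examples * n_lfs * bytes_per_vote


def em_operations(n_examples, n_lfs, n_classes, n_iters):
    """Leading-order operation count for EM training: n * m * c * k."""
    return n_examples * n_lfs * n_classes * n_iters


size_int8 = vote_matrix_bytes(100_000, 20)       # 2,000,000 bytes
size_int64 = vote_matrix_bytes(100_000, 20, 8)   # 16,000,000 bytes
ops = em_operations(100_000, 20, 2, 100)         # 400,000,000 operations
```

These counts cover only the vote matrix and the leading-order term, so they are not expected to match measured process memory or wall-clock time.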
Typical Performance
| Dataset Size | Functions | Training Time | Memory Usage |
|---|---|---|---|
| 1K examples | 5 LFs | < 1s | ~10MB |
| 10K examples | 10 LFs | ~5s | ~50MB |
| 100K examples | 20 LFs | ~30s | ~200MB |
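Per the complexity figures above, training time and memory grow approximately linearly with the number of examples and the number of labeling functions, assuming a fixed number of classes and EM iterations.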
**Professional Support**
- 24/7 technical support
- Custom feature development
- On-site training and consultation
- SLA guarantees for uptime and performance
Contact [enterprise@labelforge.ai](mailto:enterprise@labelforge.ai) for enterprise licensing and support.
---
## Roadmap
### Current Status (v0.1.0)
- ✅ Core labeling function framework
- ✅ Probabilistic label model with EM algorithm
- ✅ Command-line interface
- ✅ Basic analytics and visualization
- ✅ Example datasets and documentation
### Version 1.0 (Target: Q3 2025)
- 🚧 Web-based user interface
- 🚧 Advanced model diagnostics
- 🚧 Integration with popular ML frameworks
- 🚧 Comprehensive documentation and tutorials
- 🚧 Performance optimizations
### Version 1.1 (Target: Q4 2025)
- 📋 Discriminative model training pipeline
- 📋 Advanced conflict resolution algorithms
- 📋 Real-time monitoring dashboard
- 📋 Plugin system for extensibility
### Version 2.0 (Target: Q1 2026)
- 📋 LLM-enhanced labeling function generation
- 📋 Active learning integration
- 📋 Multi-modal data support (images, audio)
- 📋 Distributed computing support
- 📋 Enterprise features and support
### Future Releases
- 📋 AutoML integration
- 📋 Real-time streaming data support
- 📋 Advanced visualization and explainability
- 📋 Cloud deployment and scaling
---
## Contributing
LabelForge is an open-source project that welcomes contributions from researchers, students, and practitioners. We aim to build a collaborative tool for the research community.
### Ways to Contribute
- **Research**: Implement new algorithms, improve existing methods
- **Documentation**: Tutorials, examples, use case studies
- **Bug Reports**: Help us identify and fix issues
- **Feature Requests**: Suggest new functionality for the research community
- **Examples**: Contribute domain-specific examples and datasets
### Development Setup
```bash
# Clone the repository
git clone https://github.com/yourusername/LabelForge.git
cd LabelForge
# Create development environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -e ".[dev]"
# Install pre-commit hooks (optional)
pre-commit install
# Run tests
pytest tests/ -v
# Run code quality checks
black src/ tests/
flake8 src/ tests/
mypy src/
```
Contribution Process
- Fork the repository
- Create a feature branch (git checkout -b feature/new-algorithm)
- Add tests for your changes
- Ensure all tests pass (pytest tests/)
- Submit a pull request with a clear description
See CONTRIBUTING.md for detailed guidelines.
Community
Getting Help
- 📖 Documentation: Browse the docs/ directory
- 🐛 Issues: Report bugs and request features
- 💬 Discussions: Share use cases and ask questions
- 📧 Contact: Reach out for research collaborations
Research Community
LabelForge is designed for:
- Academic researchers studying weak supervision
- NLP practitioners needing labeled data
- Data scientists working with limited labeled datasets
- Students learning about machine learning and data programming
Open Source Ecosystem
We aim to be a good citizen in the open-source ML ecosystem:
- Interoperability: Works with scikit-learn, pandas, numpy
- Standards: Follows Python packaging and typing standards
- Testing: Comprehensive test suite with CI/CD
- Documentation: Clear docs for users and contributors
License
LabelForge is released under the Apache 2.0 License, enabling both research and commercial use.
Copyright 2025 LabelForge Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
See LICENSE for the full license text.
Acknowledgments
Research Foundation
LabelForge builds upon foundational research in weak supervision:
- Ratner et al. "Snorkel: Rapid Training Data Creation with Weak Supervision"
- Bach et al. "Learning the Structure of Generative Models without Labeled Data"
- Varma & Ré "Snuba: Automating Weak Supervision to Label Training Data"
Inspiration
- Snorkel: Original weak supervision framework
- Wrench: Comprehensive benchmarking platform
- cleanlab: Data-centric AI principles
Contributors
Thanks to all contributors who help build and improve LabelForge. See CONTRIBUTORS.md for the full list.
Built with ❤️ for the research community