Open-source platform for programmatic data labeling and weak supervision

LabelForge

Open-source framework for programmatic weak supervision and data labeling

LabelForge is a research-oriented Python library for creating labeled datasets using weak supervision techniques. Inspired by academic research in programmatic labeling (Snorkel, Wrench), this tool allows researchers and practitioners to encode domain knowledge as simple labeling functions and combine them using probabilistic models to generate training labels for machine learning.

Overview

Weak supervision addresses the bottleneck of manual data labeling by allowing users to write labeling functions (LFs) that programmatically assign labels based on heuristics, patterns, or external knowledge. LabelForge implements:

  • Labeling Functions: Simple Python functions that express domain knowledge
  • Label Model: Probabilistic model (EM algorithm) that learns LF accuracies and correlations
  • End-to-End Pipeline: From raw text to probabilistic training labels

Core Concepts

Labeling Functions (LFs): Simple functions that take an example and return a label or abstain. These encode domain expertise, heuristics, or weak signals.

Label Model: A generative model that estimates the true labels by learning the accuracy and correlation structure of the labeling functions.

Weak Supervision: The paradigm of using multiple noisy, programmatic supervision sources instead of manual labels.
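To make these concepts concrete, the sketch below builds a small vote matrix by hand with NumPy and combines it by majority vote, the simplest baseline that a label model tries to improve on. It does not use the LabelForge API: `ABSTAIN`, `votes`, and `majority_vote` are illustrative names, and -1 is assumed as the abstain value.

```python
import numpy as np

ABSTAIN = -1  # assumed abstain value for this sketch

# Rows are examples, columns are labeling functions (hypothetical votes).
votes = np.array([
    [1, 1, ABSTAIN],
    [0, ABSTAIN, 0],
    [1, 0, 1],
    [ABSTAIN, ABSTAIN, ABSTAIN],
])

def majority_vote(votes, cardinality=2):
    """Return the most-voted class per row, or ABSTAIN on ties or when no LF voted."""
    preds = np.full(votes.shape[0], ABSTAIN)
    for i, row in enumerate(votes):
        counts = np.bincount(row[row != ABSTAIN], minlength=cardinality)
        winners = np.flatnonzero(counts == counts.max())
        if counts.sum() > 0 and len(winners) == 1:
            preds[i] = winners[0]
    return preds

print(majority_vote(votes))  # [ 1  0  1 -1]
```

Majority vote treats every labeling function as equally reliable. A label model instead weights each function by an accuracy it estimates from the votes themselves.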

labelforge run --config config.yaml

Complete automation from raw data to trained models with configurable workflows

Analytics and Monitoring

  • Real-time labeling function performance analysis

Installation

LabelForge requires Python 3.8+ and can be installed from source.

From Source (Recommended for Research)

git clone https://github.com/yourusername/LabelForge.git
cd LabelForge
pip install -e .

Development Installation

For contributing or extending the framework:

git clone https://github.com/yourusername/LabelForge.git
cd LabelForge
pip install -e ".[dev]"
pre-commit install  # Optional: install git hooks

Dependencies

  • Core: numpy, pandas, scipy, scikit-learn
  • CLI: click
  • Dev: pytest, black, flake8, mypy, pre-commit

Quick Start

Here's a minimal example showing the core workflow:

from labelforge import lf, LabelModel, apply_lfs, Example

# Create some example data
examples = [
    Example(text="Patient has type 2 diabetes"),
    Example(text="No diabetic symptoms observed"),
    Example(text="Blood glucose levels elevated"),
    Example(text="Regular checkup, no issues")
]

# Define labeling functions
@lf(name="diabetes_mention")
def has_diabetes_keyword(example):
    """Label examples mentioning diabetes directly."""
    return 1 if "diabetes" in example.text.lower() else 0

@lf(name="diabetes_indicators") 
def has_diabetes_indicators(example):
    """Label examples with diabetes-related terms."""
    indicators = ["glucose", "insulin", "diabetic"]
    text = example.text.lower()
    return 1 if any(term in text for term in indicators) else -1  # abstain if no match

# Apply labeling functions
lf_output = apply_lfs(examples)

# Train label model to combine LF outputs
label_model = LabelModel(cardinality=2)
label_model.fit(lf_output)

# Get probabilistic labels
probs = label_model.predict_proba(lf_output)
predictions = label_model.predict(lf_output)

print(f"Predictions: {predictions}")
print(f"Probabilities shape: {probs.shape}")

For more examples, see the examples/ directory and documentation.

API Reference

Core Classes

@lf decorator

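from labelforge import lf

@lf(name="my_function", tags={"type": "keyword"}, abstain_label=-1)
def my_labeling_function(example):
    """Return a class label, or abstain_label (-1 here) to abstain."""
    # Placeholder logic: replace the keyword test with your own heuristic
    return 1 if "keyword" in example.text.lower() else -1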

LabelModel

# Generative model for learning from labeling functions
model = LabelModel(
    cardinality=2,        # Number of classes
    max_iter=100,         # EM iterations
    tol=1e-4,            # Convergence tolerance
    verbose=True
)
model.fit(lf_output)
probs = model.predict_proba(lf_output)

apply_lfs

# Apply all registered LFs to examples
lf_output = apply_lfs(examples)

# Apply specific LFs
lf_output = apply_lfs(examples, lfs=[lf1, lf2])

Command Line Interface

# View registered labeling functions
labelforge lf-list

# Analyze LF performance and conflicts  
labelforge lf-stats

# Test LFs on sample data
labelforge lf-test --dataset examples/data.json

# Run end-to-end pipeline
labelforge run --input data/ --output results/
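The exact output of `lf-stats` is not described here. As a rough illustration of the kind of diagnostics such a command usually covers, the sketch below computes per-LF coverage, overlap, and conflict from a vote matrix with NumPy. The function `lf_summary`, the abstain value -1, and the sample votes are assumptions for this example, not part of the LabelForge API.

```python
import numpy as np

ABSTAIN = -1  # assumed abstain value

def lf_summary(votes):
    """Per-LF coverage, overlap, and conflict fractions for an (n, m) vote matrix."""
    voted = votes != ABSTAIN            # mask of non-abstain votes
    n, m = votes.shape
    stats = []
    for j in range(m):
        others = np.delete(np.arange(m), j)
        # Overlap: LF j voted and at least one other LF voted on the same example.
        overlap = voted[:, j] & voted[:, others].any(axis=1)
        # Conflict: LF j voted and at least one other LF voted for a different class.
        disagree = voted[:, others] & (votes[:, others] != votes[:, [j]])
        conflict = voted[:, j] & disagree.any(axis=1)
        stats.append({
            "coverage": float(voted[:, j].mean()),
            "overlap": float(overlap.mean()),
            "conflict": float(conflict.mean()),
        })
    return stats

votes = np.array([
    [1, 1, ABSTAIN],
    [0, ABSTAIN, 0],
    [1, 0, 1],
    [ABSTAIN, ABSTAIN, ABSTAIN],
])
stats = lf_summary(votes)
for j, s in enumerate(stats):
    print(j, s)
```

Low coverage suggests a function rarely fires; high conflict suggests it disagrees with the others often and is worth inspecting.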

Research & Citations

LabelForge implements concepts from several research papers in weak supervision:

  1. Snorkel: Ratner et al. "Snorkel: Rapid Training Data Creation with Weak Supervision" (2017)
  2. Data Programming: Ratner et al. "Data Programming: Creating Large Training Sets, Quickly" (2016)
  3. BabbleLabble: Hancock et al. "Training Classifiers with Natural Language Explanations" (2018)

Using LabelForge in Research

If you use LabelForge in academic work, please consider citing:

@software{labelforge2025,
  title={LabelForge: Open-Source Framework for Programmatic Weak Supervision},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/LabelForge}
}

Related Work & Comparisons

  • Snorkel: Original weak supervision framework (Stanford)
  • Wrench: Benchmarking framework for weak supervision
  • cleanlab: Data-centric AI and label quality
  • skweak: Weak supervision for NLP (spaCy ecosystem)

Examples

Medical Text Classification

from labelforge import lf, load_example_data

# Load medical dataset
examples = load_example_data("medical_texts")

@lf(name="diabetes_keywords")
def diabetes_mention(example):
    keywords = ["diabetes", "diabetic", "glucose", "insulin"]
    return 1 if any(k in example.text.lower() for k in keywords) else 0

@lf(name="diabetes_medications")
def diabetes_drugs(example):
    drugs = ["metformin", "insulin", "glipizide", "glyburide"]
    return 1 if any(d in example.text.lower() for d in drugs) else 0

# See examples/medical_example.py for complete implementation

Sentiment Analysis

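from labelforge import lf

# Convention for this example: 1 = positive, 0 = negative, -1 = abstain
@lf(name="positive_sentiment")
def sentiment_positive(example):
    positive_words = ["excellent", "amazing", "love", "perfect", "great"]
    # Vote positive on a match; otherwise abstain rather than vote negative
    return 1 if any(word in example.text.lower() for word in positive_words) else -1

@lf(name="negative_sentiment")
def sentiment_negative(example):
    negative_words = ["terrible", "awful", "hate", "worst", "bad"]
    # Vote negative on a match; otherwise abstain
    return 0 if any(word in example.text.lower() for word in negative_words) else -1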

Using External Models

# Example: Using pre-trained models as labeling functions
# `external_model` is a placeholder for any classifier with a scikit-learn
# style predict_proba; it is not provided by LabelForge.
@lf(name="external_classifier")
def external_model_lf(example):
    probs = external_model.predict_proba([example.text])[0]
    # Vote for the most likely class, or abstain if the model is unsure
    return int(probs.argmax()) if probs.max() > 0.8 else -1

More examples available in the examples/ directory.

Architecture & Implementation

Core Components

Labeling Functions (lf.py)

  • Function decorator and registry system
  • Performance tracking and error handling
  • Support for abstention and metadata
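The decorator-plus-registry pattern described above can be sketched in a few lines. This is an illustration of the general technique, not the contents of lf.py: `SketchExample`, `REGISTRY`, `register_lf`, and `apply_registered` are hypothetical names, and turning exceptions into abstentions is an assumed error-handling policy.

```python
import functools
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

ABSTAIN = -1  # assumed abstain value

@dataclass
class SketchExample:
    """Stand-in for an example container: text plus free-form metadata."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

# Hypothetical registry mapping LF names to wrapped functions.
REGISTRY: Dict[str, Callable[[SketchExample], int]] = {}

def register_lf(name: str, abstain_label: int = ABSTAIN):
    """Decorator that registers a function and converts errors into abstentions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(example):
            try:
                return func(example)
            except Exception:
                # A failing LF abstains instead of crashing the whole run.
                return abstain_label
        REGISTRY[name] = wrapper
        return wrapper
    return decorator

def apply_registered(examples: List[SketchExample]) -> List[List[int]]:
    """Return one row of votes per example, one column per registered LF."""
    return [[fn(ex) for fn in REGISTRY.values()] for ex in examples]

@register_lf(name="mentions_glucose")
def mentions_glucose(example):
    return 1 if "glucose" in example.text.lower() else ABSTAIN

@register_lf(name="has_icd_code")
def has_icd_code(example):
    # Raises KeyError when the metadata key is missing, which becomes an abstain.
    return 1 if example.metadata["icd"].startswith("E11") else 0

rows = apply_registered([
    SketchExample("Blood glucose levels elevated"),
    SketchExample("Regular checkup", metadata={"icd": "Z00"}),
])
print(rows)  # [[1, -1], [-1, 0]]
```

Keeping the registry at module level is what lets a call with no explicit function list pick up every decorated function, at the cost of global state that tests need to reset.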

Label Model (label_model.py)

  • Generative model P(L, Y) implementation
  • EM algorithm for parameter estimation
  • Handles correlations and class imbalance

Data Structures (types.py)

  • Example: Container for text and metadata
  • LFOutput: Vote matrix with analysis methods
  • Type hints and validation

Algorithm Details

The label model implements a generative approach:

  1. Generative Model: P(L, Y) = P(Y) ∏ P(L_i | Y)
  2. EM Algorithm: Alternates between:
    • E-step: Compute P(Y | L) using current parameters
    • M-step: Update parameters α (accuracies) and π (priors)
  3. Parameter Learning:
    • Accuracy: α_i = P(L_i = Y | Y)
    • Priors: π_c = P(Y = c)
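The steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the code in label_model.py: each LF has a single accuracy α_i, its errors are spread uniformly over the other classes, abstentions (-1) carry no information about Y, and accuracies start at 0.7 to break the symmetry between classes. The function name `em_label_model` and the synthetic data are made up for this example.

```python
import numpy as np

ABSTAIN = -1  # assumed abstain value

def em_label_model(votes, cardinality=2, max_iter=100, tol=1e-4):
    """EM for a conditionally independent label model with one accuracy per LF."""
    n, m = votes.shape
    voted = votes != ABSTAIN
    alpha = np.full(m, 0.7)                       # initial accuracies
    pi = np.full(cardinality, 1.0 / cardinality)  # initial class priors
    # agree[i, j, c] is True when LF j voted class c on example i.
    agree = votes[:, :, None] == np.arange(cardinality)[None, None, :]

    for _ in range(max_iter):
        # E-step: P(Y = c | L) from the current alpha and pi, in log space.
        log_right = np.log(alpha)[None, :, None]
        log_wrong = np.log((1 - alpha) / (cardinality - 1))[None, :, None]
        log_lik = np.where(agree, log_right, log_wrong) * voted[:, :, None]
        log_post = np.log(pi)[None, :] + log_lik.sum(axis=1)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: expected fraction of correct votes per LF, and class priors.
        expected_correct = (agree * post[:, None, :]).sum(axis=(0, 2))
        n_votes = np.maximum(voted.sum(axis=0), 1)
        new_alpha = np.clip(expected_correct / n_votes, 1e-3, 1 - 1e-3)
        new_pi = np.clip(post.mean(axis=0), 1e-3, None)
        new_pi /= new_pi.sum()

        converged = np.abs(new_alpha - alpha).max() < tol
        alpha, pi = new_alpha, new_pi
        if converged:
            break
    return alpha, pi, post

# Synthetic check: three LFs with known accuracies and 20% abstentions.
rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)
true_acc = np.array([0.9, 0.8, 0.7])
correct = rng.random((n, 3)) < true_acc
votes = np.where(correct, y[:, None], 1 - y[:, None])
votes[rng.random(votes.shape) < 0.2] = ABSTAIN

alpha, pi, post = em_label_model(votes)
print("estimated accuracies:", alpha.round(2))
print("estimated priors:", pi.round(2))
```

Because the factorization assumes the LFs are independent given Y, this sketch does not model correlations between labeling functions; strongly correlated functions would be over-counted.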

Performance & Benchmarks

Computational Complexity

  • LF Application: O(n × m) where n=examples, m=functions
  • EM Training: O(n × m × c × k) where c=classes, k=iterations
  • Memory: O(n × m) for vote matrix storage

Typical Performance

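| Dataset Size  | Functions | Training Time | Memory Usage |
|---------------|-----------|---------------|--------------|
| 1K examples   | 5 LFs     | < 1s          | ~10MB        |
| 10K examples  | 10 LFs    | ~5s           | ~50MB        |
| 100K examples | 20 LFs    | ~30s          | ~200MB       |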

Performance scales linearly with dataset size and number of functions.


**Professional Support**
- 24/7 technical support
- Custom feature development
- On-site training and consultation
- SLA guarantees for uptime and performance

Contact [enterprise@labelforge.ai](mailto:enterprise@labelforge.ai) for enterprise licensing and support.

---

## Roadmap

### Current Status (v0.1.0)
- ✅ Core labeling function framework
- ✅ Probabilistic label model with EM algorithm  
- ✅ Command-line interface
- ✅ Basic analytics and visualization
- ✅ Example datasets and documentation

### Version 1.0 (Target: Q3 2025)
- 🚧 Web-based user interface
- 🚧 Advanced model diagnostics
- 🚧 Integration with popular ML frameworks
- 🚧 Comprehensive documentation and tutorials
- 🚧 Performance optimizations

### Version 1.1 (Target: Q4 2025)
- 📋 Discriminative model training pipeline
- 📋 Advanced conflict resolution algorithms
- 📋 Real-time monitoring dashboard
- 📋 Plugin system for extensibility

### Version 2.0 (Target: Q1 2026)
- 📋 LLM-enhanced labeling function generation
- 📋 Active learning integration
- 📋 Multi-modal data support (images, audio)
- 📋 Distributed computing support
- 📋 Enterprise features and support

### Future Releases
- 📋 AutoML integration
- 📋 Real-time streaming data support
- 📋 Advanced visualization and explainability
- 📋 Cloud deployment and scaling

---

## Contributing

LabelForge is an open-source project that welcomes contributions from researchers, students, and practitioners. We aim to build a collaborative tool for the research community.

### Ways to Contribute

- **Research**: Implement new algorithms, improve existing methods
- **Documentation**: Tutorials, examples, use case studies
- **Bug Reports**: Help us identify and fix issues
- **Feature Requests**: Suggest new functionality for the research community
- **Examples**: Contribute domain-specific examples and datasets

### Development Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/LabelForge.git
cd LabelForge

# Create development environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"

# Install pre-commit hooks (optional)
pre-commit install

# Run tests
pytest tests/ -v

# Run code quality checks
black src/ tests/
flake8 src/ tests/
mypy src/
```

Contribution Process

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-algorithm)
  3. Add tests for your changes
  4. Ensure all tests pass (pytest tests/)
  5. Submit a pull request with a clear description

See CONTRIBUTING.md for detailed guidelines.

Community

Getting Help

  • 📖 Documentation: Browse the docs/ directory
  • 🐛 Issues: Report bugs and request features
  • 💬 Discussions: Share use cases and ask questions
  • 📧 Contact: Reach out for research collaborations

Research Community

LabelForge is designed for:

  • Academic researchers studying weak supervision
  • NLP practitioners needing labeled data
  • Data scientists working with limited labeled datasets
  • Students learning about machine learning and data programming

Open Source Ecosystem

We aim to be a good citizen in the open-source ML ecosystem:

  • Interoperability: Works with scikit-learn, pandas, numpy
  • Standards: Follows Python packaging and typing standards
  • Testing: Comprehensive test suite with CI/CD
  • Documentation: Clear docs for users and contributors

License

LabelForge is released under the Apache 2.0 License, enabling both research and commercial use.

Copyright 2025 LabelForge Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

See LICENSE for the full license text.

Acknowledgments

Research Foundation
LabelForge builds upon foundational research in weak supervision:

  • Ratner et al. "Snorkel: Rapid Training Data Creation with Weak Supervision"
  • Bach et al. "Learning the Structure of Generative Models without Labeled Data"
  • Varma & Ré "Snuba: Automating Weak Supervision to Label Training Data"

Inspiration

  • Snorkel: Original weak supervision framework
  • Wrench: Comprehensive benchmarking platform
  • cleanlab: Data-centric AI principles

Contributors
Thanks to all contributors who help build and improve LabelForge. See CONTRIBUTORS.md for the full list.


Built with ❤️ for the research community
