Labor Union Parser
Extract affiliation and local designation from labor union name strings.
Given an input like "SEIU Local 1199", the parser returns:
- Affiliation: SEIU (Service Employees International Union)
- Designation: 1199 (local number)
Installation
pip install -e .
Usage
Command Line
# Basic usage
labor-union-parser "SEIU Local 1199"
# Output:
# Affiliation: SEIU
# Designation: 1199
# JSON output
labor-union-parser "Teamsters Local 705" --json
# Output: {"affiliation": "IBT", "designation": "705"}
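The JSON mode is convenient when calling the CLI from another program. The following is a minimal sketch, assuming the `labor-union-parser` command is on `PATH` and prints a single JSON object as shown above; the helper names `parse_union_json` and `run_cli` are hypothetical and not part of the package.

```python
import json
import subprocess


def parse_union_json(stdout: str) -> dict:
    """Parse one JSON object printed by `labor-union-parser ... --json`."""
    record = json.loads(stdout)
    # Both fields appear as strings in the documented output.
    return {
        "affiliation": record["affiliation"],
        "designation": record["designation"],
    }


def run_cli(name: str) -> dict:
    """Invoke the CLI for one union name (assumes the command is installed)."""
    completed = subprocess.run(
        ["labor-union-parser", name, "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_union_json(completed.stdout)


# Parsing the documented sample output directly, without running the CLI:
sample = parse_union_json('{"affiliation": "IBT", "designation": "705"}')
```

Note that the designation in the documented output is a string, so any numeric comparison needs an explicit conversion.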
Python API
from labor_union_parser import extract
result = extract("UAW Local 600")
print(result)
# {'affiliation': 'UAW', 'designation': '600'}
For batch processing:
from labor_union_parser import Extractor
extractor = Extractor()
results = extractor.extract_batch([
    "SEIU Local 1199",
    "Teamsters Local 705",
    "UAW Local 600",
])
# [{'affiliation': 'SEIU', 'designation': '1199'},
#  {'affiliation': 'IBT', 'designation': '705'},
#  {'affiliation': 'UAW', 'designation': '600'}]
For large datasets, use extract_all, which yields results lazily as a generator:
from labor_union_parser import Extractor
extractor = Extractor()
# Process a large list with a progress bar
# (union_names is any list of union name strings)
for result in extractor.extract_all(union_names, show_progress=True):
    print(result)
# Adjust batch size for memory/speed tradeoff
results = list(extractor.extract_all(union_names, batch_size=512))
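The memory/speed tradeoff comes from how many names are processed per batch. The sketch below illustrates the general chunk-and-yield pattern with the standard library only; it is not the package's implementation, and `iter_batches`, `extract_lazily`, and the stand-in `fake_extract_batch` are hypothetical names used for illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def extract_lazily(
    names: Iterable[str],
    extract_batch: Callable[[List[str]], List[dict]],
    batch_size: int,
) -> Iterator[dict]:
    """Run extract_batch one chunk at a time and yield results as they arrive."""
    for batch in iter_batches(names, batch_size):
        yield from extract_batch(batch)


def fake_extract_batch(batch: List[str]) -> List[dict]:
    """Stand-in for a real batch extractor; it only echoes its input."""
    return [{"text": name} for name in batch]


results = list(extract_lazily(["a", "b", "c"], fake_extract_batch, batch_size=2))
```

Only one batch of inputs is held in memory at a time, so a larger `batch_size` trades memory for fewer calls.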
Training
Training data is in training/data/labeled_data.csv with columns:
- text: Union name string
- aff_abbr: Affiliation abbreviation (e.g., "SEIU", "IBT", "UAW")
- desig_num: Local designation number
- split: One of "train", "val", or "test"
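A file with these columns can be grouped by split using only the standard library. This is a small sketch: the sample rows are made up from the examples in this README (they are not actual rows from labeled_data.csv), the column order is assumed, and `load_by_split` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the real file lives at training/data/labeled_data.csv.
SAMPLE_CSV = """text,aff_abbr,desig_num,split
SEIU Local 1199,SEIU,1199,train
Teamsters Local 705,IBT,705,val
UAW Local 600,UAW,600,test
"""


def load_by_split(handle) -> dict:
    """Group rows by the value of their 'split' column."""
    rows_by_split = defaultdict(list)
    for row in csv.DictReader(handle):
        rows_by_split[row["split"]].append(row)
    return dict(rows_by_split)


splits = load_by_split(io.StringIO(SAMPLE_CSV))
for name, rows in splits.items():
    print(name, len(rows))
```

To read the real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.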
To retrain the model:
pip install -e ".[train]" # Install training dependencies
python training/train.py
The training script will:
- Load data from training/data/labeled_data.csv
- Train for 10 epochs with early stopping based on validation performance
- Save the best model to src/labor_union_parser/weights/char_cnn.pt
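The early-stopping step above can be sketched as a plain loop. This is an illustration of the general technique, not the contents of training/train.py: the `patience` value, the `run_epoch` callback, and the function name are assumptions, and the place where a checkpoint would be written is only marked with a comment.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 10,
    patience: int = 3,  # assumption: the source does not state a patience value
) -> Tuple[int, float]:
    """Return (best_epoch, best_score), stopping once validation stalls.

    run_epoch(epoch) trains for one epoch and returns a validation score
    where higher is better.
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch in range(max_epochs):
        score = run_epoch(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
            # A real script would save the model weights here.
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


# Simulated validation scores: the best one arrives at epoch 1.
fake_scores = [0.90, 0.95, 0.94, 0.93, 0.92, 0.99]
epochs_run = []


def fake_epoch(epoch: int) -> float:
    epochs_run.append(epoch)
    return fake_scores[epoch]


best = train_with_early_stopping(fake_epoch)
```

With these simulated scores the loop stops after epoch 4, so the later, higher score at epoch 5 is never seen; that is the usual cost of a small patience value.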
Training Data Statistics
| Split | Examples |
|---|---|
| Train | 82,900 |
| Val | 2,296 |
| Test | 2,413 |
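The relative size of each split follows directly from the table. The short calculation below uses only the counts listed above.

```python
split_sizes = {"train": 82_900, "val": 2_296, "test": 2_413}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

for name, count in split_sizes.items():
    # Print each split with its count and its share of all labeled examples.
    print(f"{name:<5} {count:>7,} {shares[name]:6.1%}")
print(f"total {total:>7,}")
```

The training split holds roughly 95% of the labeled examples, with the remainder divided almost evenly between validation and test.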
Model Architecture
The model uses a CharCNN architecture with pointer-based designation selection:
Input: "SEIU Local 1199"
                        │
                        ▼
┌────────────────────────────────────────────────┐
│ Tokenizer                                      │
│ ["SEIU", " ", "Local", " ", "1199"]            │
│ token_type: [word, space, word, space, number] │
└────────────────────────────────────────────────┘
                        │
                 ┌──────┴──────┐
                 ▼             ▼
            ┌──────────┐  ┌──────────┐
            │ CharCNN  │  │ Special  │
            │ (words)  │  │ Embed    │
            └────┬─────┘  └────┬─────┘
                 └──────┬──────┘
                        ▼
┌────────────────────────────────────────────────┐
│ Token Embeddings (64-dim)                      │
│ + is_number feature (16-dim)                   │
└────────────────────────────────────────────────┘
                        │
                        ▼
┌────────────────────────────────────────────────┐
│ Self-Attention (4 heads)                       │
└────────────────────────────────────────────────┘
                        │
             ┌──────────┴─────────┐
             ▼                    ▼
        ┌─────────┐        ┌──────────────┐
        │ Set     │        │ BiLSTM       │
        │ Attn    │        │ (512-dim)    │
        │ Pooling │        └──────┬───────┘
        └────┬────┘               │
             │                    ▼
             │          ┌───────────────────┐
             │          │ + Aff Embedding   │
             │          │ Pointer Selection │
             │          └─────────┬─────────┘
             ▼                    ▼
        Affiliation          Designation
           "SEIU"               "1199"
Components
Character CNN (for word tokens)
- Character embedding: 16-dim
- Multi-scale 1D convolutions (kernel sizes 2, 3, 4, 5)
- Max pooling → 64-dim token embedding
- Typo-robust: handles misspellings gracefully
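To make the shapes concrete, here is a dependency-free sketch of multi-scale 1D convolution followed by max pooling over a word's characters. It is a toy with random, untrained weights and is not the package's model code. Two assumptions are made for illustration: each of the four kernel sizes gets 16 filters (so 4 x 16 = 64 output dimensions), and short words are zero-padded to the largest kernel size.

```python
import random

CHAR_DIM = 16                 # character embedding size, as listed above
KERNEL_SIZES = (2, 3, 4, 5)   # kernel sizes, as listed above
FILTERS_PER_KERNEL = 16       # assumption: 4 kernel sizes x 16 filters = 64 dims

rng = random.Random(0)


def random_vector(size):
    return [rng.uniform(-0.1, 0.1) for _ in range(size)]


# One flat weight vector of length k * CHAR_DIM per filter.
filters = {
    k: [random_vector(k * CHAR_DIM) for _ in range(FILTERS_PER_KERNEL)]
    for k in KERNEL_SIZES
}
char_table = {}  # character -> embedding, filled on first use


def embed_chars(word):
    """Look up (or create) an embedding per character, zero-padding short words."""
    vectors = []
    for ch in word:
        if ch not in char_table:
            char_table[ch] = random_vector(CHAR_DIM)
        vectors.append(char_table[ch])
    padding = max(KERNEL_SIZES) - len(vectors)
    return vectors + [[0.0] * CHAR_DIM] * padding


def embed_word(word):
    """Concatenate the max-pooled response of every filter at every kernel size."""
    chars = embed_chars(word)
    features = []
    for k in KERNEL_SIZES:
        # Each window is k consecutive character vectors flattened into one list.
        windows = [sum(chars[i:i + k], []) for i in range(len(chars) - k + 1)]
        for weights in filters[k]:
            responses = [sum(w * x for w, x in zip(weights, win)) for win in windows]
            features.append(max(responses))  # max pooling over positions
    return features


embedding = embed_word("Teamsters")
```

Because each feature is a maximum over character windows, a misspelling such as "Teamstres" still shares most of its windows with the correct spelling, which is the intuition behind the typo robustness mentioned above.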
Special Token Embedding (for non-words)
- Lookup table for numbers, punctuation, spaces
- 64-dim embeddings
Token Features
- is_number: Binary feature indicating numeric tokens
- Combined with token embedding via learned 16-dim feature embedding
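The token types shown in the diagram and the is_number feature can be approximated with a regular expression. This is a sketch under assumptions, not the package's tokenizer: the pattern, the "punct" type name, and the `tokenize` helper are illustrative choices.

```python
import re

# Runs of digits, runs of letters, runs of whitespace, or a single other character.
TOKEN_PATTERN = re.compile(r"\d+|[^\W\d_]+|\s+|[^\w\s]|_")


def tokenize(text):
    """Split text into (token, token_type) pairs."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        piece = match.group()
        if piece.isdigit():
            kind = "number"
        elif piece.isspace():
            kind = "space"
        elif piece.isalpha():
            kind = "word"
        else:
            kind = "punct"  # assumed name for punctuation tokens
        tokens.append((piece, kind))
    return tokens


tokens = tokenize("SEIU Local 1199")
# The binary feature is simply whether the token type is "number".
is_number = [int(kind == "number") for _, kind in tokens]
```

Under this scheme, word tokens would be routed to the character CNN and every other type to the special-token lookup table.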
Self-Attention
- Multi-head attention (4 heads) over token sequence
- Allows tokens to attend to each other for context
Affiliation Classification
- Set attention pooling: learned weighted sum of token representations
- Linear classifier → affiliation label
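Attention pooling reduces a variable-length token sequence to one fixed-size vector. The sketch below shows one common formulation, in which a learned query vector scores each token and a softmax turns the scores into weights; whether the package scores tokens exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, query):
    """Return (pooled_vector, weights) for a sequence of token vectors."""
    # Score each token by its dot product with the (learned) query vector.
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(token_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return pooled, weights


# With a zero query every token gets equal weight, so pooling is a plain mean.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])
```

The pooled vector has the same dimension regardless of how many tokens the name contains, which is what lets a single linear layer classify it.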
Designation Selection (Pointer Network)
- BiLSTM processes contextualized token embeddings
- Concatenates with predicted affiliation embedding
- Scores each numeric token position
- Includes learnable "null" score for no designation
- Selects highest-scoring number token
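The final selection step can be sketched without any model code: given one score per token and a "null" score, pick the best-scoring numeric token unless the null score is at least as high. The scores here are invented for illustration, and resolving ties in favor of "no designation" is an assumption of this sketch.

```python
from typing import List, Optional


def select_designation(
    tokens: List[str], scores: List[float], null_score: float
) -> Optional[str]:
    """Return the highest-scoring numeric token, or None if null wins."""
    best_index = None
    best_score = null_score
    for index, (token, score) in enumerate(zip(tokens, scores)):
        # Only numeric tokens are candidates; other scores are ignored.
        if token.isdigit() and score > best_score:
            best_index, best_score = index, score
    return None if best_index is None else tokens[best_index]


tokens = ["SEIU", " ", "Local", " ", "1199"]
scores = [9.0, 0.0, 0.0, 0.0, 2.0]  # made-up scores

picked = select_designation(tokens, scores, null_score=1.0)
nothing = select_designation(tokens, scores, null_score=5.0)
```

Because the output is a pointer to an input position rather than a generated number, the designation can only ever be a number that actually appears in the name.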
Model Statistics
- Parameters: ~2.5M
- Inference: CPU or MPS (Apple Silicon)
- Model file: ~10MB
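The parameter count and file size are consistent with each other if the weights are stored as 32-bit floats, which is an assumption here rather than something this README states. A quick check:

```python
def checkpoint_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight storage in megabytes (10**6 bytes).

    bytes_per_param=4 assumes float32 weights; file-format overhead is ignored.
    """
    return num_params * bytes_per_param / 1_000_000


size_mb = checkpoint_size_mb(2_500_000)
```

About 2.5M parameters at 4 bytes each comes to about 10 MB, matching the model file size listed above.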
Performance
On the held-out test set (2,413 examples):
- Affiliation accuracy: ~97%
- Designation accuracy: ~98%