Labor Union Parser
Extract affiliation and local designation from labor union name strings.
Given an input like "SEIU Local 1199", the parser returns:
- is_union: True (detected as a union)
- union_score: 0.999 (confidence score for union detection)
- affiliation: SEIU (Service Employees International Union)
- affiliation_unrecognized: False (True if affiliation couldn't be matched)
- aff_score: 0.997 (confidence score for affiliation)
- designation: 1199 (local number)
Installation
pip install labor-union-parser
Usage
Python API
from labor_union_parser import Extractor
extractor = Extractor()
result = extractor.extract("SEIU Local 1199")
print(result)
# {'is_union': True, 'union_score': 0.999, 'affiliation': 'SEIU',
# 'affiliation_unrecognized': False, 'designation': '1199', 'aff_score': 0.997}
For batch processing, use extract_batch, which processes texts in parallel for better throughput:
from labor_union_parser import Extractor
extractor = Extractor()
results = extractor.extract_batch([
    "SEIU Local 1199",
    "Teamsters Local 705",
    "UAW Local 600",
])
# Returns list of result dicts, one per input text
The batch_size parameter controls how many texts are processed at once (default: 256). Larger batches are faster but use more memory:
# Process 512 texts at a time
results = extractor.extract_batch(texts, batch_size=512)
For very large datasets, combine extract_batch with itertools.batched (available in Python 3.12 and later) to process the input in chunks and avoid loading everything into memory:
import itertools
from labor_union_parser import Extractor
extractor = Extractor()
# Stream through a large file, processing 1000 at a time
with open("union_names.txt") as f:
    for chunk in itertools.batched(f, 1000):
        texts = [line.strip() for line in chunk]
        for result in extractor.extract_batch(texts):
            print(result["affiliation"], result["designation"])
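itertools.batched only exists in Python 3.12 and later. On older interpreters, a small stand-in built on itertools.islice gives the same chunking behaviour. The sketch below is one way to write it; the helper name batched_compat is hypothetical and not part of labor-union-parser.

```python
import itertools
from collections.abc import Iterable, Iterator


def batched_compat(iterable: Iterable[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield successive tuples of up to n items (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls at most n items per pass; an empty tuple means exhaustion.
    while chunk := tuple(itertools.islice(iterator, n)):
        yield chunk


# Demo on an in-memory list instead of a file.
names = ["SEIU Local 1199", "Teamsters Local 705", "UAW Local 600"]
chunks = list(batched_compat(names, 2))
```

Each chunk can then be passed to extract_batch exactly as in the streaming example above.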
Filing Number Lookup
Look up OLMS filing numbers for a given affiliation and designation:
from labor_union_parser import lookup_fnum
fnums = lookup_fnum("SEIU", "1199")
# [31847, 69557, 508557, ...]
Command Line
# Process CSV file
labor-union-parser unions.csv -c union_name -o results.csv
# Process from stdin
echo "SEIU Local 1199" | labor-union-parser --no-header
# text,pred_is_union,pred_aff,pred_unknown,pred_desig,pred_union_score,pred_fnum,pred_fnum_multiple
# SEIU Local 1199,True,SEIU,False,1199,0.9992,"[31847, 69557, ...]",True
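The CSV output can be post-processed with the standard library. The sketch below parses the sample output shown above and converts the boolean and score columns to Python types. It assumes booleans are written as the literal strings True and False, as in the sample row; the function name read_predictions is hypothetical. The pred_fnum column is left as a string because the sample value is truncated.

```python
import csv
import io

# Sample CLI output copied from the example above (header plus one row).
cli_output = (
    "text,pred_is_union,pred_aff,pred_unknown,pred_desig,"
    "pred_union_score,pred_fnum,pred_fnum_multiple\n"
    'SEIU Local 1199,True,SEIU,False,1199,0.9992,"[31847, 69557, ...]",True\n'
)


def read_predictions(stream):
    """Parse CLI output rows, converting the boolean and score columns."""
    rows = []
    for row in csv.DictReader(stream):
        # Assumption: booleans are serialized as "True" / "False".
        row["pred_is_union"] = row["pred_is_union"] == "True"
        row["pred_unknown"] = row["pred_unknown"] == "True"
        row["pred_fnum_multiple"] = row["pred_fnum_multiple"] == "True"
        row["pred_union_score"] = float(row["pred_union_score"])
        rows.append(row)
    return rows


rows = read_predictions(io.StringIO(cli_output))
# Names whose affiliation and designation map to more than one filing number.
ambiguous = [r["text"] for r in rows if r["pred_fnum_multiple"]]
```

For a real run, replace the StringIO object with open("results.csv", newline="").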
Output Fields
| Field | Description |
|---|---|
| is_union | Whether the text is detected as a union name |
| union_score | Similarity score to union centroid (0-1) |
| affiliation | Predicted affiliation abbreviation (e.g., "SEIU", "IBT") or None |
| affiliation_unrecognized | True if detected as union but affiliation unrecognized |
| designation | Extracted local number (e.g., "1199") or empty string |
| aff_score | Similarity to nearest affiliation centroid (higher = more confident) |
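For downstream code it can help to write the result shape down as a type. The sketch below mirrors the fields listed above in a TypedDict and adds a small triage function. Both ExtractionResult and describe are hypothetical names, not part of the package, and the sketch assumes all six keys are present in every result.

```python
from typing import Optional, TypedDict


class ExtractionResult(TypedDict):
    """Hypothetical type mirroring the documented output fields."""
    is_union: bool
    union_score: float
    affiliation: Optional[str]
    affiliation_unrecognized: bool
    designation: str
    aff_score: float


def describe(result: ExtractionResult) -> str:
    """Turn a result dict into a short label, e.g. for a review queue."""
    if not result["is_union"]:
        return "not a union"
    if result["affiliation_unrecognized"]:
        return "union, affiliation unrecognized"
    local = result["designation"] or "no local number"
    return f"{result['affiliation']} / {local}"


# Values taken from the "SEIU Local 1199" example.
sample: ExtractionResult = {
    "is_union": True, "union_score": 0.999, "affiliation": "SEIU",
    "affiliation_unrecognized": False, "designation": "1199", "aff_score": 0.997,
}
label = describe(sample)
```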
Training
Training data is in training/data/labeled_data.csv with columns:
- text: Union name string
- aff_abbr: Affiliation abbreviation (e.g., "SEIU", "IBT", "UAW")
- desig_num: Local designation number
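A quick way to inspect the label distribution before retraining is to count rows per affiliation. The sketch below assumes the file is comma-delimited with a header row naming the three columns above. The sample rows are illustrative and not taken from the real file, and affiliation_counts is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Illustrative rows in the documented column layout; not from the real file.
sample_csv = (
    "text,aff_abbr,desig_num\n"
    "SEIU Local 1199,SEIU,1199\n"
    "Teamsters Local 705,IBT,705\n"
    "UAW Local 600,UAW,600\n"
)


def affiliation_counts(stream) -> Counter:
    """Count labeled examples per affiliation abbreviation."""
    return Counter(row["aff_abbr"] for row in csv.DictReader(stream))


counts = affiliation_counts(io.StringIO(sample_csv))
```

Against the real data, pass open("training/data/labeled_data.csv", newline="") instead of the StringIO object.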
To retrain the model:
pip install -e ".[train]" # Install training dependencies
python -m training.train # Train all stages
python -m training.train --stage 1 # Train only union detector
python -m training.train --stage 2 # Train only affiliation classifier
python -m training.train --stage 3 # Train only designation extractor
Model Architecture
The model uses a three-stage contrastive extraction pipeline:
Input: "SEIU Local 1199"
│
▼
┌───────────────────────────────────────────────────┐
│ Tokenizer │
│ tokens: ["SEIU", " ", "Local", " ", "1199"] │
│ token_type: [word, space, word, space, number] │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ CharCNN (shared across stages) │
│ │
│ For each token: chars → char embeddings → │
│ parallel CNNs (1,2,3-grams) → max pool → │
│ highway layer → 64-dim token embedding │
│ │
│ Typo-robust: "SEIU" ≈ "SIEU" ≈ "S.E.I.U." │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ Stage 1: Union Detection (Contrastive) │
│ │
│ Token embeddings + is_number embedding → │
│ Cross-attention (learned query) → Projection → │
│ Similarity to union centroid │
│ │
│ score = 0.999 → is_union = True │
└───────────────────────────────────────────────────┘
│
▼ (if is_union)
┌───────────────────────────────────────────────────┐
│ Stage 2: Affiliation (Nearest Centroid) │
│ │
│ Token embeddings + is_number embedding → │
│ Cross-attention (learned query) → Projection → │
│ Similarity to affiliation centroids │
│ │
│ Nearest: SEIU (score = 0.997) │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ Stage 3: Designation (Pointer Network) │
│ │
│ Token embeddings + Transformer encoder → │
│ BiLSTM + affiliation embedding → pointer scores │
│ │
│ Points to: "1199" │
└───────────────────────────────────────────────────┘
│
▼
Output: {is_union: True, union_score: 0.999, affiliation: "SEIU",
affiliation_unrecognized: False, aff_score: 0.997, designation: "1199"}
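The tokenizer step in the diagram can be approximated with a single regular expression. This is a minimal sketch that reproduces the token and token_type lists shown for "SEIU Local 1199"; the package's actual tokenizer is not described here, and the "punct" type name is an assumption.

```python
import re

# One named alternative per token type; the four character classes do not overlap,
# so every character of the input lands in exactly one token.
TOKEN_RE = re.compile(
    r"(?P<number>\d+)|(?P<word>[A-Za-z]+)|(?P<space>\s+)|(?P<punct>[^A-Za-z\d\s])"
)


def tokenize(text: str) -> tuple[list[str], list[str]]:
    """Split text into tokens and a parallel list of token types."""
    tokens, types = [], []
    for match in TOKEN_RE.finditer(text):
        tokens.append(match.group())
        # lastgroup is the name of the alternative that matched.
        types.append(match.lastgroup)
    return tokens, types


tokens, token_type = tokenize("SEIU Local 1199")
```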
CharCNN
Character-level CNN that computes token embeddings, shared across all stages.
- Character embedding: 16-dim lookup for ~50 chars (letters, digits, punctuation)
- Parallel CNNs: 1-gram (32 filters), 2-gram (64 filters), 3-gram (128 filters)
- Pooling: Max-pool over character dimension → 224-dim
- Highway layer: Gated transformation for non-linearity
- Projection: Linear layer → 64-dim token embedding
- Typo-robust: Similar spellings produce similar embeddings
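To make the shapes concrete, the sketch below walks one token through the embedding, parallel n-gram filter banks, and max-pooling steps in plain Python. The weights are random and untrained, the alphabet is an illustrative subset, and lowercasing and zero-padding of short tokens are assumptions. The highway layer and the final projection to 64 dimensions are omitted.

```python
import random

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .-&'"  # illustrative subset
EMB_DIM = 16
FILTERS = {1: 32, 2: 64, 3: 128}  # n-gram width -> number of filters


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Untrained random parameters, only to make the shapes concrete.
char_emb = {ch: [random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)] for ch in ALPHABET}
unk_emb = [0.0] * EMB_DIM
conv = {width: rand_matrix(n, width * EMB_DIM) for width, n in FILTERS.items()}


def pooled_features(token):
    """Embed characters, apply each n-gram filter bank, max-pool over positions."""
    chars = [char_emb.get(ch, unk_emb) for ch in token.lower()]
    features = []
    for width, bank in conv.items():
        # Assumption: pad short tokens with zeros so each width sees one window.
        padded = chars + [[0.0] * EMB_DIM] * max(0, width - len(chars))
        windows = [
            [x for vec in padded[i:i + width] for x in vec]
            for i in range(len(padded) - width + 1)
        ]
        for filt in bank:
            responses = [sum(w * x for w, x in zip(filt, win)) for win in windows]
            features.append(max(responses))  # max over character positions
    return features


vector = pooled_features("SEIU")
```

The pooled vector has 32 + 64 + 128 = 224 entries regardless of token length, which is what lets tokens of different lengths share one downstream projection.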
Stage 1: Union Detection
Contrastive learning to distinguish union names from non-union text.
- Input: CharCNN token embeddings + is_number embedding (8-dim)
- Cross-attention: Learned query attends over token sequence
- Projection: 2-layer MLP (72 → 128 → 64) with L2 normalization
- Training: One-class contrastive loss (union examples form positive pairs)
- Inference: Cosine similarity to learned union centroid
- Threshold: Similarity ≥ 0.5 → is_union = True
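The inference step reduces to a cosine similarity and a threshold. The sketch below shows that decision on toy 3-dimensional vectors in place of the 64-dimensional embeddings. How the package maps the similarity onto the 0-1 union_score is not stated here, so the raw cosine value is used; the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def is_union(embedding, centroid, threshold=0.5):
    """Return (decision, score) using the documented 0.5 cutoff."""
    score = cosine_similarity(embedding, centroid)
    return score >= threshold, score


# Toy 3-dim vectors; the real embeddings are 64-dim.
centroid = [1.0, 0.0, 0.0]
close, close_score = is_union([0.9, 0.1, 0.0], centroid)
far, far_score = is_union([0.0, 1.0, 0.0], centroid)
```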
Stage 2: Affiliation Classification
Nearest-centroid classification in contrastive embedding space.
- Input: CharCNN token embeddings + is_number embedding (8-dim)
- Cross-attention: Learned query attends over token sequence
- Projection: 2-layer MLP (72 → 128 → 64) with L2 normalization
- Training: Supervised contrastive loss (same-affiliation = positive pairs)
- Inference: Cosine similarity to each affiliation centroid
- Threshold: Best score < 0.80 → affiliation_unrecognized = True
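Nearest-centroid classification with a rejection threshold can be sketched in a few lines. The example uses toy 2-dimensional centroids for illustration only, and classify is a hypothetical function. It returns None as the affiliation when the best score falls below 0.80, matching the documented output fields.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def classify(embedding, centroids, threshold=0.80):
    """Pick the nearest centroid; flag the result if the best score is too low."""
    scores = {name: cosine(embedding, c) for name, c in centroids.items()}
    best = max(scores, key=scores.get)
    unrecognized = scores[best] < threshold
    return (None if unrecognized else best), unrecognized, scores[best]


# Toy 2-dim centroids for illustration only.
centroids = {"SEIU": [1.0, 0.0], "IBT": [0.0, 1.0]}
hit = classify([0.95, 0.05], centroids)   # close to the SEIU centroid
miss = classify([0.7, 0.7], centroids)    # equally far from both
```

In the second call the embedding sits halfway between the two centroids, so the best similarity is about 0.71 and the result is flagged as unrecognized.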
Stage 3: Designation Extraction
Pointer network that selects the correct local number token.
- Input: CharCNN token embeddings + special token embeddings (numbers, punct)
- Context: Transformer encoder (3 layers, 4 heads)
- Selection: BiLSTM + affiliation embedding → score each number token
- Output: Highest-scoring number token, or empty if no designation
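The final selection step amounts to an argmax restricted to number tokens. The sketch below assumes pointer scores have already been computed, and the scores shown are hypothetical. It only covers the case where no number token exists; how the model decides that a name containing numbers has no designation is not described here.

```python
def select_designation(tokens, token_types, scores):
    """Return the highest-scoring number token, or "" if there is none."""
    candidates = [
        (score, token)
        for token, kind, score in zip(tokens, token_types, scores)
        if kind == "number"
    ]
    if not candidates:
        return ""
    best_score, best_token = max(candidates, key=lambda pair: pair[0])
    return best_token


tokens = ["SEIU", " ", "Local", " ", "1199"]
token_types = ["word", "space", "word", "space", "number"]
scores = [0.1, 0.0, 0.3, 0.0, 2.4]  # hypothetical pointer scores
designation = select_designation(tokens, token_types, scores)
```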
Performance
On labeled data (94,308 examples with known affiliations):
| Metric | All | Non-None Predictions |
|---|---|---|
| Affiliation accuracy | 99.0% | 99.7% |
| Joint accuracy | 98.9% | 99.5% |
- Designation accuracy: 99.9%
- Only 0.7% of predictions return None (unrecognized affiliation)
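The two table columns can be related with a quick calculation. It assumes a None prediction counts as an error in the "All" column, which seems reasonable since every labeled example has a known affiliation, but that is an inference and not something stated above.

```python
none_rate = 0.007       # share of predictions that return None
acc_non_none = 0.997    # affiliation accuracy on non-None predictions
joint_non_none = 0.995  # joint accuracy on non-None predictions

# Assumption: every None prediction is scored as wrong in the "All" column.
acc_all = acc_non_none * (1 - none_rate)
joint_all = joint_non_none * (1 - none_rate)
```

Under that assumption the results are roughly 0.990 and 0.988, within rounding of the reported 99.0% and 98.9%.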