Labor Union Parser

Extract affiliation and local designation from labor union name strings.

Given an input like "SEIU Local 1199", the parser returns:

is_union: True (detected as a union)
union_score: 0.999 (confidence score for union detection)
affiliation: SEIU (Service Employees International Union)
affiliation_unrecognized: False (True if affiliation couldn't be matched)
aff_score: 0.997 (confidence score for affiliation)
designation: 1199 (local number)

Installation

pip install labor-union-parser

Usage

Python API

from labor_union_parser import Extractor

extractor = Extractor()
result = extractor.extract("SEIU Local 1199")
print(result)
# {'is_union': True, 'union_score': 0.999, 'affiliation': 'SEIU',
#  'affiliation_unrecognized': False, 'designation': '1199', 'aff_score': 0.997}
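
The flags in the result dict suggest a simple way to route results downstream. The sketch below is only an illustration: `triage` is a hypothetical helper, not part of the package, and it assumes result dicts shaped like the example output above.

```python
def triage(result):
    """Classify a result dict into one of three buckets.

    `result` is assumed to carry the keys shown in the example output;
    this helper is hypothetical and not part of labor_union_parser.
    """
    if not result["is_union"]:
        return "not_union"
    if result["affiliation_unrecognized"]:
        return "union_unknown_affiliation"
    return "union_known_affiliation"


example = {
    "is_union": True, "union_score": 0.999, "affiliation": "SEIU",
    "affiliation_unrecognized": False, "designation": "1199", "aff_score": 0.997,
}
print(triage(example))  # union_known_affiliation
```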

For batch processing, use extract_batch, which processes texts in parallel for better throughput:

from labor_union_parser import Extractor

extractor = Extractor()
results = extractor.extract_batch([
    "SEIU Local 1199",
    "Teamsters Local 705",
    "UAW Local 600",
])
# Returns list of result dicts, one per input text

The batch_size parameter controls how many texts are processed at once (default: 256). Larger batches are faster but use more memory:

# Process 512 texts at a time
results = extractor.extract_batch(texts, batch_size=512)

For very large datasets, combine extract_batch with itertools.batched (available in Python 3.12 and later) to process the input in chunks and avoid loading everything into memory:

import itertools
from labor_union_parser import Extractor

extractor = Extractor()

# Stream through a large file, processing 1000 at a time
with open("union_names.txt") as f:
    for chunk in itertools.batched(f, 1000):
        texts = [line.strip() for line in chunk]
        for result in extractor.extract_batch(texts):
            print(result["affiliation"], result["designation"])
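
Because itertools.batched only exists on Python 3.12 and later, older interpreters need a small stand-in. The helper below is a sketch of one; `batched_compat` is a name chosen for this example and is not part of the package.

```python
import itertools


def batched_compat(iterable, n):
    """Yield tuples of up to n items, like itertools.batched on Python 3.12+."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls at most n items; an empty tuple means the input is exhausted.
    while chunk := tuple(itertools.islice(iterator, n)):
        yield chunk


print(list(batched_compat(["a", "b", "c", "d", "e"], 2)))
# [('a', 'b'), ('c', 'd'), ('e',)]
```

It can replace itertools.batched in the loop above without other changes, since both yield tuples of lines.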

Filing Number Lookup

Look up OLMS filing numbers for a given affiliation and designation:

from labor_union_parser import lookup_fnum

fnums = lookup_fnum("SEIU", "1199")
# [31847, 69557, 508557, ...]
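
One possible way to combine extraction results with the lookup is sketched below. `with_fnums` is a hypothetical helper, and the lookup is passed in as a callable so the example runs without the package installed; in real use you would pass lookup_fnum. Skipping the call when the affiliation or designation is missing is an assumption of this sketch, since the behavior of lookup_fnum for missing values is not described here.

```python
def with_fnums(result, lookup):
    """Return a copy of `result` with an added 'fnums' list.

    `lookup` is any callable taking (affiliation, designation), such as
    lookup_fnum. Skipping the call when either value is missing is an
    assumption of this sketch, not documented behavior.
    """
    enriched = dict(result)
    if result.get("affiliation") and result.get("designation"):
        enriched["fnums"] = list(lookup(result["affiliation"], result["designation"]))
    else:
        enriched["fnums"] = []
    return enriched


# Stand-in lookup so the example runs without the package installed.
def fake_lookup(affiliation, designation):
    return [31847] if (affiliation, designation) == ("SEIU", "1199") else []


print(with_fnums({"affiliation": "SEIU", "designation": "1199"}, fake_lookup))
# {'affiliation': 'SEIU', 'designation': '1199', 'fnums': [31847]}
```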

Command Line

# Process CSV file
labor-union-parser unions.csv -c union_name -o results.csv

# Process from stdin
echo "SEIU Local 1199" | labor-union-parser --no-header
# text,pred_is_union,pred_aff,pred_unknown,pred_desig,pred_union_score,pred_fnum,pred_fnum_multiple
# SEIU Local 1199,True,SEIU,False,1199,0.9992,"[31847, 69557, ...]",True
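
The CSV written by the command line tool can be read back with the standard library. The sketch below assumes the column names shown in the stdin example and assumes that pred_fnum holds a JSON-compatible list; `read_results` is a hypothetical helper, and the sample row shortens the elided list to two values.

```python
import csv
import io
import json

# Sample shaped like the stdin example above, with the elided list shortened.
sample = (
    "text,pred_is_union,pred_aff,pred_unknown,pred_desig,"
    "pred_union_score,pred_fnum,pred_fnum_multiple\n"
    'SEIU Local 1199,True,SEIU,False,1199,0.9992,"[31847, 69557]",True\n'
)


def read_results(handle):
    """Parse CLI output rows, converting string columns to Python types."""
    rows = []
    for row in csv.DictReader(handle):
        row["pred_is_union"] = row["pred_is_union"] == "True"
        row["pred_union_score"] = float(row["pred_union_score"])
        # Assumes pred_fnum is a JSON-compatible list; empty cells become [].
        row["pred_fnum"] = json.loads(row["pred_fnum"]) if row["pred_fnum"] else []
        rows.append(row)
    return rows


rows = read_results(io.StringIO(sample))
print(rows[0]["pred_aff"], rows[0]["pred_fnum"])  # SEIU [31847, 69557]
```

For a real file, pass `open("results.csv", newline="")` instead of the in-memory sample.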

Output Fields

Field	Description
`is_union`	Whether the text is detected as a union name
`union_score`	Similarity score to union centroid (0-1)
`affiliation`	Predicted affiliation abbreviation (e.g., "SEIU", "IBT") or `None`
`affiliation_unrecognized`	`True` if detected as union but affiliation unrecognized
`designation`	Extracted local number (e.g., "1199") or empty string
`aff_score`	Similarity to nearest affiliation centroid (higher = more confident)

Training

Training data is in training/data/labeled_data.csv with columns:

text: Union name string
aff_abbr: Affiliation abbreviation (e.g., "SEIU", "IBT", "UAW")
desig_num: Local designation number
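
Before retraining, it can help to check how examples are distributed across affiliations. The sketch below counts rows per aff_abbr using only the standard library; the sample rows are illustrative and are not taken from labeled_data.csv, and `count_affiliations` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; not taken from labeled_data.csv.
sample = (
    "text,aff_abbr,desig_num\n"
    "SEIU Local 1199,SEIU,1199\n"
    "Teamsters Local 705,IBT,705\n"
    "UAW Local 600,UAW,600\n"
    "SEIU Local 1,SEIU,1\n"
)


def count_affiliations(handle):
    """Count training rows per affiliation abbreviation."""
    return Counter(row["aff_abbr"] for row in csv.DictReader(handle))


counts = count_affiliations(io.StringIO(sample))
print(counts.most_common(1))  # [('SEIU', 2)]
```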

To retrain the model:

pip install -e ".[train]"  # Install training dependencies
python -m training.train              # Train all stages
python -m training.train --stage 1    # Train only union detector
python -m training.train --stage 2    # Train only affiliation classifier
python -m training.train --stage 3    # Train only designation extractor

Model Architecture

The model uses a three-stage contrastive extraction pipeline:

Input: "SEIU Local 1199"
              │
              ▼
┌───────────────────────────────────────────────────┐
│  Tokenizer                                        │
│  tokens: ["SEIU", " ", "Local", " ", "1199"]      │
│  token_type: [word, space, word, space, number]   │
└───────────────────────────────────────────────────┘
              │
              ▼
┌───────────────────────────────────────────────────┐
│  CharCNN (shared across stages)                   │
│                                                   │
│  For each token: chars → char embeddings →        │
│  parallel CNNs (1,2,3-grams) → max pool →         │
│  highway layer → 64-dim token embedding           │
│                                                   │
│  Typo-robust: "SEIU" ≈ "SIEU" ≈ "S.E.I.U."        │
└───────────────────────────────────────────────────┘
              │
              ▼
┌───────────────────────────────────────────────────┐
│  Stage 1: Union Detection (Contrastive)           │
│                                                   │
│  Token embeddings + is_number embedding →         │
│  Cross-attention (learned query) → Projection →   │
│  Similarity to union centroid                     │
│                                                   │
│  score = 0.999 → is_union = True                  │
└───────────────────────────────────────────────────┘
              │
              ▼ (if is_union)
┌───────────────────────────────────────────────────┐
│  Stage 2: Affiliation (Nearest Centroid)          │
│                                                   │
│  Token embeddings + is_number embedding →         │
│  Cross-attention (learned query) → Projection →   │
│  Similarity to affiliation centroids              │
│                                                   │
│  Nearest: SEIU (score = 0.997)                    │
└───────────────────────────────────────────────────┘
              │
              ▼
┌───────────────────────────────────────────────────┐
│  Stage 3: Designation (Pointer Network)           │
│                                                   │
│  Token embeddings + Transformer encoder →         │
│  BiLSTM + affiliation embedding → pointer scores  │
│                                                   │
│  Points to: "1199"                                │
└───────────────────────────────────────────────────┘
              │
              ▼
Output: {is_union: True, union_score: 0.999, affiliation: "SEIU",
         affiliation_unrecognized: False, aff_score: 0.997, designation: "1199"}
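
The tokenizer step at the top of the diagram can be approximated with a regular expression. This is a minimal sketch that reproduces the tokens and types shown for "SEIU Local 1199"; the package's actual tokenization rules are not given here, and the `punct` type is an assumption based on the mention of punctuation in the CharCNN and Stage 3 descriptions.

```python
import re

# One named alternative per token type; the four character classes do not overlap.
TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z]+)|(?P<number>\d+)|(?P<space>\s+)|(?P<punct>[^A-Za-z\d\s])"
)


def tokenize(text):
    """Split text into tokens and a parallel list of token types."""
    tokens, types = [], []
    for match in TOKEN_RE.finditer(text):
        tokens.append(match.group())
        types.append(match.lastgroup)  # name of the alternative that matched
    return tokens, types


tokens, types = tokenize("SEIU Local 1199")
print(tokens)  # ['SEIU', ' ', 'Local', ' ', '1199']
print(types)   # ['word', 'space', 'word', 'space', 'number']
```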

CharCNN

Character-level CNN that computes token embeddings, shared across all stages.

Character embedding: 16-dim lookup for ~50 chars (letters, digits, punctuation)
Parallel CNNs: 1-gram (32 filters), 2-gram (64 filters), 3-gram (128 filters)
Pooling: Max-pool over character dimension → 224-dim
Highway layer: Gated transformation for non-linearity
Projection: Linear layer → 64-dim token embedding
Typo-robust: Similar spellings produce similar embeddings
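
The dimensions listed above fit together as simple sums. The check below assumes that the outputs of the three parallel CNNs are concatenated after pooling, and that the 8-dim is_number embedding used by stages 1 and 2 is concatenated with the token embedding; the latter is inferred from the 72-dim MLP input in the stage descriptions rather than stated outright.

```python
# Filter counts per n-gram width, as listed above.
filters = {1: 32, 2: 64, 3: 128}

pooled_dim = sum(filters.values())  # one max-pooled value per filter
token_dim = 64                      # after the final linear projection
is_number_dim = 8                   # extra embedding used by stages 1 and 2
mlp_input_dim = token_dim + is_number_dim

print(pooled_dim, mlp_input_dim)  # 224 72
```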

Stage 1: Union Detection

Contrastive learning to distinguish union names from non-union text.

Input: CharCNN token embeddings + is_number embedding (8-dim)
Cross-attention: Learned query attends over token sequence
Projection: 2-layer MLP (72 → 128 → 64) with L2 normalization
Training: One-class contrastive loss (union examples form positive pairs)
Inference: Cosine similarity to learned union centroid
Threshold: Similarity ≥ 0.5 → is_union = True
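
The inference and threshold steps amount to a cosine similarity check. The sketch below uses toy 2-dim vectors in place of the 64-dim projected embeddings; `l2_normalize` and `is_union` are names chosen for this example, not the package's internals.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def is_union(embedding, centroid, threshold=0.5):
    """Return (flag, score): cosine similarity to the centroid vs. threshold."""
    a, b = l2_normalize(embedding), l2_normalize(centroid)
    score = sum(x * y for x, y in zip(a, b))
    return score >= threshold, score


centroid = [1.0, 0.0]  # toy 2-dim stand-in for the learned 64-dim centroid
print(is_union([0.9, 0.1], centroid)[0])  # True
print(is_union([0.0, 1.0], centroid)[0])  # False
```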

Stage 2: Affiliation Classification

Nearest-centroid classification in contrastive embedding space.

Input: CharCNN token embeddings + is_number embedding (8-dim)
Cross-attention: Learned query attends over token sequence
Projection: 2-layer MLP (72 → 128 → 64) with L2 normalization
Training: Supervised contrastive loss (same-affiliation = positive pairs)
Inference: Cosine similarity to each affiliation centroid
Threshold: Best score < 0.80 → affiliation_unrecognized = True
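
Nearest-centroid classification with the 0.80 cutoff can be sketched as follows. The centroids are toy unit vectors, `classify_affiliation` is a hypothetical function, and returning `None` as the affiliation below the threshold follows the Output Fields description rather than the package's source.

```python
def classify_affiliation(embedding, centroids, threshold=0.80):
    """Pick the most similar centroid; report None below the threshold.

    Vectors are assumed to be L2-normalized already, so the dot product
    equals cosine similarity.
    """
    scores = {
        name: sum(x * y for x, y in zip(embedding, centroid))
        for name, centroid in centroids.items()
    }
    best = max(scores, key=scores.get)
    recognized = scores[best] >= threshold
    return {
        "affiliation": best if recognized else None,
        "affiliation_unrecognized": not recognized,
        "aff_score": scores[best],
    }


# Toy unit-length centroids; the real ones are 64-dim and learned.
centroids = {"SEIU": [1.0, 0.0], "IBT": [0.0, 1.0]}
print(classify_affiliation([0.28, 0.96], centroids)["affiliation"])      # IBT
print(classify_affiliation([0.7071, 0.7071], centroids)["affiliation"])  # None
```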

Stage 3: Designation Extraction

Pointer network that selects the correct local number token.

Input: CharCNN token embeddings + special token embeddings (numbers, punct)
Context: Transformer encoder (3 layers, 4 heads)
Selection: BiLSTM + affiliation embedding → score each number token
Output: Highest-scoring number token, or empty if no designation
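
The final selection step can be illustrated without the network itself. In the sketch below, `scores` stands in for the pointer scores the model would produce, and the numbers are made up. Treating "no number tokens" as the only empty case is a simplification, since the text above does not say how the model decides that a designation is absent.

```python
def select_designation(tokens, token_types, scores):
    """Return the highest-scoring number token, or "" if there is none.

    `scores` stands in for the pointer network's per-token scores.
    """
    candidates = [i for i, kind in enumerate(token_types) if kind == "number"]
    if not candidates:
        return ""
    best = max(candidates, key=lambda i: scores[i])
    return tokens[best]


tokens = ["SEIU", " ", "Local", " ", "1199", " ", "District", " ", "5"]
types = ["word", "space", "word", "space", "number",
         "space", "word", "space", "number"]
scores = [0.0, 0.0, 0.0, 0.0, 2.3, 0.0, 0.0, 0.0, 0.4]  # made-up scores
print(select_designation(tokens, types, scores))  # 1199
```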

Performance

On labeled data (94,308 examples with known affiliations):

Metric	All	Non-None Predictions
Affiliation accuracy	99.0%	99.7%
Joint accuracy	98.9%	99.5%

Designation accuracy: 99.9%
Only 0.7% of predictions return None (unrecognized affiliation)
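
The two affiliation accuracy columns are consistent with each other if a None prediction is counted as wrong. That is an assumption about how the table was computed, though a plausible one, since every example in this set has a known affiliation. A quick check:

```python
none_rate = 0.007     # share of predictions that return None
acc_non_none = 0.997  # affiliation accuracy on non-None predictions

# Assumes every None prediction counts as an error on this labeled set.
acc_all = (1 - none_rate) * acc_non_none
print(f"{acc_all:.1%}")  # 99.0%
```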

Release history

0.2.0: Jan 12, 2026 (this version)
0.1.0: Jan 3, 2026

