Extract affiliation and local designation from labor union names

Labor Union Parser

Match labor union name text to Office of Labor-Management Standards filing numbers.

Installation

pip install labor-union-parser

Usage

Python API

from labor_union_parser import Extractor

extractor = Extractor()
result = extractor.extract("SEIU Local 1199")
print(result)
# {'f_num': 31847,
#  'f_num_score': 0.9500725865364075,
#  'is_union': True,
#  'is_union_score': 0.9268560409545898,
#  'union_name': 'SERVICE EMPLOYEES',
#  'union_name_score': 0.9972871541976929}

For batch processing, use extract_batch, which processes texts in parallel for better throughput:

from labor_union_parser import Extractor

extractor = Extractor()
results = extractor.extract_batch([
    "SEIU Local 1199",
    "Teamsters Local 705",
    "UAW Local 600",
])
for result in results:
    print(result)
# {'f_num': 31847,
#  'f_num_score': 0.950072705745697,
#  'is_union': True,
#  'is_union_score': 0.9268560409545898,
#  'union_name': 'SERVICE EMPLOYEES',
#  'union_name_score': 0.9972871541976929}
# {'f_num': 43508,
#  'f_num_score': 0.9926707744598389,
#  'is_union': True,
#  'is_union_score': 0.9246779680252075,
#  'union_name': 'TEAMSTERS',
#  'union_name_score': 0.9981544613838196}
# {'f_num': 13030,
#  'f_num_score': 0.993687093257904,
#  'is_union': True,
#  'is_union_score': 0.8813596367835999,
#  'union_name': 'AUTO WORKERS AFL-CIO',
#  'union_name_score': 0.99698406457901}

The batch_size parameter controls how many texts are processed at once (default: 256). Larger batches are faster but use more memory:

# Process 512 texts at a time
results = extractor.extract_batch(texts, batch_size=512)

For very large datasets, combine extract_batch with itertools.batched (available from Python 3.12) to process the input in chunks rather than loading everything into memory:

import itertools
from labor_union_parser import Extractor

extractor = Extractor()

# Stream through a large file, processing 1000 at a time
with open("union_names.txt") as f:
    for chunk in itertools.batched(f, 1000):
        texts = [line.strip() for line in chunk]
        for result in extractor.extract_batch(texts):
            print(result["f_num"], result["union_name"])
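On interpreters older than Python 3.12, itertools.batched is not available. A small stand-in with the same chunking behaviour can be sketched as follows; the name batched_compat is hypothetical and not part of labor-union-parser.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched_compat(iterable: Iterable[T], n: int) -> Iterator[tuple[T, ...]]:
    """Yield successive tuples of up to n items, like itertools.batched (3.12+)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls at most n items per call; an empty tuple ends the loop.
    while chunk := tuple(islice(iterator, n)):
        yield chunk


# Stand-in for lines read from a file; each chunk would go to extract_batch.
lines = ["SEIU Local 1199\n", "Teamsters Local 705\n", "UAW Local 600\n"]
chunks = [[line.strip() for line in chunk] for chunk in batched_compat(lines, 2)]
print(chunks)
```

Because the helper consumes its input lazily, it can be applied directly to an open file object in place of itertools.batched.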

Command Line

# Process CSV file
labor-union-parser unions.csv -c union_name -o results.csv

# Process from stdin
echo "SEIU Local 1199" | labor-union-parser --no-header
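The CSV command above reads the column named by -c. A minimal sketch of preparing such a file with the standard csv module is shown here; the file and column names mirror the command, while anything else about the input format the CLI accepts is an assumption.

```python
import csv
import tempfile
from pathlib import Path

names = ["SEIU Local 1199", "Teamsters Local 705", "UAW Local 600"]

# Written to a temporary directory here; in practice this would be unions.csv.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "unions.csv"

with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["union_name"])
    writer.writeheader()
    for name in names:
        writer.writerow({"union_name": name})

# Read the file back to confirm the header and the row count.
with csv_path.open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(len(rows), rows[0]["union_name"])
```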

Output Fields

Field	Description
`is_union`	Whether the text is detected as a union name
`is_union_score`	Calibrated probability of being a union (0-1, Platt-scaled)
`union_name`	Predicted parent union name from the shared classification head
`union_name_score`	Softmax probability of the predicted `union_name` (0-1)
`f_num`	OLMS filing number of the best-matching gazetteer record
`f_num_score`	Softmax probability of best gazetteer match (0-1)
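One way to act on these fields is to route each result by its scores. The sketch below is illustrative only: the ExtractionResult type, the triage helper, and the 0.9 cut-off are assumptions, not part of the package, and the value types are inferred from the example output above.

```python
from typing import TypedDict


class ExtractionResult(TypedDict):
    """Shape of one result, following the field table (types are inferred)."""
    is_union: bool
    is_union_score: float
    union_name: str
    union_name_score: float
    f_num: int
    f_num_score: float


def triage(result: ExtractionResult, min_f_num_score: float = 0.9) -> str:
    """Return 'accept', 'review', or 'not_union' for one result.

    The default cut-off is an arbitrary illustration, not a recommended value.
    """
    if not result["is_union"]:
        return "not_union"
    if result["f_num_score"] >= min_f_num_score:
        return "accept"
    return "review"


# Values copied from the extract() example earlier in this README.
example: ExtractionResult = {
    "f_num": 31847,
    "f_num_score": 0.9500725865364075,
    "is_union": True,
    "is_union_score": 0.9268560409545898,
    "union_name": "SERVICE EMPLOYEES",
    "union_name_score": 0.9972871541976929,
}
print(triage(example))
```

A suitable threshold depends on how costly a wrong f_num is for the downstream task, so it is worth choosing it against a labeled sample.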

Training

Training data and scripts are in training/. The pipeline is orchestrated by the root Makefile:

pip install -e ".[train]"   # Install training dependencies

make data                   # Download opdr.db, generate gazetteer and training data
make train                  # Train ArcFace classifier and union detector
make evaluate               # Run evaluation
make all                    # Full pipeline (data + train)

Checked-in Data

training/data/labeled_data.csv — labeled union name examples
training/data/nonunion_examples.csv — non-union text examples
training/data/acronym_to_fullname.csv — union acronym mappings

Model Architecture

The model uses a two-stage pipeline:

Input: "SEIU Local 1199"
              │
              ▼
┌───────────────────────────────────────────────────┐
│  Tokenizer                                        │
│  tokens: ["seiu", "local", "1199"]                │
│  is_num: [False, False, True]                     │
│  + FastText char n-gram hashes + Bloom number IDs │
└───────────────────────────────────────────────────┘
              │
              ▼
┌───────────────────────────────────────────────────┐
│  Stage 1: Union Detection (Contrastive)           │
│                                                   │
│  FastText + Bloom + RoPE Transformer (2 layers)   │
│  → Mean pool → Projection → L2 normalize          │
│  → Cosine similarity to learned union prototype   │
│  → Platt scaling: sigmoid(a·sim + b)              │
│                                                   │
│  is_union_score = 0.93 → is_union = True          │
└───────────────────────────────────────────────────┘
              │
              ▼ (always runs)
┌───────────────────────────────────────────────────┐
│  Stage 2: Factored ArcFace Classifier             │
│                                                   │
│  FastText + Bloom + RoPE Transformer (3 layers)   │
│  → Mean pool → L2 normalize                       │
│                                                   │
│  Score against ~38K factored prototypes:          │
│  prototype = W_union + W_desig + bloom(num)       │
│            + W_prefix + W_suffix + W_fnum         │
│  (~17K trained + ~18K zero-shot from gazetteer)   │
│                                                   │
│  Match: SERVICE EMPLOYEES LU 1199 → f_num=31847   │
└───────────────────────────────────────────────────┘
              │
              ▼
Output: {is_union: True, union_name: "SERVICE EMPLOYEES",
         f_num: 31847, f_num_score: 0.95, ...}
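The tokenizer and Stage 1 boxes in the diagram can be illustrated with a toy sketch. The splitting rule, the vectors, and the calibration constants a and b below are all assumptions chosen for illustration; the real tokenizer and learned parameters may differ.

```python
import math
import re


def tokenize(text: str) -> tuple[list[str], list[bool]]:
    """Lowercase, then split into alphabetic and numeric tokens (assumed rule)."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    is_num = [tok.isdigit() for tok in tokens]
    return tokens, is_num


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def platt(sim: float, a: float, b: float) -> float:
    """Platt scaling: sigmoid(a * sim + b)."""
    return 1.0 / (1.0 + math.exp(-(a * sim + b)))


tokens, is_num = tokenize("SEIU Local 1199")
print(tokens, is_num)

# Toy pooled embedding and union prototype; a and b are made-up constants.
embedding = [0.6, 0.8, 0.0]
prototype = [0.6, 0.8, 0.0]
score = platt(cosine(embedding, prototype), a=4.0, b=-1.0)
print(round(score, 3))
```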

Factored Prototypes:

Each f_num's prototype is the sum of learned field embeddings:

prototype = W_union[u] + W_desig_name[d] + bloom(desig_num)
          + W_prefix[p] + W_suffix[s] + W_fnum[f]

This additive structure means the model learns a separate representation for each field. At inference, scoring is a single matrix multiply against ~38K pre-computed prototype vectors. These cover ~35K f_nums: ~17K trained classes plus ~18K zero-shot f_nums from the gazetteer, for which W_fnum = 0. There are more prototypes than f_nums because some f_nums have multiple record variants.
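A toy version of the prototype sum and the scoring step, in plain Python, may make the structure concrete. Everything here is an assumption made for illustration: the 2-d embedding values, the hash-based bloom function, the L2 normalization of prototypes, and the softmax scale are not taken from the package.

```python
import hashlib
import math


def add_vectors(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def bloom(num, table, k=2):
    """Sum k hashed rows of a small table (one possible Bloom-style embedding)."""
    rows = []
    for seed in range(k):
        digest = hashlib.sha256(f"{seed}:{num}".encode()).digest()
        rows.append(table[int.from_bytes(digest[:4], "big") % len(table)])
    return add_vectors(*rows)


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-d embedding tables; every value here is made up for illustration.
W_union = {"SERVICE EMPLOYEES": [1.0, 0.0], "TEAMSTERS": [0.0, 1.0]}
W_desig_name = {"LU": [0.1, 0.1]}
W_prefix = {"": [0.0, 0.0]}
W_suffix = {"": [0.0, 0.0]}
W_fnum = {31847: [0.05, 0.0], 43508: [0.0, 0.05]}
bloom_table = [[0.01, 0.0], [0.0, 0.01], [0.02, 0.02], [0.0, 0.0]]

# (f_num, union, designation name, designation number, prefix, suffix)
records = [
    (31847, "SERVICE EMPLOYEES", "LU", "1199", "", ""),
    (43508, "TEAMSTERS", "LU", "705", "", ""),
]

# prototype = W_union + W_desig_name + bloom(desig_num) + W_prefix + W_suffix + W_fnum
prototypes = [
    l2_normalize(add_vectors(
        W_union[u], W_desig_name[d], bloom(n, bloom_table),
        W_prefix[p], W_suffix[s], W_fnum[f],
    ))
    for f, u, d, n, p, s in records
]

query = l2_normalize([0.9, 0.2])  # stand-in for the encoder output
scale = 10.0  # assumed softmax temperature
# One dot product per prototype; stacked, this is the single matrix multiply.
logits = [scale * sum(q * p for q, p in zip(query, proto)) for proto in prototypes]
probs = softmax(logits)
best = max(range(len(records)), key=lambda i: probs[i])
print(records[best][0], round(probs[best], 3))
```

Since the prototypes depend only on the embedding tables, they can be computed once and cached, which is what allows inference to reduce to one matrix multiply.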

Zero-shot prototypes: For gazetteer f_nums without training data, prototypes are built from field embeddings alone. During training, these are included as frozen distractors in the ArcFace softmax, teaching the model to distinguish trained classes from similar zero-shot prototypes. W_fnum is L2-regularized to keep trained prototypes close to their zero-shot versions.
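The training objective described above can be sketched with the standard ArcFace formulation, in which the target class logit becomes s·cos(θ + m) while every other class, including a frozen zero-shot distractor, keeps s·cos(θ). The scale, margin, cosine values, and regularization weight below are illustrative assumptions, not the project's settings.

```python
import math


def arcface_logits(cosines, target, scale, margin):
    """Scale cosines, adding an angular margin to the target class only."""
    logits = []
    for i, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if i == target:
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def l2_penalty(w_fnum_rows, weight):
    """L2 penalty pulling per-f_num offsets toward zero (the zero-shot case)."""
    return weight * sum(x * x for row in w_fnum_rows for x in row)


# Cosines of one training example to: its own class, another trained class,
# and a frozen zero-shot distractor built from field embeddings alone.
cosines = [0.80, 0.30, 0.75]
scale, margin = 16.0, 0.3

plain_loss = cross_entropy([scale * c for c in cosines], target=0)
margin_loss = cross_entropy(arcface_logits(cosines, 0, scale, margin), target=0)
penalty = l2_penalty([[0.1, -0.2], [0.0, 0.0]], weight=0.5)

print(round(plain_loss, 3), round(margin_loss, 3), round(penalty, 3))
```

The margin makes the target harder to satisfy, so the loss stays high until the example is clearly closer to its own prototype than to the nearby distractor. With W_fnum driven toward zero by the penalty, a trained prototype stays near what its zero-shot version would be.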

Performance

End-to-end on held-out test data (4,437 examples scored against the full ~35K-f_num gazetteer):

Metric	Score
Accuracy	97.8%
is_union accuracy	99.2% (4402/4437)
f_num accuracy (union examples)	98.3% (3804/3868)
f_num accuracy (in-vocab only)	98.3%
union_name accuracy	97.8% (4665/4771)
Wrong match (union, wrong f_num)	64
False negatives (union missed)	8
False positives (non-union matched)	27
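The counts in the table can be cross-checked with a few lines of arithmetic. Reading the 97.8% "Accuracy" row as the share of examples with none of the three error types is an inference from the numbers, not something the table states.

```python
total = 4437
wrong_match = 64       # union, wrong f_num
false_negatives = 8    # union missed
false_positives = 27   # non-union matched

# is_union errors are the false negatives plus the false positives.
is_union_correct = total - (false_negatives + false_positives)
# Assumed definition: an example is fully correct if it has no error at all.
fully_correct = total - (wrong_match + false_negatives + false_positives)

print(is_union_correct, f"{is_union_correct / total:.1%}")
print(3868 - 3804, f"{3804 / 3868:.1%}")
print(fully_correct, f"{fully_correct / total:.1%}")
```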

Project details

Release history

2.0.0 (this version): Apr 29, 2026
0.2.0: Jan 12, 2026
0.1.0: Jan 3, 2026

Download files

Source Distribution

labor_union_parser-2.0.0.tar.gz (77.0 MB)

Uploaded Apr 29, 2026 Source

Built Distribution

labor_union_parser-2.0.0-py3-none-any.whl (77.0 MB)

Uploaded Apr 29, 2026 Python 3

File details

Details for the file labor_union_parser-2.0.0.tar.gz.

File metadata

Download URL: labor_union_parser-2.0.0.tar.gz
Upload date: Apr 29, 2026
Size: 77.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for labor_union_parser-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`74086a6d7bdb243612fa300d06cd65237e10401a58950eb4fa8f9d35536af1d0`
MD5	`fa7be5d2fa3534ffacfba5cdc28ae266`
BLAKE2b-256	`4196069643decb3aa8362a01a2b26777b2ab69965fa168cf3849ba1bb8ddcc88`

Provenance

The following attestation bundles were made for labor_union_parser-2.0.0.tar.gz:

Publisher: build-and-publish.yml on labordata/labor-union-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: labor_union_parser-2.0.0.tar.gz
- Subject digest: 74086a6d7bdb243612fa300d06cd65237e10401a58950eb4fa8f9d35536af1d0
- Sigstore transparency entry: 1401397986
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: labordata/labor-union-parser@bd0f74d187160ba66fff692b3d6f0385281eb5c1
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/labordata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: build-and-publish.yml@bd0f74d187160ba66fff692b3d6f0385281eb5c1
- Trigger Event: release

File details

Details for the file labor_union_parser-2.0.0-py3-none-any.whl.

File metadata

Download URL: labor_union_parser-2.0.0-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 77.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for labor_union_parser-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad1d262c3d00ab80d163d43839ed4aeb1af79f122a2c7ab011003d7cc6cf2af7`
MD5	`f4b999bbfc83d2b0d1d31372ad847975`
BLAKE2b-256	`92c2d6b1f7cb3f1425be565bbc60db34dcf991e0b789c9b4a4d8d68a7c445183`

Provenance

The following attestation bundles were made for labor_union_parser-2.0.0-py3-none-any.whl:

Publisher: build-and-publish.yml on labordata/labor-union-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: labor_union_parser-2.0.0-py3-none-any.whl
- Subject digest: ad1d262c3d00ab80d163d43839ed4aeb1af79f122a2c7ab011003d7cc6cf2af7
- Sigstore transparency entry: 1401398082
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: labordata/labor-union-parser@bd0f74d187160ba66fff692b3d6f0385281eb5c1
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/labordata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: build-and-publish.yml@bd0f74d187160ba66fff692b3d6f0385281eb5c1
- Trigger Event: release
