
Extract affiliation and local designation from labor union names


Labor Union Parser

Extract affiliation and local designation from labor union name strings.

Given an input like "SEIU Local 1199", the parser returns:

  • Affiliation: SEIU (Service Employees International Union)
  • Designation: 1199 (local number)

Installation

pip install -e .

Usage

Command Line

# Basic usage
labor-union-parser "SEIU Local 1199"
# Output:
# Affiliation: SEIU
# Designation: 1199

# JSON output
labor-union-parser "Teamsters Local 705" --json
# Output: {"affiliation": "IBT", "designation": "705"}
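The JSON form is convenient for scripting. The following is a minimal sketch of consuming one line of that output from Python. The `ParsedUnion` dataclass and `parse_cli_json` helper are hypothetical names introduced for illustration, and the only thing assumed about the CLI is the two-key object shown above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedUnion:
    """Hypothetical container for one parsed union name."""
    affiliation: Optional[str]
    designation: Optional[str]


def parse_cli_json(line: str) -> ParsedUnion:
    """Parse one JSON object in the shape printed by the --json example."""
    payload = json.loads(line)
    # .get() tolerates a missing key; whether the CLI ever omits a key
    # is an assumption, not something the examples above document.
    return ParsedUnion(payload.get("affiliation"), payload.get("designation"))


parsed = parse_cli_json('{"affiliation": "IBT", "designation": "705"}')
print(parsed.affiliation, parsed.designation)
```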

Python API

from labor_union_parser import extract

result = extract("UAW Local 600")
print(result)
# {'affiliation': 'UAW', 'designation': '600'}

For batch processing:

from labor_union_parser import Extractor

extractor = Extractor()
results = extractor.extract_batch([
    "SEIU Local 1199",
    "Teamsters Local 705",
    "UAW Local 600",
])
# [{'affiliation': 'SEIU', 'designation': '1199'},
#  {'affiliation': 'IBT', 'designation': '705'},
#  {'affiliation': 'UAW', 'designation': '600'}]
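Batch results are plain dictionaries, so they can be written out with the standard library. The sketch below pairs each input name with its result and writes CSV rows. It assumes results come back in the same order as the inputs, as the example output suggests, and `write_results_csv` is a hypothetical helper rather than part of the package.

```python
import csv
import io


def write_results_csv(names, results, handle):
    """Write one CSV row per (input name, result dict) pair."""
    writer = csv.DictWriter(
        handle, fieldnames=["text", "affiliation", "designation"]
    )
    writer.writeheader()
    for name, result in zip(names, results):
        # DictWriter raises ValueError if a result carries unexpected keys.
        writer.writerow({"text": name, **result})


names = ["SEIU Local 1199", "Teamsters Local 705"]
# Results copied from the example output above, not computed here.
results = [
    {"affiliation": "SEIU", "designation": "1199"},
    {"affiliation": "IBT", "designation": "705"},
]

buffer = io.StringIO()
write_results_csv(names, results, buffer)
print(buffer.getvalue())
```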

For large datasets, use extract_all, which returns a generator that yields results one at a time instead of building the full list in memory:

from labor_union_parser import Extractor

extractor = Extractor()

# A list of name strings; replace with your own data
union_names = ["SEIU Local 1199", "Teamsters Local 705", "UAW Local 600"]

# Process a large list with a progress bar
for result in extractor.extract_all(union_names, show_progress=True):
    print(result)

# Adjust batch size to trade memory for speed
results = list(extractor.extract_all(union_names, batch_size=512))
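To make the batch size trade-off concrete, here is a generic sketch of lazy batching: only one batch of inputs is held in memory at a time, and larger batches mean fewer passes but more memory per pass. This illustrates the general pattern only. It is not the package's implementation of extract_all, and `iter_batches` is a hypothetical helper.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)
```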

Training

Training data is in training/data/labeled_data.csv with columns:

  • text: Union name string
  • aff_abbr: Affiliation abbreviation (e.g., "SEIU", "IBT", "UAW")
  • desig_num: Local designation number
  • split: One of "train", "val", or "test"
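A quick way to inspect a file in this layout is to count rows per split with the standard library. The sketch below runs on a small in-memory sample whose rows are made up for illustration; `count_splits` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_splits(handle):
    """Count rows per value of the 'split' column."""
    reader = csv.DictReader(handle)
    return Counter(row["split"] for row in reader)


# Illustrative rows only; the real file holds the labeled training data.
sample = io.StringIO(
    "text,aff_abbr,desig_num,split\n"
    "SEIU Local 1199,SEIU,1199,train\n"
    "Teamsters Local 705,IBT,705,train\n"
    "UAW Local 600,UAW,600,test\n"
)
counts = count_splits(sample)
print(dict(counts))
```

For the real data, pass `open("training/data/labeled_data.csv", newline="")` in place of the sample.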

To retrain the model:

pip install -e ".[train]"  # Install training dependencies
python training/train.py

The training script will:

  1. Load data from training/data/labeled_data.csv
  2. Train for up to 10 epochs, stopping early based on validation performance
  3. Save the best model to src/labor_union_parser/weights/char_cnn.pt
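The early-stopping step can be sketched generically as follows. This is not the contents of training/train.py: the `patience` value, the `run_epoch` callback, and the validation scores are assumptions made for illustration.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 10,
    patience: int = 3,
) -> Tuple[int, float]:
    """Run epochs until the validation score stops improving.

    run_epoch(epoch) trains one epoch and returns a validation score
    (higher is better). Returns (best_epoch, best_score).
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch in range(max_epochs):
        score = run_epoch(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
            # A real script would save the model checkpoint here.
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch, best_score


# Made-up validation scores standing in for real training.
scores = [0.90, 0.94, 0.96, 0.95, 0.95, 0.95, 0.97]
best = train_with_early_stopping(
    lambda epoch: scores[epoch], max_epochs=len(scores), patience=3
)
print(best)
```

In this run the loop stops after three epochs without improvement, so the final score in the list is never reached.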

Training Data Statistics

Split   Examples
Train   82,900
Val     2,296
Test    2,413

Model Architecture

The model uses a CharCNN architecture with pointer-based designation selection:

Input: "SEIU Local 1199"
         │
         ▼
┌──────────────────────────────────────────────────┐
│  Tokenizer                                       │
│  ["SEIU", " ", "Local", " ", "1199"]             │
│  token_type: [word, space, word, space, number]  │
└──────────────────────────────────────────────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌──────────┐  ┌──────────┐
│ CharCNN  │  │ Special  │
│ (words)  │  │ Embed    │
└────┬─────┘  └────┬─────┘
     └─────┬───────┘
           ▼
┌─────────────────────────────┐
│  Token Embeddings (64-dim)  │
│  + is_number feature (16-dim)│
└─────────────────────────────┘
           │
           ▼
┌─────────────────────────────┐
│  Self-Attention (4 heads)   │
└─────────────────────────────┘
           │
    ┌──────┴──────┐
    ▼             ▼
┌─────────┐  ┌──────────────┐
│ Set     │  │ BiLSTM       │
│ Attn    │  │ (512-dim)    │
│ Pooling │  └──────┬───────┘
└────┬────┘         │
     │              ▼
     │     ┌───────────────────┐
     │     │ + Aff Embedding   │
     │     │ Pointer Selection │
     │     └─────────┬─────────┘
     ▼               ▼
Affiliation     Designation
  "SEIU"          "1199"

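The tokenizer stage at the top of the diagram can be approximated with a regular expression. This is a sketch under assumptions, not the package's tokenizer: the exact token classes, the "punct" type name, and the choice to merge runs of whitespace into one token are all guesses.

```python
import re

# One named group per token type; every character matches exactly one group,
# so joining the tokens reproduces the input.
TOKEN_PATTERN = re.compile(
    r"(?P<word>[A-Za-z]+)"
    r"|(?P<number>[0-9]+)"
    r"|(?P<space>\s+)"
    r"|(?P<punct>[^A-Za-z0-9\s])"
)


def tokenize(text):
    """Split text into tokens and a parallel list of token types."""
    tokens, token_types = [], []
    for match in TOKEN_PATTERN.finditer(text):
        tokens.append(match.group())
        token_types.append(match.lastgroup)  # name of the group that matched
    return tokens, token_types


tokens, token_types = tokenize("SEIU Local 1199")
print(tokens)
print(token_types)
```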
Components

Character CNN (for word tokens)

  • Character embedding: 16-dim
  • Multi-scale 1D convolutions (kernel sizes 2, 3, 4, 5)
  • Max pooling → 64-dim token embedding
  • Typo-robust: embeddings are built from characters rather than a fixed word vocabulary, so misspelled words are not treated as unknown tokens
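The shape flow of this component can be shown in pure Python: embed each character, slide filters of several widths over the character sequence, and keep the maximum response of each filter. The weights are random and untrained, bias terms and nonlinearities are omitted, and the choice of 16 filters per kernel size is an assumption picked so that 4 kernel sizes give 64 dimensions.

```python
import random

CHAR_DIM = 16              # character embedding size, from the list above
KERNEL_SIZES = (2, 3, 4, 5)
FILTERS_PER_KERNEL = 16    # assumption: 4 kernel sizes x 16 filters = 64 dims

rng = random.Random(0)


def random_vector(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


# Untrained embeddings for printable ASCII, plus one for anything else.
char_embedding = {chr(c): random_vector(CHAR_DIM) for c in range(32, 127)}
unknown_char = random_vector(CHAR_DIM)

# filters[k] holds FILTERS_PER_KERNEL weight vectors, each spanning k chars.
filters = {
    k: [random_vector(k * CHAR_DIM) for _ in range(FILTERS_PER_KERNEL)]
    for k in KERNEL_SIZES
}


def embed_word(word):
    """Return a 64-dim embedding: multi-scale 1D convolution + max pooling."""
    chars = [char_embedding.get(ch, unknown_char) for ch in word]
    # Zero-pad short words so the widest kernel has at least one window.
    while len(chars) < max(KERNEL_SIZES):
        chars.append([0.0] * CHAR_DIM)
    features = []
    for k in KERNEL_SIZES:
        # Each window is k consecutive character vectors, flattened.
        windows = [sum(chars[i:i + k], []) for i in range(len(chars) - k + 1)]
        for weights in filters[k]:
            responses = [
                sum(w * x for w, x in zip(weights, window)) for window in windows
            ]
            features.append(max(responses))  # max pooling over positions
    return features


vector = embed_word("Teamsters")
print(len(vector))
```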

Special Token Embedding (for non-words)

  • Lookup table for numbers, punctuation, spaces
  • 64-dim embeddings

Token Features

  • is_number: Binary feature indicating numeric tokens
  • Combined with token embedding via learned 16-dim feature embedding

Self-Attention

  • Multi-head attention (4 heads) over token sequence
  • Allows tokens to attend to each other for context
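As a simplified illustration of tokens attending to each other, the sketch below implements single-head scaled dot-product attention with no learned projections. The model described here uses 4 heads, so this shows only the core weighting step, and the input vectors are made up.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens))
             for d in range(dim)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
print(contextual)
```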

Affiliation Classification

  • Set attention pooling: learned weighted sum of token representations
  • Linear classifier → affiliation label
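The two steps above can be sketched as a softmax-weighted sum followed by a linear layer. All numbers here are hand-picked for illustration: in the trained model the query vector and classifier weights are learned, and the label set is far larger than two entries.

```python
import math


def attention_pool(token_vectors, query):
    """Pool token vectors into one vector using softmax attention weights."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in token_vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(token_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return pooled, weights


def classify(pooled, class_weights, labels):
    """Linear classifier: one weight row per label, pick the largest logit."""
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]
    return labels[logits.index(max(logits))]


# Made-up inputs: the first token aligns with the query, so it dominates.
token_vectors = [[2.0, 0.0], [0.0, 0.1], [0.0, 0.2]]
query = [1.0, 0.0]
pooled, weights = attention_pool(token_vectors, query)
label = classify(pooled, [[1.0, 0.0], [0.0, 1.0]], ["SEIU", "IBT"])
print(label, [round(w, 3) for w in weights])
```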

Designation Selection (Pointer Network)

  • BiLSTM processes contextualized token embeddings
  • Concatenates with predicted affiliation embedding
  • Scores each numeric token position
  • Includes learnable "null" score for no designation
  • Selects highest-scoring number token
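The final selection step amounts to an argmax over numeric token positions, with the null score acting as a threshold. In the sketch below the scores are hand-picked, whereas in the model they would come from the BiLSTM states combined with the affiliation embedding. `select_designation` is a hypothetical helper.

```python
def select_designation(tokens, token_types, scores, null_score):
    """Return the best-scoring number token, or None if null wins."""
    best_token, best_score = None, null_score
    for token, token_type, score in zip(tokens, token_types, scores):
        # Only numeric tokens are candidates for the designation.
        if token_type == "number" and score > best_score:
            best_token, best_score = token, score
    return best_token


tokens = ["Local", " ", "705", " ", "District", " ", "7"]
types = ["word", "space", "number", "space", "word", "space", "number"]
scores = [0.1, 0.0, 3.2, 0.0, 0.4, 0.0, 1.1]

chosen = select_designation(tokens, types, scores, null_score=0.5)
missing = select_designation(["Teamsters"], ["word"], [0.9], null_score=0.5)
print(chosen, missing)
```

Because word tokens are never candidates, a name without any number always falls through to the null outcome, regardless of its scores.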

Model Statistics

  • Parameters: ~2.5M
  • Inference: CPU or MPS (Apple Silicon)
  • Model file: ~10MB
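The parameter count and file size are consistent with each other if the weights are stored as 32-bit floats. That storage format is an assumption here, not something stated above.

```python
PARAMETERS = 2_500_000    # approximate count from the list above
BYTES_PER_PARAMETER = 4   # assumption: 32-bit floating point weights

size_mb = PARAMETERS * BYTES_PER_PARAMETER / 1_000_000
print(f"{size_mb:.1f} MB")
```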

Performance

On the held-out test set (2,413 examples):

  • Affiliation accuracy: ~97%
  • Designation accuracy: ~98%
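Figures like these can be computed by comparing predictions against gold labels field by field. The sketch below assumes exact-match accuracy per field, which may differ from how the project scores its test set, and uses three made-up examples.

```python
def accuracy(predictions, gold, key):
    """Fraction of examples where predictions[i][key] == gold[i][key]."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, g in zip(predictions, gold) if p[key] == g[key])
    return correct / len(gold)


# Made-up gold labels and predictions; the last designation is missed.
gold = [
    {"affiliation": "SEIU", "designation": "1199"},
    {"affiliation": "IBT", "designation": "705"},
    {"affiliation": "UAW", "designation": "600"},
]
predictions = [
    {"affiliation": "SEIU", "designation": "1199"},
    {"affiliation": "IBT", "designation": "705"},
    {"affiliation": "UAW", "designation": None},
]

aff_acc = accuracy(predictions, gold, "affiliation")
desig_acc = accuracy(predictions, gold, "designation")
print(f"affiliation={aff_acc:.2%} designation={desig_acc:.2%}")
```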

