Lightweight tools for working with GLEIF LEI data: preprocess, load, fuzzy query.

GoodGLEIF

Lightweight tools for working with GLEIF LEI data: preprocess, load, and fuzzy query company information.

Features

Smart Data Loading: Automatically finds and loads GLEIF data from multiple sources
Fuzzy Matching: Advanced company name matching with multiple strategies
Multiple Matching Strategies: Canonical, brief, and best matching approaches

Installation

pip install goodgleif

Quick Start

Basic Company Matching

from goodgleif.companymatcher import CompanyMatcher

# Initialize (data loads automatically on first match)
gg = CompanyMatcher()

# Search for companies (loads data automatically if needed)
matches = gg.match_best("Apple", limit=3, min_score=70)

print("Searching for: 'Apple'")
print("-" * 40)

for i, match in enumerate(matches, 1):
    print(f"{i}. {match['original_name']}")
    print(f"   Score: {match['score']:.1f}")
    print(f"   LEI: {match['lei']}")
    print(f"   Country: {match['country']}")
    print()

Simple Usage (No Path Required)

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()  # Uses default classified data

print("Searching for companies (data loads automatically)...")

# Search for multiple companies
queries = ["Apple", "Microsoft", "Tesla", "Goldman Sachs"]

for query in queries:
    print(f"\nSearching for: '{query}'")
    matches = gg.match_best(query, limit=3, min_score=80)
    
    if matches:
        for i, match in enumerate(matches, 1):
            print(f"  {i}. {match['original_name']} (Score: {match['score']:.1f})")
            print(f"     LEI: {match['lei']} | Country: {match['country']}")
    else:
        print(f"  No matches found for '{query}'")

print("\nSystem automatically used the best available data source!")
print("   - Partitioned files (if available)")
print("   - Single parquet file (fallback)")
print("   - Package resources (if distributed)")
print("   - Helpful error messages (if missing)")

Category-Specific Loading

Load specific industry categories for focused matching:

from goodgleif.companymatcher import CompanyMatcher

# Load only mining companies
mining_matcher = CompanyMatcher(category='obviously_mining')
mining_matches = mining_matcher.match_best("Gold Mining Corp")

# Load only financial companies  
financial_matcher = CompanyMatcher(category='financial')
financial_matches = financial_matcher.match_best("Goldman Sachs")

# Load metals and mining companies
metals_matcher = CompanyMatcher(category='metals_and_mining')
metals_matches = metals_matcher.match_best("Steel Works Inc")
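
A category-specific matcher can miss companies that fall outside its category. One way to handle that is to try the narrow matcher first and fall back to a broader one. The sketch below is not part of GoodGLEIF: `match_with_fallback` is a hypothetical helper, and `FakeMatcher` is a stand-in with a `match_best` method so the example runs without GLEIF data. With the real package you would pass `CompanyMatcher` instances instead.

```python
def match_with_fallback(query, matchers, **kwargs):
    """Try each matcher in order and return the first non-empty result list."""
    for matcher in matchers:
        matches = matcher.match_best(query, **kwargs)
        if matches:
            return matches
    return []


class FakeMatcher:
    """Stand-in for CompanyMatcher (assumption: only match_best is needed)."""

    def __init__(self, names):
        self.names = names

    def match_best(self, query, limit=3, min_score=70):
        # Naive substring test; the real matcher does fuzzy scoring.
        hits = [n for n in self.names if query.lower() in n.lower()]
        return [{"original_name": n, "score": 100.0} for n in hits[:limit]]


mining = FakeMatcher(["Gold Mining Corp"])
everything = FakeMatcher(["Gold Mining Corp", "Goldman Sachs Group"])

# The mining matcher has no hit, so the broader matcher answers.
result = match_with_fallback("Goldman", [mining, everything])
print(result)
```

Ordering the matchers from narrowest to broadest keeps the common case fast and only pays for the larger search when the focused one comes back empty.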

Available Categories

from goodgleif.companymatcher import CompanyMatcher

# See all available categories
matcher = CompanyMatcher()
matcher.show_available_categories()

# List available categories programmatically
categories = CompanyMatcher.list_available_categories()
print(f"Available categories: {categories}")

Category Loading Benefits

Faster Loading: Only load the data you need
Focused Results: Search within specific industries
Smaller Memory Usage: Reduced memory footprint
Better Performance: Faster matching for targeted searches

Matching Strategies

GoodGLEIF offers three different matching strategies:

Canonical Matching

Preserves legal suffixes like "Inc.", "Corp.", "LLC":

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

# Canonical matching (preserves legal suffixes)
canonical_matches = gg.match_canonical("Apple Inc", limit=2)

print("Canonical matching:")
for match in canonical_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")

Brief Matching

Removes legal suffixes for broader matching:

# Brief matching (removes legal suffixes); reuses gg from the previous example
brief_matches = gg.match_brief("Apple Inc", limit=2)

print("Brief matching:")
for match in brief_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")
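
To make the difference between canonical and brief matching concrete, here is a minimal sketch of suffix-preserving versus suffix-stripping normalization. It is not GoodGLEIF's actual implementation: the suffix list and the helper names `canonical_form` and `brief_form` are assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately short suffix list; the real package may use another.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "plc", "gmbh", "sa", "ag"}


def canonical_form(name: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace; legal suffixes stay."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def brief_form(name: str) -> str:
    """Like canonical_form, but trailing legal suffixes are stripped."""
    tokens = canonical_form(name).split()
    # Keep at least one token so a name like "Inc" does not become empty.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(canonical_form("Apple Inc."))    # apple inc
print(brief_form("Apple Inc."))        # apple
print(brief_form("Steel Works, LLC"))  # steel works
```

Under this reading, a query such as "Apple" lines up exactly with the brief form of "Apple Inc." but only partially with its canonical form, which is why brief matching casts a wider net.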

Best Matching

Combines both strategies for optimal results:

# Best matching (combines both); reuses gg from the previous examples
best_matches = gg.match_best("Apple Inc", limit=2)

print("Best matching:")
for match in best_matches:
    canonical_score = match.get('canonical_score', 0)
    brief_score = match.get('brief_score', 0)
    print(f"  {match['original_name']} (Canonical: {canonical_score}, Brief: {brief_score})")
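
One plausible reading of "combines both strategies" is to score the query against both the canonical and the brief form of each candidate and keep the higher score. The sketch below does that with the standard library's `difflib` as a stand-in scorer. GoodGLEIF lists rapidfuzz as a dependency, and how it actually combines the two scores is not stated here, so `similarity` and `best_score` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in for a fuzzy scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_score(query: str, canonical: str, brief: str) -> dict:
    """Score against both name forms and keep the higher one (an assumption)."""
    canonical_score = similarity(query, canonical)
    brief_score = similarity(query, brief)
    return {
        "canonical_score": canonical_score,
        "brief_score": brief_score,
        "score": max(canonical_score, brief_score),
    }


result = best_score("apple", canonical="apple inc", brief="apple")
print(result)
```

Here the brief form matches the query exactly, so the combined score is 100 even though the canonical score is lower.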

Score Threshold Analysis

Analyze how different score thresholds affect your results:

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

query = "Apple"
thresholds = [90, 80, 70, 60]

print(f"Score threshold comparison for: '{query}'")
print("=" * 50)

for min_score in thresholds:
    matches = gg.match_best(query, limit=3, min_score=min_score)
    print(f"\nMin Score {min_score}: {len(matches)} matches")
    for match in matches:
        print(f"  {match['original_name']} (Score: {match['score']})")
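
A similar comparison can be made from a single result list by filtering locally, provided the original query used a low enough `min_score` and a large enough `limit`. This sketch assumes only that each match is a dictionary with a numeric `score` key; `count_by_threshold` is a hypothetical helper and the sample data is made up.

```python
def count_by_threshold(matches, thresholds):
    """Return {threshold: number of matches whose score is >= threshold}."""
    return {t: sum(1 for m in matches if m["score"] >= t) for t in thresholds}


# Made-up sample data in the documented match-dictionary shape.
sample = [
    {"original_name": "Example One Inc", "score": 95.0},
    {"original_name": "Example Two LLC", "score": 82.5},
    {"original_name": "Example Three Ltd", "score": 64.0},
]

counts = count_by_threshold(sample, [90, 80, 70, 60])
print(counts)  # {90: 1, 80: 2, 70: 2, 60: 3}
```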

API Reference

CompanyMatcher

The main class for company matching operations.

Constructor

CompanyMatcher(parquet_path=None, category=None): Initialize with an optional parquet file path or a specific category

Methods

match_best(query, limit=3, min_score=70): Find best matches using combined strategy (loads data automatically)
match_canonical(query, limit=3, min_score=70): Find matches preserving legal suffixes (loads data automatically)
match_brief(query, limit=3, min_score=70): Find matches removing legal suffixes (loads data automatically)
show_available_categories(): Display all available categories with descriptions
list_available_categories(): Return a list of available category names

Parameters

query: Company name to search for
limit: Maximum number of results to return
min_score: Minimum score threshold (0-100)

Returns

List of match dictionaries containing:

original_name: Company name from GLEIF database
score: Match confidence score (0-100, the same scale as min_score)
lei: Legal Entity Identifier
country: Country code
canonical_name: Standardized company name
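
For code that consumes these dictionaries, a typed description can help editors and type checkers. The sketch below uses the field names listed above; the Python types are assumptions, and `Match` and `format_match` are hypothetical names that the package does not export.

```python
from typing import TypedDict


class Match(TypedDict):
    """Shape of one result, using the documented field names (types assumed)."""

    original_name: str
    score: float
    lei: str
    country: str
    canonical_name: str


def format_match(match: Match) -> str:
    """Render one match as a single summary line."""
    return (
        f"{match['original_name']} [{match['country']}] "
        f"LEI={match['lei']} score={match['score']:.1f}"
    )


# Placeholder values, not a real GLEIF record.
example: Match = {
    "original_name": "Example Holdings Inc",
    "score": 91.5,
    "lei": "EXAMPLELEI0000000000",
    "country": "US",
    "canonical_name": "example holdings inc",
}
line = format_match(example)
print(line)
```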

Data Sources

GoodGLEIF automatically detects and uses the best available data source:

Partitioned Files: GitHub-friendly partitioned parquet files (preferred)
Single Parquet File: Fallback to single large parquet file
Package Resources: Embedded data if distributed
Error Messages: Helpful guidance if data is missing
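
The preference order above can be pictured as a simple fallback chain. The following sketch is illustrative only: the directory layout, the file name `gleif.parquet`, and the function `resolve_data_source` are assumptions, not GoodGLEIF's real paths or API, and the package-resources step is left as a comment.

```python
import tempfile
from pathlib import Path


def resolve_data_source(data_dir: Path) -> str:
    """Pick a source in the documented preference order (names are hypothetical)."""
    partitions = sorted(data_dir.glob("partitions/*.parquet"))
    if partitions:
        return f"partitioned ({len(partitions)} file(s))"
    single = data_dir / "gleif.parquet"
    if single.exists():
        return f"single file ({single.name})"
    # A real loader would try embedded package resources here before giving up.
    raise FileNotFoundError(f"No GLEIF data found under {data_dir}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "gleif.parquet").touch()
    first = resolve_data_source(root)   # only the single file exists
    (root / "partitions").mkdir()
    (root / "partitions" / "part-0.parquet").touch()
    second = resolve_data_source(root)  # partitions now take precedence

print(first)   # single file (gleif.parquet)
print(second)  # partitioned (1 file(s))
```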

Examples

All examples ship with the package as callable functions. Their source is at: https://github.com/microprediction/goodgleif/tree/main/goodgleif/examples

# Run examples directly
from goodgleif.examples.basic_matching_example import basic_matching_example
from goodgleif.examples.matching_strategies_example import matching_strategies_example
from goodgleif.examples.score_thresholds_example import score_thresholds_example
from goodgleif.examples.simple_usage_example import simple_usage_example
from goodgleif.examples.exchange_matching_example import exchange_matching_example

# Run with custom parameters
matches = basic_matching_example("Tesla", limit=5, min_score=85)
strategies = matching_strategies_example("Microsoft Corporation")

Development

Running Examples

# Run individual examples
python -m goodgleif.examples.basic_matching_example
python -m goodgleif.examples.simple_usage_example
python -m goodgleif.examples.comprehensive_example
python -m goodgleif.examples.lei_extraction_example
python -m goodgleif.examples.matching_strategies_example
python -m goodgleif.examples.score_thresholds_example
python -m goodgleif.examples.exchange_matching_example

# Run all examples as a test suite
python -m goodgleif.examples.run_all_examples

Testing

# Run all tests
pytest

# Run example tests specifically
pytest tests/goodgleif/examples/

Requirements

Python >= 3.9
pandas >= 2.1
pyarrow >= 14.0
rapidfuzz >= 3.6
platformdirs >= 4.2
pyyaml >= 6.0.1

License

See LICENSE file for details.

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

Support

For issues and questions, please use the GitHub issue tracker.

