GoodGLEIF

Lightweight tools for working with GLEIF LEI data: preprocess, load, and fuzzy query company information.

Features

  • Smart Data Loading: Automatically finds and loads GLEIF data from multiple sources
  • Fuzzy Matching: Advanced company name matching with multiple strategies
  • Multiple Matching Strategies: Canonical, brief, and best matching approaches

Installation

pip install goodgleif

Quick Start

Basic Company Matching

from goodgleif.companymatcher import CompanyMatcher

# Initialize (data loads automatically on first match)
gg = CompanyMatcher()

# Search for companies (loads data automatically if needed)
matches = gg.match_best("Apple", limit=3, min_score=70)

print("Searching for: 'Apple'")
print("-" * 40)

for i, match in enumerate(matches, 1):
    print(f"{i}. {match['original_name']}")
    print(f"   Score: {match['score']}")
    print(f"   LEI: {match['lei']}")
    print(f"   Country: {match['country']}")
    print()

Simple Usage (No Path Required)

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()  # Uses default classified data

print("Searching for companies (data loads automatically)...")

# Search for multiple companies
queries = ["Apple", "Microsoft", "Tesla", "Goldman Sachs"]

for query in queries:
    print(f"\nSearching for: '{query}'")
    matches = gg.match_best(query, limit=3, min_score=80)

    if matches:
        for i, match in enumerate(matches, 1):
            print(f"  {i}. {match['original_name']} (Score: {match['score']:.1f})")
            print(f"     LEI: {match['lei']} | Country: {match['country']}")
    else:
        print(f"  No matches found for '{query}'")

print("\nThe system automatically used the best available data source:")
print("   - Partitioned files (if available)")
print("   - Single parquet file (fallback)")
print("   - Package resources (if distributed)")
print("   - Helpful error messages (if missing)")

Category-Specific Loading

Load specific industry categories for focused matching:

from goodgleif.companymatcher import CompanyMatcher

# Load only mining companies
mining_matcher = CompanyMatcher(category='obviously_mining')
mining_matches = mining_matcher.match_best("Gold Mining Corp")

# Load only financial companies  
financial_matcher = CompanyMatcher(category='financial')
financial_matches = financial_matcher.match_best("Goldman Sachs")

# Load metals and mining companies
metals_matcher = CompanyMatcher(category='metals_and_mining')
metals_matches = metals_matcher.match_best("Steel Works Inc")

Available Categories

from goodgleif.companymatcher import CompanyMatcher

# Print all available categories with descriptions
matcher = CompanyMatcher()
matcher.show_available_categories()

# List available category names programmatically
categories = CompanyMatcher.list_available_categories()
print(f"Available categories: {categories}")

Category Loading Benefits

  • Faster Loading: Only load the data you need
  • Focused Results: Search within specific industries
  • Smaller Memory Usage: Reduced memory footprint
  • Better Performance: Faster matching for targeted searches

Matching Strategies

GoodGLEIF offers three different matching strategies:

Canonical Matching

Preserves legal suffixes like "Inc.", "Corp.", "LLC":

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

# Canonical matching (preserves legal suffixes)
canonical_matches = gg.match_canonical("Apple Inc", limit=2)

print(f"Canonical matching:")
for match in canonical_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")

Brief Matching

Removes legal suffixes for broader matching:

# Brief matching (removes legal suffixes)
brief_matches = gg.match_brief("Apple Inc", limit=2)

print(f"Brief matching:")
for match in brief_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")

Best Matching

Combines both strategies for optimal results:

# Best matching (combines both)
best_matches = gg.match_best("Apple Inc", limit=2)

print(f"Best matching:")
for match in best_matches:
    canonical_score = match.get('canonical_score', 0)
    brief_score = match.get('brief_score', 0)
    print(f"  {match['original_name']} (Canonical: {canonical_score}, Brief: {brief_score})")
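To make the difference between the three strategies concrete, here is a minimal sketch that uses only the standard library. It is not GoodGLEIF's implementation: the suffix list, the `canonical`, `brief`, and `score_pair` helpers, the use of `difflib` as the scorer, and the choice of "maximum of the two scores" as the combined score are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately short suffix list; a real one would be much longer.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "plc"}


def canonical(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace. Suffixes stay."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def brief(name: str) -> str:
    """Canonical form with trailing legal suffixes removed."""
    tokens = canonical(name).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stands in for a fuzzy scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def score_pair(query: str, candidate: str) -> dict:
    """Score one candidate under both normalizations and combine them."""
    canonical_score = similarity(canonical(query), canonical(candidate))
    brief_score = similarity(brief(query), brief(candidate))
    # Assumption: "best" is taken as the higher of the two scores.
    return {
        "canonical_score": canonical_score,
        "brief_score": brief_score,
        "score": max(canonical_score, brief_score),
    }


result = score_pair("Apple", "Apple Inc.")
print(result)
```

In this sketch the query "Apple" scores lower against "Apple Inc." under the canonical form, because the suffix counts as a mismatch, and scores a perfect 100 under the brief form, where the suffix is dropped.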

Score Threshold Analysis

Analyze how different score thresholds affect your results:

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

query = "Apple"
thresholds = [90, 80, 70, 60]

print(f"Score threshold comparison for: '{query}'")
print("=" * 50)

for min_score in thresholds:
    matches = gg.match_best(query, limit=3, min_score=min_score)
    print(f"\nMin Score {min_score}: {len(matches)} matches")
    for match in matches:
        print(f"  {match['original_name']} (Score: {match['score']})")
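The loop above runs one search per threshold. If only the counts matter, one alternative is to search once at the lowest threshold and bucket the results locally. The sketch below assumes each match is a dictionary with a numeric `score` key, as described in the API reference; `count_by_threshold` is a hypothetical helper, and the sample scores are made up so the snippet runs without the GLEIF data.

```python
def count_by_threshold(matches, thresholds):
    """Return {threshold: number of matches whose score is >= threshold}."""
    return {t: sum(1 for m in matches if m["score"] >= t) for t in thresholds}


# Made-up scores standing in for the result of a single match_best call.
sample = [{"score": 95.0}, {"score": 82.5}, {"score": 71.0}]

counts = count_by_threshold(sample, [90, 80, 70, 60])
for threshold, n in counts.items():
    print(f"Min Score {threshold}: {n} matches")
```

This gives the same counts as repeated searches only if results come back ordered by score and `limit` is the same for every call, which is an assumption rather than documented behavior.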

API Reference

CompanyMatcher

The main class for company matching operations.

Constructor

  • CompanyMatcher(parquet_path=None, category=None): Initialize with optional path or specific category

Methods

  • match_best(query, limit=3, min_score=70): Find best matches using combined strategy (loads data automatically)
  • match_canonical(query, limit=3, min_score=70): Find matches preserving legal suffixes (loads data automatically)
  • match_brief(query, limit=3, min_score=70): Find matches removing legal suffixes (loads data automatically)
  • show_available_categories(): Display all available categories with descriptions
  • list_available_categories(): Return list of available category names

Parameters

  • query: Company name to search for
  • limit: Maximum number of results to return
  • min_score: Minimum score threshold (0-100)

Returns

List of match dictionaries containing:

  • original_name: Company name from GLEIF database
  • score: Match confidence score
  • lei: Legal Entity Identifier
  • country: Country code
  • canonical_name: Standardized company name
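Because each match is a plain dictionary, downstream filtering needs no special API. The sketch below shows one possible acceptance rule: take the top match only when it clears a score threshold and leads the runner-up by a margin. `pick_unambiguous` and its default values are hypothetical, and the sample records use placeholder names and LEIs rather than real GLEIF data.

```python
from typing import Optional


def pick_unambiguous(matches, min_score=90.0, min_margin=5.0) -> Optional[dict]:
    """Return the top match if it is both strong and clearly ahead, else None."""
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    if not ranked or ranked[0]["score"] < min_score:
        return None
    # Reject when the runner-up is too close to call.
    if len(ranked) > 1 and ranked[0]["score"] - ranked[1]["score"] < min_margin:
        return None
    return ranked[0]


# Placeholder records shaped like the documented match dictionaries.
sample = [
    {"original_name": "Example Holdings Ltd", "score": 88.0,
     "lei": "PLACEHOLDER-LEI-2", "country": "GB"},
    {"original_name": "Example Corp", "score": 97.0,
     "lei": "PLACEHOLDER-LEI-1", "country": "US"},
]

best = pick_unambiguous(sample)
print(best["original_name"] if best else "no confident match")
```

A rule like this is a design choice, not something the package prescribes; suitable thresholds depend on how costly a wrong LEI assignment is for your use case.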

Data Sources

GoodGLEIF automatically detects and uses the best available data source:

  1. Partitioned Files: GitHub-friendly partitioned parquet files (preferred)
  2. Single Parquet File: Fallback to single large parquet file
  3. Package Resources: Embedded data if distributed
  4. Error Messages: Helpful guidance if data is missing
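The ordered fallback above can be sketched roughly as follows. This is an illustration of the pattern only, not the package's loader: the `resolve_data_source` function, the `partitions/` directory, and the `gleif.parquet` file name are all hypothetical, and the package-resources step is omitted for brevity.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def resolve_data_source(data_dir: Path) -> Tuple[str, List[Path]]:
    """Return (kind, paths) for the first available source, or raise."""
    # 1. Prefer partitioned files (hypothetical layout).
    partitions = sorted(data_dir.glob("partitions/*.parquet"))
    if partitions:
        return ("partitioned", partitions)
    # 2. Fall back to a single parquet file (hypothetical name).
    single = data_dir / "gleif.parquet"
    if single.is_file():
        return ("single", [single])
    # 3. Nothing found: fail with guidance instead of a bare error.
    raise FileNotFoundError(
        f"No GLEIF data found under {data_dir}. "
        "Expected partitions/*.parquet or gleif.parquet."
    )


# Demonstrate with an empty placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "gleif.parquet").touch()
    kind, paths = resolve_data_source(root)
    print(kind)
```

Checking sources in a fixed order keeps the caller's code identical regardless of how the data was installed.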

Examples

The examples ship with the package; their source is at: https://github.com/microprediction/goodgleif/tree/main/goodgleif/examples

Each example can also be imported and called as a function:

# Run examples directly
from goodgleif.examples.basic_matching_example import basic_matching_example
from goodgleif.examples.matching_strategies_example import matching_strategies_example
from goodgleif.examples.score_thresholds_example import score_thresholds_example
from goodgleif.examples.simple_usage_example import simple_usage_example
from goodgleif.examples.exchange_matching_example import exchange_matching_example

# Run with custom parameters
matches = basic_matching_example("Tesla", limit=5, min_score=85)
strategies = matching_strategies_example("Microsoft Corporation")

Development

Running Examples

# Run individual examples
python -m goodgleif.examples.basic_matching_example
python -m goodgleif.examples.simple_usage_example
python -m goodgleif.examples.comprehensive_example
python -m goodgleif.examples.lei_extraction_example
python -m goodgleif.examples.matching_strategies_example
python -m goodgleif.examples.score_thresholds_example
python -m goodgleif.examples.exchange_matching_example

# Run the full example suite
python -m goodgleif.examples.run_all_examples

Testing

# Run all tests
pytest

# Run example tests specifically
pytest tests/goodgleif/examples/

Requirements

  • Python >= 3.9
  • pandas >= 2.1
  • pyarrow >= 14.0
  • rapidfuzz >= 3.6
  • platformdirs >= 4.2
  • pyyaml >= 6.0.1

License

See the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Support

For issues and questions, please use the GitHub issue tracker.
