GoodGLEIF

Lightweight tools for working with GLEIF LEI data: preprocess, load, and fuzzy query company information.

Features

  • Smart Data Loading: Automatically finds and loads GLEIF data from multiple sources
  • Fuzzy Matching: Approximate company name matching, scored from 0 to 100
  • Multiple Matching Strategies: Canonical, brief, and best matching approaches

Installation

pip install goodgleif

Quick Start

Basic Company Matching

from goodgleif.companymatcher import CompanyMatcher

# Initialize (data loads automatically on first match)
gg = CompanyMatcher()

# Search for companies (loads data automatically if needed)
matches = gg.match_best("Apple", limit=3, min_score=70)

print("Searching for: 'Apple'")
print("-" * 40)

for i, match in enumerate(matches, 1):
    print(f"{i}. {match['original_name']}")
    print(f"   Score: {match['score']}")
    print(f"   LEI: {match['lei']}")
    print(f"   Country: {match['country']}")
    print()

Simple Usage (No Path Required)

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()  # Uses default classified data

print("Searching for companies (data loads automatically)...")

# Search for multiple companies
queries = ["Apple", "Microsoft", "Tesla", "Goldman Sachs"]

for query in queries:
    print(f"\nSearching for: '{query}'")
    matches = gg.match_best(query, limit=3, min_score=80)
    
    if matches:
        for i, match in enumerate(matches, 1):
            print(f"  {i}. {match['original_name']} (Score: {match['score']:.1f})")
            print(f"     LEI: {match['lei']} | Country: {match['country']}")
    else:
        print(f"  No matches found for '{query}'")

print("\nThe system automatically used the best available data source:")
print("   - Partitioned files (if available)")
print("   - Single parquet file (fallback)")
print("   - Package resources (if distributed)")
print("   - Helpful error messages (if missing)")

Category-Specific Loading

Load specific industry categories for focused matching:

from goodgleif.companymatcher import CompanyMatcher

# Load only mining companies
mining_matcher = CompanyMatcher(category='obviously_mining')
mining_matches = mining_matcher.match_best("Gold Mining Corp")

# Load only financial companies  
financial_matcher = CompanyMatcher(category='financial')
financial_matches = financial_matcher.match_best("Goldman Sachs")

# Load metals and mining companies
metals_matcher = CompanyMatcher(category='metals_and_mining')
metals_matches = metals_matcher.match_best("Steel Works Inc")

Available Categories

from goodgleif.companymatcher import CompanyMatcher

# See all available categories
matcher = CompanyMatcher()
matcher.show_available_categories()

# List available categories programmatically
categories = CompanyMatcher.list_available_categories()
print(f"Available categories: {categories}")

Category Loading Benefits

  • Faster Loading: Only load the data you need
  • Focused Results: Search within specific industries
  • Smaller Memory Usage: Reduced memory footprint
  • Better Performance: Faster matching for targeted searches
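
These benefits come from shrinking the candidate pool before any fuzzy comparison runs. The following is a minimal sketch of that idea, not GoodGLEIF's implementation: the records are made up, `filter_by_category` and `rank` are hypothetical helpers, and the standard library's `difflib` stands in for the package's real scorer.

```python
from difflib import SequenceMatcher

# Toy records for illustration only; names and categories are made up.
RECORDS = [
    {"name": "Northern Gold Mining Corp", "category": "metals_and_mining"},
    {"name": "Golden Gate Bank", "category": "financial"},
    {"name": "Gold Coast Mining Ltd", "category": "metals_and_mining"},
    {"name": "Goldfield Capital Partners", "category": "financial"},
]


def filter_by_category(records, category):
    """Keep only the records tagged with the requested category."""
    return [r for r in records if r["category"] == category]


def rank(query, records):
    """Score every candidate against the query (0-100), best first."""
    scored = [
        (SequenceMatcher(None, query.lower(), r["name"].lower()).ratio() * 100, r["name"])
        for r in records
    ]
    return sorted(scored, reverse=True)


mining_only = filter_by_category(RECORDS, "metals_and_mining")
print(f"{len(RECORDS)} candidates in total, {len(mining_only)} after filtering")
for score, name in rank("Gold Mining Corp", mining_only):
    print(f"  {name} ({score:.1f})")
```

Every comparison skipped is time and memory saved, and unrelated industries can no longer crowd the result list.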

Matching Strategies

GoodGLEIF offers three different matching strategies:

Canonical Matching

Preserves legal suffixes like "Inc.", "Corp.", "LLC":

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

# Canonical matching (preserves legal suffixes)
canonical_matches = gg.match_canonical("Apple Inc", limit=2)

print("Canonical matching:")
for match in canonical_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")

Brief Matching

Removes legal suffixes for broader matching:

# Brief matching (removes legal suffixes); reuses `gg` from the canonical example
brief_matches = gg.match_brief("Apple Inc", limit=2)

print("Brief matching:")
for match in brief_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")

Best Matching

Combines both strategies for optimal results:

# Best matching (combines both); reuses `gg` from the canonical example
best_matches = gg.match_best("Apple Inc", limit=2)

print("Best matching:")
for match in best_matches:
    canonical_score = match.get('canonical_score', 0)
    brief_score = match.get('brief_score', 0)
    print(f"  {match['original_name']} (Canonical: {canonical_score}, Brief: {brief_score})")
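
To make the difference between the three strategies concrete, here is a rough sketch of how they could relate. Everything in it is an assumption made for illustration: the suffix list, the normalization rules, the use of `difflib` as the scorer, and taking the maximum of the two scores as the "best" score. GoodGLEIF's actual rules may differ.

```python
import re
from difflib import SequenceMatcher

# Assumed, deliberately short suffix list; the real package may use another.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "limited", "plc"}


def canonical_form(name):
    """Lowercase and drop punctuation, keeping legal suffixes."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def brief_form(name):
    """Canonical form with trailing legal suffixes removed."""
    tokens = canonical_form(name).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stands in for the real scorer)."""
    return SequenceMatcher(None, a, b).ratio() * 100


def score_pair(query, candidate):
    """Score one candidate under both forms; 'score' is the higher of the two (assumption)."""
    canonical = similarity(canonical_form(query), canonical_form(candidate))
    brief = similarity(brief_form(query), brief_form(candidate))
    return {"canonical_score": canonical, "brief_score": brief, "score": max(canonical, brief)}


for candidate in ["Apple Inc.", "Apple Corp", "Apple Bank LLC"]:
    s = score_pair("Apple Inc", candidate)
    print(f"{candidate}: canonical={s['canonical_score']:.1f}, "
          f"brief={s['brief_score']:.1f}, best={s['score']:.1f}")
```

In this sketch "Apple Inc" and "Apple Corp" match perfectly under the brief form but not under the canonical form, which is the trade-off the strategies expose: brief matching is broader, canonical matching is stricter about the legal entity type.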

Score Threshold Analysis

Analyze how different score thresholds affect your results:

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

query = "Apple"
thresholds = [90, 80, 70, 60]

print(f"Score threshold comparison for: '{query}'")
print("=" * 50)

for min_score in thresholds:
    matches = gg.match_best(query, limit=3, min_score=min_score)
    print(f"\nMin Score {min_score}: {len(matches)} matches")
    for match in matches:
        print(f"  {match['original_name']} (Score: {match['score']})")
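
The effect of `min_score` can also be reasoned about without loading any data. The sketch below applies thresholds to a hand-written list of already-scored results. The scores are made up, `apply_threshold` is a hypothetical helper, and treating the threshold as inclusive (`>=`) is an assumption about the matcher's behavior.

```python
# Made-up scores for illustration; real values come from the matcher.
scored = [
    {"original_name": "Candidate A", "score": 95.0},
    {"original_name": "Candidate B", "score": 82.5},
    {"original_name": "Candidate C", "score": 71.0},
    {"original_name": "Candidate D", "score": 64.0},
]


def apply_threshold(matches, min_score, limit=None):
    """Keep matches scoring at least min_score, best first, optionally capped at limit."""
    kept = sorted(
        (m for m in matches if m["score"] >= min_score),
        key=lambda m: m["score"],
        reverse=True,
    )
    return kept if limit is None else kept[:limit]


for min_score in (90, 80, 70, 60):
    kept = apply_threshold(scored, min_score)
    names = ", ".join(m["original_name"] for m in kept)
    print(f"Min score {min_score}: {len(kept)} matches ({names})")
```

Lowering the threshold can only add results, so a practical way to tune it is to start high and step down until false positives begin to appear.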

API Reference

CompanyMatcher

The main class for company matching operations.

Constructor

  • CompanyMatcher(parquet_path=None, category=None): Initialize with an optional path to a parquet file or an optional category name

Methods

  • match_best(query, limit=3, min_score=70): Find best matches using combined strategy (loads data automatically)
  • match_canonical(query, limit=3, min_score=70): Find matches preserving legal suffixes (loads data automatically)
  • match_brief(query, limit=3, min_score=70): Find matches removing legal suffixes (loads data automatically)
  • show_available_categories(): Display all available categories with descriptions
  • list_available_categories(): Return list of available category names

Parameters

  • query: Company name to search for
  • limit: Maximum number of results to return
  • min_score: Minimum score threshold (0-100)

Returns

List of match dictionaries containing:

  • original_name: Company name from GLEIF database
  • score: Match confidence score
  • lei: Legal Entity Identifier
  • country: Country code
  • canonical_name: Standardized company name
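
Because each result is a plain dictionary with the keys listed above, downstream code can treat it as a simple record. The sketch below describes that shape with a `TypedDict` and picks the highest-scoring match per country. The `Match` type and `best_per_country` helper are hypothetical, the sample values are placeholders rather than real GLEIF records, and real results may carry additional keys.

```python
from typing import Dict, List, TypedDict


class Match(TypedDict):
    """Shape of one result, based on the documented keys."""
    original_name: str
    score: float
    lei: str
    country: str
    canonical_name: str


def best_per_country(matches: List[Match]) -> Dict[str, Match]:
    """Return the highest-scoring match for each country code."""
    best: Dict[str, Match] = {}
    for m in matches:
        current = best.get(m["country"])
        if current is None or m["score"] > current["score"]:
            best[m["country"]] = m
    return best


# Placeholder records; the LEIs are dummy 20-character strings, not real identifiers.
sample: List[Match] = [
    {"original_name": "Example One Inc.", "score": 96.0, "lei": "0" * 20,
     "country": "US", "canonical_name": "example one inc"},
    {"original_name": "Example One LLC", "score": 88.0, "lei": "1" * 20,
     "country": "US", "canonical_name": "example one llc"},
    {"original_name": "Example One Ltd", "score": 91.0, "lei": "2" * 20,
     "country": "GB", "canonical_name": "example one ltd"},
]

for country, match in sorted(best_per_country(sample).items()):
    print(f"{country}: {match['original_name']} ({match['score']:.1f}) LEI {match['lei']}")
```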

Data Sources

GoodGLEIF looks for data in the following order and uses the first source it finds:

  1. Partitioned Files: GitHub-friendly partitioned parquet files (preferred)
  2. Single Parquet File: Fallback to a single large parquet file
  3. Package Resources: Embedded data, if distributed with the package

If none of these sources is available, GoodGLEIF raises an error with guidance on how to obtain the data.
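
The lookup order described above can be sketched as a simple fallback chain. This is an illustration under stated assumptions, not the package's loader: `resolve_data_source`, the `partitions/` directory, the `gleif.parquet` file name, and the `packaged_file` argument are all hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_data_source(data_dir, packaged_file=None):
    """Return (kind, paths) for the first available source, in preference order."""
    data_dir = Path(data_dir)

    # 1. Partitioned parquet files (preferred).
    partitions = sorted(data_dir.glob("partitions/*.parquet"))
    if partitions:
        return "partitioned", partitions

    # 2. Single large parquet file.
    single = data_dir / "gleif.parquet"
    if single.is_file():
        return "single", [single]

    # 3. Data shipped with the package, if any.
    if packaged_file is not None and Path(packaged_file).is_file():
        return "packaged", [Path(packaged_file)]

    # 4. Nothing found: fail with guidance instead of a bare error.
    raise FileNotFoundError(
        f"No GLEIF data found under {data_dir}. Expected 'partitions/*.parquet' "
        f"or 'gleif.parquet'; download or preprocess the data first."
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "gleif.parquet").touch()
    print(resolve_data_source(root)[0])   # only the single file exists

    (root / "partitions").mkdir()
    (root / "partitions" / "part-0.parquet").touch()
    print(resolve_data_source(root)[0])   # partitions now take precedence
```

Checking the sources in a fixed order keeps the behavior predictable: adding a preferred source never requires removing a fallback.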

Examples

All examples ship with the package and can be browsed at: https://github.com/microprediction/goodgleif/tree/main/goodgleif/examples

Each example is also a callable function:

# Run examples directly
from goodgleif.examples.basic_matching_example import basic_matching_example
from goodgleif.examples.matching_strategies_example import matching_strategies_example
from goodgleif.examples.score_thresholds_example import score_thresholds_example
from goodgleif.examples.simple_usage_example import simple_usage_example
from goodgleif.examples.exchange_matching_example import exchange_matching_example

# Run with custom parameters
matches = basic_matching_example("Tesla", limit=5, min_score=85)
strategies = matching_strategies_example("Microsoft Corporation")

Development

Running Examples

# Run individual examples
python -m goodgleif.examples.basic_matching_example
python -m goodgleif.examples.simple_usage_example
python -m goodgleif.examples.comprehensive_example
python -m goodgleif.examples.lei_extraction_example
python -m goodgleif.examples.matching_strategies_example
python -m goodgleif.examples.score_thresholds_example
python -m goodgleif.examples.exchange_matching_example

# Run the full examples test suite
python -m goodgleif.examples.run_all_examples

Testing

# Run all tests
pytest

# Run example tests specifically
pytest tests/goodgleif/examples/

Requirements

  • Python >= 3.9
  • pandas >= 2.1
  • pyarrow >= 14.0
  • rapidfuzz >= 3.6
  • platformdirs >= 4.2
  • pyyaml >= 6.0.1

License

See LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Support

For issues and questions, please use the GitHub issue tracker.
