GoodGLEIF

Lightweight tools for working with GLEIF LEI data: preprocess, load, and fuzzy query company information.

Features

  • Smart Data Loading: Automatically finds and loads GLEIF data from multiple sources
  • Fuzzy Matching: Advanced company name matching with multiple strategies
  • Multiple Matching Strategies: Canonical, brief, and best matching approaches

Installation

pip install goodgleif

Quick Start

Basic Company Matching

from goodgleif.companymatcher import CompanyMatcher

# Initialize (data loads automatically on first match)
gg = CompanyMatcher()

# Search for companies (loads data automatically if needed)
matches = gg.match_best("Apple", limit=3, min_score=70)

print("Searching for: 'Apple'")
print("-" * 40)

for i, match in enumerate(matches, 1):
    print(f"{i}. {match['original_name']}")
    print(f"   Score: {match['score']}")
    print(f"   LEI: {match['lei']}")
    print(f"   Country: {match['country']}")
    print()

Simple Usage (No Path Required)

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()  # Uses default classified data

print("Searching for companies (data loads automatically)...")

# Search for multiple companies
queries = ["Apple", "Microsoft", "Tesla", "Goldman Sachs"]

for query in queries:
    print(f"\nSearching for: '{query}'")
    matches = gg.match_best(query, limit=3, min_score=80)

    if matches:
        for i, match in enumerate(matches, 1):
            print(f"  {i}. {match['original_name']} (Score: {match['score']:.1f})")
            print(f"     LEI: {match['lei']} | Country: {match['country']}")
    else:
        print(f"  No matches found for '{query}'")

print("\nThe system automatically used the best available data source:")
print("   - Partitioned files (if available)")
print("   - Single parquet file (fallback)")
print("   - Package resources (if distributed)")
print("   - Helpful error messages (if missing)")

Matching Strategies

GoodGLEIF offers three different matching strategies:

Canonical Matching

Preserves legal suffixes like "Inc.", "Corp.", "LLC":

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

# Canonical matching (preserves legal suffixes)
canonical_matches = gg.match_canonical("Apple Inc", limit=2)

print("Canonical matching:")
for match in canonical_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")

Brief Matching

Removes legal suffixes for broader matching:

# Brief matching (removes legal suffixes)
brief_matches = gg.match_brief("Apple Inc", limit=2)

print("Brief matching:")
for match in brief_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")
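
To make the difference between the canonical and brief forms concrete, here is a minimal sketch of the two normalizations. It is an illustration only, not GoodGLEIF's implementation: the function names `canonical_form` and `brief_form` and the short `LEGAL_SUFFIXES` set are hypothetical, and the library's actual rules may differ.

```python
import re

# Hypothetical, deliberately short suffix list; a real one would be much longer.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "limited", "plc", "gmbh"}


def canonical_form(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Legal suffixes are kept, so "Apple Inc." and "Apple LLC" stay distinct.
    """
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def brief_form(name: str) -> str:
    """Canonical form with trailing legal-suffix tokens removed.

    At least one token is always kept, so a name that consists only of a
    suffix is not reduced to an empty string.
    """
    tokens = canonical_form(name).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(canonical_form("Apple Inc."))  # apple inc
print(brief_form("Apple Inc."))      # apple
```

Under these assumptions, a query such as "Apple" can only match "Apple Inc." exactly in the brief form, which is why brief matching casts a wider net.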

Best Matching

Combines both strategies for optimal results:

# Best matching (combines both)
best_matches = gg.match_best("Apple Inc", limit=2)

print("Best matching:")
for match in best_matches:
    canonical_score = match.get('canonical_score', 0)
    brief_score = match.get('brief_score', 0)
    print(f"  {match['original_name']} (Canonical: {canonical_score}, Brief: {brief_score})")
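
One plausible way to combine the two strategies is to merge both result lists by LEI and rank each entity by its higher score. The sketch below assumes exactly that; the helper `merge_best`, the choice of `max` as the combining rule, and the placeholder LEIs are all hypothetical, and GoodGLEIF may weight or rank the scores differently.

```python
def merge_best(canonical, brief, limit=3, min_score=70):
    """Merge two result lists by LEI, ranking each entity by its higher score."""
    merged = {}
    for key, results in (("canonical_score", canonical), ("brief_score", brief)):
        for match in results:
            entry = merged.setdefault(match["lei"], {
                "lei": match["lei"],
                "original_name": match["original_name"],
                "canonical_score": 0,
                "brief_score": 0,
            })
            entry[key] = max(entry[key], match["score"])

    # Assumption: the combined score is simply the better of the two.
    for entry in merged.values():
        entry["score"] = max(entry["canonical_score"], entry["brief_score"])

    ranked = sorted(merged.values(), key=lambda m: m["score"], reverse=True)
    return [m for m in ranked if m["score"] >= min_score][:limit]


# Made-up results with placeholder LEIs, for illustration only.
canonical = [
    {"lei": "LEI-A", "original_name": "Apple Inc.", "score": 95.0},
    {"lei": "LEI-B", "original_name": "Apple Bank", "score": 72.0},
]
brief = [
    {"lei": "LEI-A", "original_name": "Apple Inc.", "score": 100.0},
    {"lei": "LEI-C", "original_name": "Apple Hospitality", "score": 60.0},
]

best = merge_best(canonical, brief)
for m in best:
    print(m["original_name"], m["canonical_score"], m["brief_score"])
```

An entity found by only one strategy keeps a score of 0 for the other, which mirrors the `match.get(..., 0)` defaults used in the example above.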

Score Threshold Analysis

Analyze how different score thresholds affect your results:

from goodgleif.companymatcher import CompanyMatcher

gg = CompanyMatcher()

query = "Apple"
thresholds = [90, 80, 70, 60]

print(f"Score threshold comparison for: '{query}'")
print("=" * 50)

for min_score in thresholds:
    matches = gg.match_best(query, limit=3, min_score=min_score)
    print(f"\nMin Score {min_score}: {len(matches)} matches")
    for match in matches:
        print(f"  {match['original_name']} (Score: {match['score']})")
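
Note that with `limit=3` every count in the loop above is capped at three, so lower thresholds can look identical. An alternative is to query once with the lowest threshold and a larger limit, then count locally. The sketch below shows only the counting step on made-up scores; `count_by_threshold` is a hypothetical helper, not part of GoodGLEIF.

```python
def count_by_threshold(scores, thresholds):
    """Map each threshold to the number of scores at or above it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Made-up scores standing in for a single low-threshold query, e.g.
# scores = [m["score"] for m in gg.match_best("Apple", limit=50, min_score=60)]
scores = [100.0, 91.5, 84.0, 77.2, 63.0]

counts = count_by_threshold(scores, [90, 80, 70, 60])
for threshold, count in counts.items():
    print(f"Min Score {threshold}: {count} matches")
```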

API Reference

CompanyMatcher

The main class for company matching operations.

Methods

  • match_best(query, limit=3, min_score=70): Find best matches using combined strategy (loads data automatically)
  • match_canonical(query, limit=3, min_score=70): Find matches preserving legal suffixes (loads data automatically)
  • match_brief(query, limit=3, min_score=70): Find matches removing legal suffixes (loads data automatically)

Parameters

  • query: Company name to search for
  • limit: Maximum number of results to return
  • min_score: Minimum score threshold (0-100)

Returns

List of match dictionaries containing:

  • original_name: Company name from GLEIF database
  • score: Match confidence score (0-100, the scale used by min_score)
  • lei: Legal Entity Identifier
  • country: Country code
  • canonical_name: Standardized company name
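
For type-checked code, the fields listed above can be described with a `TypedDict`. This is a sketch under assumptions: the `Match` class and the `format_match` helper are hypothetical, the field types (a float score, strings elsewhere) are guesses from the examples in this README, and the sample record uses a placeholder LEI.

```python
from typing import TypedDict


class Match(TypedDict):
    """Assumed shape of one match dictionary; types are guesses."""
    original_name: str
    score: float
    lei: str
    country: str
    canonical_name: str


def format_match(match: Match) -> str:
    """Render a match as a single line for logs or reports."""
    return (
        f"{match['original_name']} [{match['country']}] "
        f"LEI={match['lei']} score={match['score']:.1f}"
    )


# Made-up record with a placeholder LEI, for illustration only.
example: Match = {
    "original_name": "Apple Inc.",
    "score": 97.5,
    "lei": "LEI-A",
    "country": "US",
    "canonical_name": "apple inc",
}
print(format_match(example))
```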

Data Sources

GoodGLEIF automatically detects and uses the best available data source:

  1. Partitioned Files: GitHub-friendly partitioned parquet files (preferred)
  2. Single Parquet File: Fallback to single large parquet file
  3. Package Resources: Embedded data if distributed
  4. Error Messages: Helpful guidance if data is missing
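
The fallback order above can be sketched as a simple chain of existence checks. Everything here is hypothetical: the function `resolve_data_source`, the `partitions/` directory, and the `gleif.parquet` file name are illustrative choices, not the paths GoodGLEIF actually uses.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def resolve_data_source(
    data_dir: Path, packaged_file: Optional[Path] = None
) -> Tuple[str, List[Path]]:
    """Return (kind, files) for the first data source that exists."""
    # 1. Partitioned parquet files (hypothetical layout).
    partitions = sorted((data_dir / "partitions").glob("*.parquet"))
    if partitions:
        return "partitioned", partitions

    # 2. Single parquet file (hypothetical file name).
    single = data_dir / "gleif.parquet"
    if single.is_file():
        return "single", [single]

    # 3. Data shipped inside the package, if the caller provides a path.
    if packaged_file is not None and packaged_file.is_file():
        return "packaged", [packaged_file]

    # 4. Nothing found: fail with guidance rather than a bare traceback.
    raise FileNotFoundError(
        f"No GLEIF data found under {data_dir}. "
        "Expected partitions/*.parquet or gleif.parquet."
    )
```

Checking the cheapest, most specific source first keeps the common case fast and leaves the error message as the single place that explains what was expected.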

Examples

All examples are available as callable functions:

# Run examples directly
from goodgleif.examples.basic_matching_example import basic_matching_example
from goodgleif.examples.matching_strategies_example import matching_strategies_example
from goodgleif.examples.score_thresholds_example import score_thresholds_example
from goodgleif.examples.simple_usage_example import simple_usage_example
from goodgleif.examples.exchange_matching_example import exchange_matching_example

# Run with custom parameters
matches = basic_matching_example("Tesla", limit=5, min_score=85)
strategies = matching_strategies_example("Microsoft Corporation")

Development

Running Examples

# Run individual examples
python -m goodgleif.examples.basic_matching_example
python -m goodgleif.examples.matching_strategies_example
python -m goodgleif.examples.score_thresholds_example
python -m goodgleif.examples.simple_usage_example
python -m goodgleif.examples.exchange_matching_example

Testing

# Run all tests
pytest

# Run example tests specifically
pytest tests/goodgleif/examples/

Requirements

  • Python >= 3.9
  • pandas >= 2.1
  • pyarrow >= 14.0
  • rapidfuzz >= 3.6
  • platformdirs >= 4.2
  • pyyaml >= 6.0.1

License

See LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Support

For issues and questions, please use the GitHub issue tracker.
