GoodGLEIF
Lightweight tools for working with GLEIF LEI data: preprocess, load, and fuzzy query company information.
Features
- Smart Data Loading: Automatically finds and loads GLEIF data from multiple sources
- Fuzzy Matching: Advanced company name matching with multiple strategies
- Multiple Matching Strategies: Canonical, brief, and best matching approaches
Installation
pip install goodgleif
Quick Start
Basic Company Matching
from goodgleif.companymatcher import CompanyMatcher
# Initialize (data loads automatically on first match)
gg = CompanyMatcher()
# Search for companies (loads data automatically if needed)
matches = gg.match_best("Apple", limit=3, min_score=70)
print("Searching for: 'Apple'")
print("-" * 40)
for i, match in enumerate(matches, 1):
    print(f"{i}. {match['original_name']}")
    # Report the match score here, not the canonical name
    print(f"   Score: {match['score']}")
    print(f"   LEI: {match['lei']}")
    print(f"   Country: {match['country']}")
    print()
Simple Usage (No Path Required)
from goodgleif.companymatcher import CompanyMatcher
gg = CompanyMatcher()  # Uses default classified data
print("Searching for companies (data loads automatically)...")
# Search for multiple companies
queries = ["Apple", "Microsoft", "Tesla", "Goldman Sachs"]
for query in queries:
    print(f"\nSearching for: '{query}'")
    matches = gg.match_best(query, limit=3, min_score=80)
    if matches:
        for i, match in enumerate(matches, 1):
            print(f"  {i}. {match['original_name']} (Score: {match['score']:.1f})")
            print(f"     LEI: {match['lei']} | Country: {match['country']}")
    else:
        print(f"  No matches found for '{query}'")
# The matcher picks the first data source it finds, in this order
print("\nSystem automatically used the best available data source!")
print("  - Partitioned files (if available)")
print("  - Single parquet file (fallback)")
print("  - Package resources (if distributed)")
print("  - Helpful error messages (if missing)")
Category-Specific Loading
Load specific industry categories for focused matching:
from goodgleif.companymatcher import CompanyMatcher
# Load only mining companies
mining_matcher = CompanyMatcher(category='obviously_mining')
mining_matches = mining_matcher.match_best("Gold Mining Corp")
# Load only financial companies
financial_matcher = CompanyMatcher(category='financial')
financial_matches = financial_matcher.match_best("Goldman Sachs")
# Load metals and mining companies
metals_matcher = CompanyMatcher(category='metals_and_mining')
metals_matches = metals_matcher.match_best("Steel Works Inc")
Available Categories
# See all available categories
matcher = CompanyMatcher()
matcher.show_available_categories()
# List available categories programmatically
categories = CompanyMatcher.list_available_categories()
print(f"Available categories: {categories}")
Category Loading Benefits
- Faster Loading: Only load the data you need
- Focused Results: Search within specific industries
- Smaller Memory Usage: Reduced memory footprint
- Better Performance: Faster matching for targeted searches
Matching Strategies
GoodGLEIF offers three different matching strategies:
Canonical Matching
Preserves legal suffixes like "Inc.", "Corp.", "LLC":
from goodgleif.companymatcher import CompanyMatcher
gg = CompanyMatcher()
# Canonical matching (preserves legal suffixes)
canonical_matches = gg.match_canonical("Apple Inc", limit=2)
print(f"Canonical matching:")
for match in canonical_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")
Brief Matching
Removes legal suffixes for broader matching:
# Brief matching (removes legal suffixes)
brief_matches = gg.match_brief("Apple Inc", limit=2)
print(f"Brief matching:")
for match in brief_matches:
    print(f"  {match['original_name']} (Score: {match['score']})")
Best Matching
Combines both strategies for optimal results:
# Best matching (combines both)
best_matches = gg.match_best("Apple Inc", limit=2)
print(f"Best matching:")
for match in best_matches:
    canonical_score = match.get('canonical_score', 0)
    brief_score = match.get('brief_score', 0)
    print(f"  {match['original_name']} (Canonical: {canonical_score}, Brief: {brief_score})")
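To make the difference between the three strategies concrete, here is a minimal standalone sketch. It does not use GoodGLEIF's internals, which this README does not show: the suffix list, the normalization rules, the use of the standard library's difflib as a stand-in scorer, and the choice of taking the maximum of the two scores as the "best" score are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list for illustration; the package's real list may differ.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "plc"}


def canonical_form(name: str) -> str:
    """Lowercase and replace punctuation with spaces, keeping legal suffixes."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


def brief_form(name: str) -> str:
    """Like canonical_form, but with trailing legal suffixes removed."""
    tokens = canonical_form(name).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in for a fuzzy scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_score(query: str, candidate: str) -> dict:
    """Score both forms and keep the higher one (an assumed combination rule)."""
    canonical = similarity(canonical_form(query), canonical_form(candidate))
    brief = similarity(brief_form(query), brief_form(candidate))
    return {
        "canonical_score": canonical,
        "brief_score": brief,
        "score": max(canonical, brief),
    }


result = best_score("Apple", "Apple Inc.")
print(result)
```

In this sketch the query "Apple" matches "Apple Inc." exactly in brief form but only partially in canonical form, which is the situation where a combined strategy helps.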
Score Threshold Analysis
Analyze how different score thresholds affect your results:
from goodgleif.companymatcher import CompanyMatcher
gg = CompanyMatcher()
query = "Apple"
thresholds = [90, 80, 70, 60]
print(f"Score threshold comparison for: '{query}'")
print("=" * 50)
for min_score in thresholds:
    matches = gg.match_best(query, limit=3, min_score=min_score)
    print(f"\nMin Score {min_score}: {len(matches)} matches")
    for match in matches:
        print(f"  {match['original_name']} (Score: {match['score']})")
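The same comparison can also be done on a single result list, by querying once at the lowest threshold and counting locally. The sketch below assumes only that each match dictionary has a 'score' key, as listed under Returns; the helper name count_by_threshold and the sample records are made up for illustration and are not real GLEIF data.

```python
def count_by_threshold(matches, thresholds):
    """Return {threshold: number of matches with score >= threshold}."""
    return {t: sum(1 for m in matches if m["score"] >= t) for t in thresholds}


# Made-up sample results for illustration only (not real GLEIF records).
sample_matches = [
    {"original_name": "Example Alpha Inc", "score": 95.0},
    {"original_name": "Example Alpha Holdings", "score": 82.5},
    {"original_name": "Alpha Example LLC", "score": 64.0},
]

counts = count_by_threshold(sample_matches, [90, 80, 70, 60])
for threshold, count in counts.items():
    print(f"Min score {threshold}: {count} matches")
```

Note that a small limit truncates the list before counting, so a local count like this is only comparable to repeated queries when limit is large enough to include every match above the lowest threshold.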
API Reference
CompanyMatcher
The main class for company matching operations.
Constructor
CompanyMatcher(parquet_path=None, category=None): Initialize with optional path or specific category
Methods
- match_best(query, limit=3, min_score=70): Find best matches using combined strategy (loads data automatically)
- match_canonical(query, limit=3, min_score=70): Find matches preserving legal suffixes (loads data automatically)
- match_brief(query, limit=3, min_score=70): Find matches removing legal suffixes (loads data automatically)
- show_available_categories(): Display all available categories with descriptions
- list_available_categories(): Return list of available category names
Parameters
- query: Company name to search for
- limit: Maximum number of results to return
- min_score: Minimum score threshold (0-100)
Returns
List of match dictionaries containing:
- original_name: Company name from GLEIF database
- score: Match confidence score
- lei: Legal Entity Identifier
- country: Country code
- canonical_name: Standardized company name
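For code that consumes these results, the documented keys can be written down as a typed dictionary. This is a sketch, not part of the package: the Match class, the format_match helper, and the value types (float for score, str for everything else) are assumptions, and the example record uses placeholder values rather than a real LEI.

```python
from typing import TypedDict


class Match(TypedDict):
    """Assumed shape of one match dictionary, based on the keys listed above."""
    original_name: str
    score: float
    lei: str
    country: str
    canonical_name: str


def format_match(match: Match) -> str:
    """Render one match as a single summary line."""
    return (
        f"{match['original_name']} [{match['country']}] "
        f"LEI={match['lei']} score={match['score']:.1f}"
    )


# Placeholder record for illustration only; the LEI is not a real identifier.
example: Match = {
    "original_name": "Example Co",
    "score": 87.5,
    "lei": "00000000000000000000",
    "country": "US",
    "canonical_name": "example co",
}

print(format_match(example))
```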
Data Sources
GoodGLEIF automatically detects and uses the best available data source:
- Partitioned Files: GitHub-friendly partitioned parquet files (preferred)
- Single Parquet File: Fallback to single large parquet file
- Package Resources: Embedded data if distributed
- Error Messages: Helpful guidance if data is missing
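The fallback order above can be pictured as a simple first-match search. The sketch below is not GoodGLEIF's actual loader: the function name resolve_data_source, the directory name "partitions", the file name "gleif.parquet", and the packaged argument are all hypothetical and chosen only to show the ordering.

```python
from pathlib import Path
from typing import Optional


def resolve_data_source(base_dir: Path, packaged: Optional[Path] = None) -> Path:
    """Return the first available source, checked in the documented order."""
    # 1. Partitioned parquet files (preferred). Directory name is hypothetical.
    partitioned = base_dir / "partitions"
    if partitioned.is_dir() and any(partitioned.glob("*.parquet")):
        return partitioned

    # 2. Single large parquet file (fallback). File name is hypothetical.
    single = base_dir / "gleif.parquet"
    if single.is_file():
        return single

    # 3. Data embedded in the installed package, if a path to it was supplied.
    if packaged is not None and packaged.exists():
        return packaged

    # 4. Nothing found: fail with guidance rather than a bare error.
    raise FileNotFoundError(
        f"No GLEIF data found under {base_dir}. "
        "Expected partitioned parquet files or a single parquet file."
    )
```

Checking the cheapest and most specific source first keeps the common case fast, and raising one descriptive error at the end gives the user a single place to learn what was searched.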
Examples
All examples ship with the package as callable functions. Their source is at: https://github.com/microprediction/goodgleif/tree/main/goodgleif/examples
# Run examples directly
from goodgleif.examples.basic_matching_example import basic_matching_example
from goodgleif.examples.matching_strategies_example import matching_strategies_example
from goodgleif.examples.score_thresholds_example import score_thresholds_example
from goodgleif.examples.simple_usage_example import simple_usage_example
from goodgleif.examples.exchange_matching_example import exchange_matching_example
# Run with custom parameters
matches = basic_matching_example("Tesla", limit=5, min_score=85)
strategies = matching_strategies_example("Microsoft Corporation")
Development
Running Examples
# Run individual examples
python -m goodgleif.examples.basic_matching_example
python -m goodgleif.examples.simple_usage_example
python -m goodgleif.examples.comprehensive_example
python -m goodgleif.examples.lei_extraction_example
python -m goodgleif.examples.matching_strategies_example
python -m goodgleif.examples.score_thresholds_example
python -m goodgleif.examples.exchange_matching_example
# Run all examples test suite
python -m goodgleif.examples.run_all_examples
Testing
# Run all tests
pytest
# Run example tests specifically
pytest tests/goodgleif/examples/
Requirements
- Python >= 3.9
- pandas >= 2.1
- pyarrow >= 14.0
- rapidfuzz >= 3.6
- platformdirs >= 4.2
- pyyaml >= 6.0.1
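pip normally enforces these minimums at install time, but in a shared or pinned environment it can be useful to check them by hand. The following is a rough sketch using only the standard library; the helper names are made up, and the version comparison looks only at leading numeric components, so it ignores pre-release tags and is not a full implementation of Python's version ordering rules.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the list above.
MINIMUMS = {
    "pandas": "2.1",
    "pyarrow": "14.0",
    "rapidfuzz": "3.6",
    "platformdirs": "4.2",
    "pyyaml": "6.0.1",
}


def version_tuple(text: str) -> tuple:
    """Parse leading numeric components, e.g. '2.1.0rc1' -> (2, 1, 0)."""
    parts = []
    for piece in text.split("."):
        found = re.match(r"\d+", piece)
        if not found:
            break
        parts.append(int(found.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically after padding both to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements() -> dict:
    """Map each requirement to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```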
License
See LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Support
For issues and questions, please use the GitHub issue tracker.
File details
Details for the file goodgleif-0.5.4.tar.gz.
File metadata
- Download URL: goodgleif-0.5.4.tar.gz
- Upload date:
- Size: 49.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 75a474ccee490256be697f1713e3c5b0b9182a946df268462bb6167cd83d07e5 |
| MD5 | 0a01a9caacc5af0451f0554aec8be7a2 |
| BLAKE2b-256 | 3313d0b58edf7ae5b981ab50c253a75d53c32976a468ef5f21cb791e8c9c9b01 |
File details
Details for the file goodgleif-0.5.4-py3-none-any.whl.
File metadata
- Download URL: goodgleif-0.5.4-py3-none-any.whl
- Upload date:
- Size: 49.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c03db8e04bc9520df2ce642cd83cda0f7e6532c7de0c2fc37d70e4b19c313064 |
| MD5 | fc20ea00caa23a51134e81cfc3c23497 |
| BLAKE2b-256 | 326e12e59853431b8db8c3545a27266de2e6921e204a1d112df142647c9fed85 |