Entity database for organizations, people, roles, and locations with embedding search
corp-entity-db
Entity database library and search engine for organizations, people, roles, and locations. Provides embedding-based semantic search over entities imported from GLEIF, SEC Edgar, Wikidata, and Companies House.
Installation
# Default: search and resolve (no build dependencies)
pip install corp-entity-db
# With database build/import support
pip install "corp-entity-db[build]"
# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"
# With remote client (EntityDBClient)
pip install "corp-entity-db[client]"
# Everything
pip install "corp-entity-db[all]"
The default install includes sentence-transformers, USearch, and huggingface_hub for searching and downloading pre-built databases. The embedding model (google/embeddinggemma-300m, 300M params) is downloaded automatically on first use.
Quick Start
# Download the lite database + USearch indexes
corp-entity-db download
# Search organizations
corp-entity-db search "Microsoft"
# Search people (composite embeddings: name + role + org)
corp-entity-db search-people "Tim Cook"
corp-entity-db search-people "Tim Cook" --role CEO --org Apple
# Show database statistics
corp-entity-db status
Python API
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path
# Search organizations
db = OrganizationDatabase(get_database_path())
embedder = CompanyEmbedder()
matches = db.search(embedder.embed("Microsoft"), top_k=10)
for record, score in matches:
    print(f"{record.name} ({record.entity_type}) - score: {score:.3f}")
# Search people with composite embeddings + name fallback + identity fallback
from corp_entity_db import PersonDatabase, get_person_database
person_db = get_person_database()
query_emb = embedder.embed_composite_person("Tim Cook", role="CEO", org="Apple")
matches = person_db.search(
    query_emb,
    top_k=5,
    query_name="Tim Cook",
    embedder=embedder,
    query_role="CEO",
    query_org="Apple",
)
Server Mode
Keep models warm in memory for low-latency repeated searches (requires [serve] extra):
corp-entity-db serve # Start on localhost:8222
corp-entity-db serve --port 9000 # Custom port
Data Sources
| Source | Description | Scale |
|---|---|---|
| Wikidata | Organizations, people, roles & locations | ~1.6M orgs, ~28M people, ~700K locations, ~180K roles |
| GLEIF | Legal Entity Identifier records | ~2.6M orgs |
| SEC Edgar | US public company filers & officers | ~73K orgs |
| Companies House | UK registered companies | ~5.5M orgs |
Embedding Architecture
All embeddings (organizations and people) exist only in USearch HNSW indexes, never in SQLite. They are generated on-the-fly during index building using google/embeddinggemma-300m. Int8 scalar quantization is computed during index building and is not stored separately.
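To make the idea of int8 scalar quantization concrete, here is a minimal sketch in pure Python. It assumes symmetric scaling of an L2-normalized vector by 127; the scheme USearch actually applies during index building is not described here and may differ. The helper names `quantize_int8` and `dequantize_int8` are hypothetical and not part of corp-entity-db.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def quantize_int8(vec, scale=127):
    """Map components in [-1, 1] to integers in [-scale, scale].

    Assumption: symmetric scaling by 127; the real index may use another scheme.
    """
    return [max(-scale, min(scale, round(x * scale))) for x in vec]


def dequantize_int8(codes, scale=127):
    """Approximate inverse of quantize_int8."""
    return [c / scale for c in codes]


unit = l2_normalize([0.3, -1.2, 0.5, 2.0])
codes = quantize_int8(unit)
restored = dequantize_int8(codes)

# Worst-case rounding error per component is half a quantization step.
max_error = max(abs(a - b) for a, b in zip(unit, restored))
print(codes, f"max error: {max_error:.5f}")
```

Under this assumption each component costs one byte instead of four, at the price of a per-component error of at most half a quantization step (0.5 / 127).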
People (Dual-Index Search): People use two USearch HNSW indexes:
- Primary composite index (people_usearch_v5.bin, 768-dim): Name, role, and organization are embedded as separate 256-dim vectors using Matryoshka truncation, independently L2-normalized, weighted (name=8, role=1, org=4), and concatenated. Only indexes people with org associations. This gives AND-style matching: a poor match on organization cannot be compensated by a good match on name, enabling precise queries like "find the CEO named Tim Cook at Apple." Built by build_people_composite_index().
- Secondary identity index (people_identity_usearch_v5.bin, 256-dim): Name-only embeddings with Matryoshka truncation to 256 dims for all people. Consulted as fallback when composite search and SQL name lookup fail. Built by build_people_identity_index().
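The composite vector construction described above (truncate, normalize, weight, concatenate) can be sketched roughly as follows. This is an illustration, not the library's implementation: `toy_embedding` stands in for real model output, `build_composite` is a hypothetical helper, and the sketch assumes the weights are applied as plain multipliers to both the indexed and the query vectors.

```python
import math

SEGMENT_DIM = 256
# Weights from the description above; applying them as plain multipliers is an assumption.
WEIGHTS = {"name": 8.0, "role": 1.0, "org": 4.0}


def truncate_and_normalize(vec, dim=SEGMENT_DIM):
    """Matryoshka-style truncation to the first `dim` values, then L2 normalization."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return [0.0] * len(head)
    return [x / norm for x in head]


def build_composite(name_vec, role_vec, org_vec):
    """Concatenate weighted, unit-length name/role/org segments into one 768-dim vector."""
    composite = []
    for key, vec in (("name", name_vec), ("role", role_vec), ("org", org_vec)):
        segment = truncate_and_normalize(vec)
        composite.extend(WEIGHTS[key] * x for x in segment)
    return composite


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_embedding(seed, dim=768):
    """Hypothetical stand-in for a model embedding (deterministic, not meaningful)."""
    return [math.sin(seed * (i + 1)) for i in range(dim)]


# Same name and role, different organization.
person = build_composite(toy_embedding(1.0), toy_embedding(2.0), toy_embedding(3.0))
query = build_composite(toy_embedding(1.0), toy_embedding(2.0), toy_embedding(3.5))

org_cosine = dot(
    truncate_and_normalize(toy_embedding(3.0)),
    truncate_and_normalize(toy_embedding(3.5)),
)
print(len(person), dot(person, query), 64 + 1 + 16 * org_cosine)
```

Under these assumptions the inner product of two composite vectors equals the sum of the per-field cosines weighted by the squared weights (64 for name, 1 for role, 16 for org), and every composite vector has norm 9, since 8² + 1² + 4² = 81.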
Search accuracy: 100% acc@1 and 100% acc@20 on 280 queries across 12 person types (100-200ms per query after model warmup). Search uses a three-tier fallback: the composite HNSW index first, then a SQL name_normalized lookup, and finally the identity HNSW index. The SQL lookup uses corp-names for normalization and disambiguates candidates by blending description similarity (40%), name Levenshtein (45%), and popularity via log-scaled canon_size (15%). It also supports multiple descriptions by trying alternative role/org combinations within canonical groups.
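The 40/45/15 disambiguation blend can be illustrated with a small sketch. Only the three weights come from the description above; how the library normalizes the Levenshtein distance and log-scales canon_size is not stated, so `name_similarity` and `popularity` below are assumptions, and all function names are hypothetical.

```python
import math


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Assumption: edit distance normalized by the longer name, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def popularity(canon_size, max_canon_size):
    """Assumption: log1p-scaled canon_size relative to the largest group."""
    if max_canon_size <= 0:
        return 0.0
    return math.log1p(max(canon_size, 0)) / math.log1p(max_canon_size)


def blended_score(description_similarity, query_name, candidate_name, canon_size, max_canon_size):
    """Blend the three signals with the 40/45/15 weights from the text."""
    return (
        0.40 * description_similarity
        + 0.45 * name_similarity(query_name, candidate_name)
        + 0.15 * popularity(canon_size, max_canon_size)
    )


# Made-up candidates for illustration only.
candidates = [
    {"name": "tim cooke", "description_similarity": 0.2, "canon_size": 3},
    {"name": "tim cook", "description_similarity": 0.9, "canon_size": 40},
]
largest = max(c["canon_size"] for c in candidates)
ranked = sorted(
    candidates,
    key=lambda c: blended_score(
        c["description_similarity"], "tim cook", c["name"], c["canon_size"], largest
    ),
    reverse=True,
)
print([c["name"] for c in ranked])
```

Because the weights sum to 1 and each signal is kept in [0, 1] here, the blended score also stays in [0, 1], which makes candidates directly comparable.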
Database Variants
- Lite (default download): record column stripped, name_normalized kept on all tables
- Full: Includes all columns with source record metadata
In both variants, all embeddings exist only in versioned USearch index files (organizations_usearch_v5.bin, people_usearch_v5.bin, people_identity_usearch_v5.bin), never in SQLite.
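If you need to tell the two variants apart programmatically, one option is to check whether the record column is present. The sketch below uses only the standard sqlite3 module against in-memory stand-in tables; the table name `organizations` and the other column names are assumptions for illustration, not the library's documented schema.

```python
import sqlite3


def has_column(conn, table, column):
    """Return True if `table` has a column named `column`.

    `table` is interpolated into the PRAGMA statement, so pass trusted names only.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


def detect_variant(conn, table="organizations"):
    """Classify a database as 'full' or 'lite' by the presence of the record column."""
    return "full" if has_column(conn, table, "record") else "lite"


# Hypothetical stand-ins for the two variants (table layout is assumed).
lite = sqlite3.connect(":memory:")
lite.execute(
    "CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT, name_normalized TEXT)"
)

full = sqlite3.connect(":memory:")
full.execute(
    "CREATE TABLE organizations "
    "(id INTEGER PRIMARY KEY, name TEXT, name_normalized TEXT, record TEXT)"
)

print(detect_variant(lite), detect_variant(full))
```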
Database Management
corp-entity-db migrate # Migrate schema to latest (v5)
corp-entity-db post-import # Build USearch indexes + VACUUM
corp-entity-db build-index # Rebuild all USearch HNSW indexes
corp-entity-db build-identity-index # Rebuild only the people identity index
corp-entity-db normalize-people # Normalize people names using corp-names
corp-entity-db normalize-orgs # Normalize organization names using corp-names
corp-entity-db reclassify-people # Recalculate person_type classifications
corp-entity-db people-test # Run people search accuracy test
corp-entity-db org-test # Run organization search accuracy test
HuggingFace dataset: Corp-o-Rate-Community/entity-references