Entity database for organizations, people, roles, and locations with embedding search
corp-entity-db
Entity database library and search engine for organizations, people, roles, and locations. Provides embedding-based semantic search over entities imported from GLEIF, SEC Edgar, Wikidata, and Companies House.
Installation
# Default: search and resolve (no build dependencies)
pip install corp-entity-db
# With database build/import support
pip install "corp-entity-db[build]"
# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"
# With remote client (EntityDBClient)
pip install "corp-entity-db[client]"
# Everything
pip install "corp-entity-db[all]"
The default install includes sentence-transformers, USearch, and huggingface_hub for searching and downloading pre-built databases. The embedding model (google/embeddinggemma-300m, 300M params) is downloaded automatically on first use.
Quick Start
# Download the lite database + USearch indexes
corp-entity-db download
# Search organizations
corp-entity-db search "Microsoft"
corp-entity-db search "Microsoft" --hybrid
# Search people (composite embeddings: name + role + org)
corp-entity-db search-people "Tim Cook"
corp-entity-db search-people "Tim Cook" --role CEO --org Apple
# Show database statistics
corp-entity-db status
Python API
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path
# Search organizations
db = OrganizationDatabase(get_database_path())
embedder = CompanyEmbedder()
matches = db.search(embedder.embed("Microsoft"), top_k=10)
for record, score in matches:
    print(f"{record.name} ({record.entity_type}) - score: {score:.3f}")
# Search people with composite embeddings + identity fallback
from corp_entity_db import PersonDatabase, get_person_database
from corp_entity_db.store import format_person_query
person_db = get_person_database()
query_emb = embedder.embed_composite_person("Tim Cook", role="CEO", org="Apple")
identity_emb = embedder.embed_for_identity_index(format_person_query("Tim Cook", person_type="executive"))
matches = person_db.search(query_emb, top_k=5, identity_query_embedding=identity_emb)
Server Mode
Keep models warm in memory for low-latency repeated searches (requires [serve] extra):
corp-entity-db serve # Start on localhost:8222
corp-entity-db serve --port 9000 # Custom port
Data Sources
| Source | Description | Scale |
|---|---|---|
| Wikidata | Organizations & notable people | ~1.5M orgs, ~13.2M people |
| GLEIF | Legal Entity Identifier records | ~2.6M orgs |
| SEC Edgar | US public company filers & officers | ~73K orgs |
| Companies House | UK registered companies | ~5.5M orgs |
Embedding Architecture
Organizations: Embeddings are stored as 768-dim float32 BLOBs in the organizations table. The full database enforces NOT NULL on the embedding column. Int8 scalar quantization is computed on-the-fly during USearch HNSW index building and is not stored separately.
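To make the storage figures concrete, the following is a minimal sketch of how a 768-dim float32 vector maps to a BLOB and how a simple int8 scalar quantization could look. It is an illustration only: the helper names (`to_blob`, `from_blob`, `quantize_int8`), the little-endian byte order, and the symmetric [-127, 127] mapping are assumptions, not the library's or USearch's actual implementation.

```python
import math
import struct

DIM = 768  # embedding width stated in the text


def to_blob(vec):
    """Pack floats as little-endian float32 bytes (byte order is an assumption)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def from_blob(blob):
    """Unpack float32 bytes back into a list of Python floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def quantize_int8(unit_vec):
    """Map each component of a unit vector from [-1, 1] to an integer in [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in unit_vec]


# Deterministic stand-in for a real embedding, for illustration only.
vec = l2_normalize([math.sin(i + 1) for i in range(DIM)])
blob = to_blob(vec)                   # what a float32 BLOB column would hold
codes = quantize_int8(from_blob(blob))  # computed on the fly, never stored

print(len(blob), len(codes))  # 3072 768
```

Each float32 BLOB takes 768 × 4 = 3072 bytes, while the quantized form needs one byte per dimension, which is why quantizing only at index-build time keeps the index compact without duplicating data in SQLite.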
People (Dual-Index Search): People use two USearch HNSW indexes, both generated on-the-fly during index building (no embeddings stored in SQLite):
- Primary composite index (people_usearch.bin, 768-dim): Name, role, and organization are embedded as separate 256-dim vectors using Matryoshka truncation, independently L2-normalized, weighted (name=8, role=1, org=4), and concatenated. This gives AND-style matching: a poor match on organization cannot be compensated by a good match on name, enabling precise queries like "find the CEO named Tim Cook at Apple." Built by build_people_composite_index().
- Secondary identity index (people_identity_usearch.bin, 256-dim): Natural language descriptions (e.g. "Taylor Swift, an artist", "Tim Cook, a CEO of Apple") embedded with Matryoshka truncation to 256 dims. Consulted as fallback when composite scores are below threshold (0.75). This improves accuracy for identity-defined people (artists, athletes, media, activists) who lack role/org context and would otherwise waste 512 of 768 composite dims as zeros. Built by build_people_identity_index().
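The composite layout described above can be sketched in a few lines of plain Python. This is a hedged illustration, not the code behind build_people_composite_index(): the function names, the use of the weights as plain scalar multipliers, and the `fake_embedding` stand-in for real model output are all assumptions.

```python
import math

SUB_DIM = 256  # per-field width after Matryoshka truncation
WEIGHTS = {"name": 8.0, "role": 1.0, "org": 4.0}  # weights quoted in the text


def truncate_and_normalize(vec, dim=SUB_DIM):
    """Keep the first `dim` components, then L2-normalize them."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else [0.0] * dim


def composite(name_vec, role_vec=None, org_vec=None):
    """Concatenate weighted 256-dim blocks into one 768-dim vector.

    A missing field contributes a block of zeros (assumption based on the
    text's remark about unused dims for people without role/org context).
    """
    parts = []
    for key, vec in (("name", name_vec), ("role", role_vec), ("org", org_vec)):
        if vec is None:
            parts.extend([0.0] * SUB_DIM)
        else:
            parts.extend(WEIGHTS[key] * x for x in truncate_and_normalize(vec))
    return parts


def fake_embedding(seed, dim=768):
    """Deterministic stand-in for a model embedding, for illustration only."""
    return [math.sin(seed * (i + 1)) for i in range(dim)]


full = composite(fake_embedding(1), fake_embedding(2), fake_embedding(3))
name_only = composite(fake_embedding(1))

print(len(full))                            # 768
print(any(x != 0.0 for x in name_only[256:]))  # False
```

Because each block is normalized before weighting, the inner product between two such vectors is a weighted sum of three per-field cosine similarities, so a weak organization match lowers the total regardless of how well the name matches. The `name_only` case shows the zero-filled blocks that motivate the separate identity index.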
Search accuracy: 82.5% acc@1, 91.4% acc@20 on 280 queries across 14 person types (60-80ms per query after model warmup), with identity fallback improving accuracy for identity-defined types.
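The identity fallback can be pictured as a threshold check on the best composite score. The sketch below is an assumption-laden illustration: `search_with_fallback`, the callable-based interface, and the merge rule (keep the higher score per person) are hypothetical, and the library may combine the two result sets differently. Only the 0.75 threshold comes from the description above.

```python
COMPOSITE_THRESHOLD = 0.75  # fallback threshold quoted in the text


def search_with_fallback(composite_hits, identity_search, top_k=5):
    """Return composite hits, consulting the identity index only when needed.

    composite_hits: list of (person_id, score), sorted best first.
    identity_search: zero-argument callable returning the same shape; it is
        called only if the best composite score is below the threshold.
    """
    best = composite_hits[0][1] if composite_hits else 0.0
    if best >= COMPOSITE_THRESHOLD:
        return composite_hits[:top_k]

    # Hypothetical merge rule: keep the higher of the two scores per person.
    merged = dict(composite_hits)
    for person_id, score in identity_search():
        merged[person_id] = max(score, merged.get(person_id, float("-inf")))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


confident = [("p1", 0.91), ("p2", 0.60)]
weak = [("p3", 0.52), ("p4", 0.40)]


def identity():
    return [("p5", 0.88), ("p3", 0.70)]


print(search_with_fallback(confident, identity))  # [('p1', 0.91), ('p2', 0.6)]
print(search_with_fallback(weak, identity))       # [('p5', 0.88), ('p3', 0.7), ('p4', 0.4)]
```

Passing the identity lookup as a callable means the second index is queried only on the fallback path. Scores from the two indexes are compared directly here for simplicity; whether they are directly comparable in practice is not stated in the source.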
Database Variants
- Lite (default download): Organization embedding column dropped, uses pre-built USearch HNSW indexes for search
- Full: Includes float32 embedding BLOBs in the organizations table
In both variants, people embeddings exist only in USearch index files (people_usearch.bin and people_identity_usearch.bin), never in SQLite.
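One way to tell the two variants apart programmatically is to check whether the organizations table still has its embedding column. The sketch below uses in-memory SQLite tables with a deliberately simplified, hypothetical schema; the column name `embedding` and the helper `has_org_embeddings` are assumptions, so verify them against the real database before relying on this.

```python
import sqlite3


def has_org_embeddings(conn, table="organizations", column="embedding"):
    """Return True if `table` has a column named `column`.

    PRAGMA table_info yields one row per column; index 1 is the column name.
    """
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return column in cols


# Simplified stand-ins for the two variants (hypothetical schemas).
full = sqlite3.connect(":memory:")
full.execute(
    "CREATE TABLE organizations ("
    "id INTEGER PRIMARY KEY, name TEXT, embedding BLOB NOT NULL)"
)

lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT)")

print(has_org_embeddings(full))  # True
print(has_org_embeddings(lite))  # False
```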
Database Management
corp-entity-db post-import # Generate embeddings + build USearch indexes + VACUUM
corp-entity-db build-index # Rebuild all USearch HNSW indexes
corp-entity-db build-index --identity-only # Rebuild only the people identity index
corp-entity-db repair-embeddings # Generate missing embeddings + rebuild indexes
corp-entity-db migrate-embeddings # Migrate from legacy vec0 tables to embedding column
HuggingFace dataset: Corp-o-Rate-Community/entity-references