Entity database for organizations, people, roles, and locations with embedding search

corp-entity-db

Entity database library and search engine for organizations, people, roles, and locations. Provides embedding-based semantic search over entities imported from GLEIF, SEC Edgar, Wikidata, and Companies House.

Installation

# Default: search and resolve (no build dependencies)
pip install corp-entity-db

# With database build/import support
pip install "corp-entity-db[build]"

# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"

# With remote client (EntityDBClient)
pip install "corp-entity-db[client]"

# Everything
pip install "corp-entity-db[all]"

The default install includes sentence-transformers, USearch, and huggingface_hub for searching and downloading pre-built databases. The embedding model (google/embeddinggemma-300m, 300M params) is downloaded automatically on first use.

Quick Start

# Download the lite database + USearch indexes
corp-entity-db download

# Search organizations
corp-entity-db search "Microsoft"

# Search people (composite embeddings: name + role + org)
corp-entity-db search-people "Tim Cook"
corp-entity-db search-people "Tim Cook" --role CEO --org Apple

# Show database statistics
corp-entity-db status

Python API

from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path

# Search organizations
db = OrganizationDatabase(get_database_path())
embedder = CompanyEmbedder()
matches = db.search(embedder.embed("Microsoft"), top_k=10)
for record, score in matches:
    print(f"{record.name} ({record.entity_type}) - score: {score:.3f}")

# Search people with composite embeddings + name fallback + identity fallback
from corp_entity_db import PersonDatabase, get_person_database
person_db = get_person_database()
query_emb = embedder.embed_composite_person("Tim Cook", role="CEO", org="Apple")
matches = person_db.search(
    query_emb,
    top_k=5,
    query_name="Tim Cook",
    embedder=embedder,
    query_role="CEO",
    query_org="Apple",
)

Server Mode

Keep models warm in memory for low-latency repeated searches (requires [serve] extra):

corp-entity-db serve                  # Start on localhost:8222
corp-entity-db serve --port 9000      # Custom port

Data Sources

Source | Description | Scale
Wikidata | Organizations, people, roles & locations | ~1.6M orgs, ~28M people, ~700K locations, ~180K roles
GLEIF | Legal Entity Identifier records | ~2.6M orgs
SEC Edgar | US public company filers & officers | ~73K orgs
Companies House | UK registered companies | ~5.5M orgs

Embedding Architecture

All embeddings (organizations and people) exist only in USearch HNSW indexes, never in SQLite. They are generated on-the-fly during index building using google/embeddinggemma-300m. Int8 scalar quantization is computed during index building and is not stored separately.
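
To make the storage trade-off concrete, here is a minimal sketch of int8 scalar quantization for an L2-normalized vector. It is a generic illustration, not USearch's internal scheme, which may differ; the helper names quantize_int8 and dequantize_int8 are hypothetical.

```python
def quantize_int8(unit_vec):
    """Map components in [-1, 1] to integer codes in [-127, 127]."""
    # Assumption: plain symmetric scaling by 127 with clamping.
    return [max(-127, min(127, round(x * 127))) for x in unit_vec]


def dequantize_int8(codes):
    """Approximate the original components from the integer codes."""
    return [c / 127 for c in codes]


vec = [0.6, -0.8, 0.0]          # already unit length
codes = quantize_int8(vec)
restored = dequantize_int8(codes)

# Worst-case rounding error per component is half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, max_error)
```

At one byte per dimension, a 768-dim vector takes 768 bytes instead of the 3,072 bytes needed for float32.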

People (Dual-Index Search): People use two USearch HNSW indexes:

  • Primary composite index (people_usearch_v5.bin, 768-dim): Name, role, and organization are each embedded as a separate 256-dim vector using Matryoshka truncation, independently L2-normalized, weighted (name=8, role=1, org=4), and concatenated (3 × 256 = 768 dims). Only people with organization associations are indexed. This gives AND-style matching: a good match on name cannot compensate for a poor match on organization, which enables precise queries like "find the CEO named Tim Cook at Apple." Built by build_people_composite_index().
  • Secondary identity index (people_identity_usearch_v5.bin, 256-dim): Name-only embeddings, truncated to 256 dims with Matryoshka truncation, covering all people. Consulted as a fallback when both composite search and SQL name lookup fail. Built by build_people_identity_index().
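
The composite construction described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's implementation: it assumes the weights are applied as direct multipliers on each normalized sub-vector and that the concatenated vector is not re-normalized. The helper names (truncate_and_normalize, composite_vector, cosine) are hypothetical, and random vectors stand in for real model output.

```python
import math
import random

SUB_DIM = 256  # per-field width after Matryoshka truncation (from the text)
WEIGHTS = {"name": 8.0, "role": 1.0, "org": 4.0}  # field weights (from the text)


def truncate_and_normalize(vec, dim=SUB_DIM):
    """Keep the first `dim` components, then L2-normalize them."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # leave an all-zero vector untouched
    return [x / norm for x in head]


def composite_vector(name_vec, role_vec, org_vec):
    """Concatenate weighted unit sub-vectors in name, role, org order."""
    parts = []
    for field, vec in (("name", name_vec), ("role", role_vec), ("org", org_vec)):
        unit = truncate_and_normalize(vec)
        # Assumption: the weight multiplies the unit vector directly.
        parts.extend(WEIGHTS[field] * x for x in unit)
    return parts


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Stand-in embeddings; a real pipeline would take these from the model.
rng = random.Random(0)
fake = lambda: [rng.gauss(0.0, 1.0) for _ in range(768)]
name, role, org_a, org_b = fake(), fake(), fake(), fake()

same_org = composite_vector(name, role, org_a)
other_org = composite_vector(name, role, org_b)
print(len(same_org))  # 768
print(cosine(same_org, other_org))
```

Under these assumptions the cosine between two composite vectors works out to (64·cos_name + 1·cos_role + 16·cos_org) / 81, so a perfect name and role match paired with an unrelated organization (cos_org near 0) tops out at roughly 65/81 ≈ 0.80. That ceiling is one way to read the AND-style behavior.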

Search accuracy: 100% acc@1 and 100% acc@20 on 280 queries across 12 person types, at 100-200 ms per query after model warmup.

Search uses a three-tier fallback: composite HNSW → SQL name_normalized lookup → identity HNSW. The SQL tier works as follows:

  • Names are normalized using corp-names.
  • Disambiguation blends description similarity (40%), name Levenshtein (45%), and popularity via log-scaled canon_size (15%).
  • Multi-description support tries alternative role/org combinations within canonical groups.
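
A rough sketch of the disambiguation blend and the tiered fallback follows. Only the 40/45/15 weights and the tier order come from the text above; everything else is an assumption made for illustration, including turning Levenshtein distance into a similarity by dividing by the longer name's length, scaling popularity as log1p(canon_size) / log1p(max_canon_size), and treating an empty result list as a tier "failing". All function names are hypothetical.

```python
import math

W_DESCRIPTION, W_NAME, W_POPULARITY = 0.40, 0.45, 0.15  # from the text


def levenshtein(a, b):
    """Classic edit distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def name_similarity(a, b):
    """Assumed mapping of edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def popularity(canon_size, max_canon_size):
    """Assumed log scaling of canon_size into [0, 1]."""
    if max_canon_size <= 0:
        return 0.0
    return min(1.0, math.log1p(canon_size) / math.log1p(max_canon_size))


def blended_score(description_sim, query_name, candidate_name,
                  canon_size, max_canon_size):
    """Weighted blend of the three disambiguation signals."""
    return (W_DESCRIPTION * description_sim
            + W_NAME * name_similarity(query_name, candidate_name)
            + W_POPULARITY * popularity(canon_size, max_canon_size))


def search_with_fallback(tiers):
    """Run tiers in order and return the first non-empty result list."""
    for tier in tiers:
        results = tier()
        if results:
            return results
    return []


# Each tier is a zero-argument callable; these stubs stand in for the
# composite HNSW search, the SQL lookup, and the identity HNSW search.
hits = search_with_fallback([lambda: [], lambda: ["sql hit"], lambda: ["identity hit"]])
print(hits)
print(blended_score(0.9, "tim cook", "tim cooke", canon_size=12, max_canon_size=50))
```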

Database Variants

  • Lite (default download): the record column is stripped; name_normalized is kept on all tables.
  • Full: includes all columns, with source record metadata.

In both variants, all embeddings exist only in versioned USearch index files (organizations_usearch_v5.bin, people_usearch_v5.bin, people_identity_usearch_v5.bin), never in SQLite.
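
Because search depends on these index files being present alongside the SQLite database, a quick pre-flight check can be handy. The sketch below uses the file names listed above. It assumes all three files sit in one directory, which the text does not confirm, and missing_index_files is a hypothetical helper rather than part of the package.

```python
from pathlib import Path

# Versioned index file names, as listed in the text.
INDEX_FILES = (
    "organizations_usearch_v5.bin",
    "people_usearch_v5.bin",
    "people_identity_usearch_v5.bin",
)


def missing_index_files(directory):
    """Return the expected index file names not found in `directory`."""
    base = Path(directory)
    return [name for name in INDEX_FILES if not (base / name).is_file()]


# Example: check the current directory (adjust to wherever the data lives).
print(missing_index_files("."))
```

If the list is non-empty, running corp-entity-db download or corp-entity-db build-index is the likely next step.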

Database Management

corp-entity-db migrate                 # Migrate schema to latest (v5)
corp-entity-db post-import             # Build USearch indexes + VACUUM
corp-entity-db build-index             # Rebuild all USearch HNSW indexes
corp-entity-db build-identity-index    # Rebuild only the people identity index
corp-entity-db normalize-people        # Normalize people names using corp-names
corp-entity-db normalize-orgs          # Normalize organization names using corp-names
corp-entity-db reclassify-people       # Recalculate person_type classifications
corp-entity-db people-test             # Run people search accuracy test
corp-entity-db org-test                # Run organization search accuracy test

HuggingFace dataset: Corp-o-Rate-Community/entity-references
