Entity database for organizations, people, roles, and locations with embedding search

corp-entity-db

Entity database library and search engine for organizations, people, roles, and locations. Provides embedding-based semantic search over entities imported from GLEIF, SEC EDGAR, Wikidata, and Companies House.

Installation

# Default: search and resolve (no build dependencies)
pip install corp-entity-db

# With database build/import support
pip install "corp-entity-db[build]"

# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"

# With remote client (EntityDBClient)
pip install "corp-entity-db[client]"

# Everything
pip install "corp-entity-db[all]"

The default install includes sentence-transformers, USearch, and huggingface_hub for searching and downloading pre-built databases. The embedding model (google/embeddinggemma-300m, 300M params) is downloaded automatically on first use.

Quick Start

# Download the lite database + USearch indexes
corp-entity-db download

# Search organizations
corp-entity-db search "Microsoft"
corp-entity-db search "Microsoft" --hybrid

# Search people (composite embeddings: name + role + org)
corp-entity-db search-people "Tim Cook"
corp-entity-db search-people "Tim Cook" --role CEO --org Apple

# Show database statistics
corp-entity-db status

Python API

from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path

# Search organizations
db = OrganizationDatabase(get_database_path())
embedder = CompanyEmbedder()
matches = db.search(embedder.embed("Microsoft"), top_k=10)
for record, score in matches:
    print(f"{record.name} ({record.entity_type}) - score: {score:.3f}")

# Search people with composite embeddings + identity fallback
from corp_entity_db import PersonDatabase, get_person_database
from corp_entity_db.store import format_person_query
person_db = get_person_database()
query_emb = embedder.embed_composite_person("Tim Cook", role="CEO", org="Apple")
identity_emb = embedder.embed_for_identity_index(format_person_query("Tim Cook", person_type="executive"))
matches = person_db.search(query_emb, top_k=5, identity_query_embedding=identity_emb)

Server Mode

Keep models warm in memory for low-latency repeated searches (requires [serve] extra):

corp-entity-db serve                  # Start on localhost:8222
corp-entity-db serve --port 9000      # Custom port

Data Sources

| Source | Description | Scale |
|---|---|---|
| Wikidata | Organizations & notable people | ~1.5M orgs, ~13.2M people |
| GLEIF | Legal Entity Identifier records | ~2.6M orgs |
| SEC EDGAR | US public company filers & officers | ~73K orgs |
| Companies House | UK registered companies | ~5.5M orgs |

Embedding Architecture

Organizations: Embeddings are stored as 768-dim float32 BLOBs in the organizations table. The full database enforces NOT NULL on the embedding column. Int8 scalar quantization is computed on the fly while the USearch HNSW index is built; the quantized vectors are not stored separately.
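The storage and quantization steps described above can be sketched with NumPy and the standard-library sqlite3 module. This is a minimal illustration, not the library's implementation: the table schema beyond the NOT NULL embedding column is hypothetical, the helper names are invented for this example, and the symmetric "scale by 127 and round" scheme is one common form of int8 scalar quantization that may differ from what USearch does internally.

```python
import sqlite3

import numpy as np

DIM = 768  # embedding width stated in the text


def to_blob(vec: np.ndarray) -> bytes:
    """Serialize a vector as raw float32 bytes (768 * 4 = 3072 bytes)."""
    return np.asarray(vec, dtype=np.float32).tobytes()


def from_blob(blob: bytes) -> np.ndarray:
    """Decode a float32 BLOB back into a 1-D array."""
    return np.frombuffer(blob, dtype=np.float32)


def quantize_int8(vec: np.ndarray) -> np.ndarray:
    """Assumed scheme: L2-normalize, scale to [-127, 127], round to int8."""
    unit = vec / np.linalg.norm(vec)
    return np.clip(np.round(unit * 127.0), -127, 127).astype(np.int8)


# Hypothetical minimal schema; only the NOT NULL embedding column mirrors the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE organizations ("
    "id INTEGER PRIMARY KEY, name TEXT, embedding BLOB NOT NULL)"
)

# A random vector stands in for a real model embedding.
rng = np.random.default_rng(0)
vec = rng.standard_normal(DIM).astype(np.float32)
conn.execute(
    "INSERT INTO organizations (name, embedding) VALUES (?, ?)",
    ("Example Corp", to_blob(vec)),
)

(blob,) = conn.execute("SELECT embedding FROM organizations").fetchone()
restored = from_blob(blob)
quantized = quantize_int8(restored)  # computed at index-build time, never stored
```

Under this scheme the quantized vector takes a quarter of the space of the float32 original, which is why it is attractive for the in-memory index while SQLite keeps the full-precision copy.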

People (Dual-Index Search): People use two USearch HNSW indexes, both generated on-the-fly during index building (no embeddings stored in SQLite):

  • Primary composite index (people_usearch.bin, 768-dim): Name, role, and organization are embedded as separate 256-dim vectors using Matryoshka truncation, independently L2-normalized, weighted (name=8, role=1, org=4), and concatenated. This gives AND-style matching: a poor match on organization cannot be compensated by a good match on name, enabling precise queries like "find the CEO named Tim Cook at Apple." Built by build_people_composite_index().
  • Secondary identity index (people_identity_usearch.bin, 256-dim): Natural language descriptions (e.g. "Taylor Swift, an artist", "Tim Cook, a CEO of Apple") embedded with Matryoshka truncation to 256 dims. Consulted as a fallback when composite scores fall below the threshold (0.75). This improves accuracy for identity-defined people (artists, athletes, media, activists) who lack role/org context and would otherwise leave 512 of the 768 composite dims as wasted zeros. Built by build_people_identity_index().
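The composite construction described in the first bullet can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code inside build_people_composite_index(): the function names are invented for the example, random vectors stand in for model outputs, missing fields are assumed to become zero segments (as the second bullet implies), and queries are assumed to be built the same way as stored records.

```python
import numpy as np

SEGMENT_DIM = 256  # Matryoshka truncation width per field
WEIGHTS = {"name": 8.0, "role": 1.0, "org": 4.0}  # weights stated in the text


def truncate_and_normalize(full_vec, dim=SEGMENT_DIM):
    """Keep the first `dim` components, then L2-normalize the result."""
    head = np.asarray(full_vec, dtype=np.float32)[:dim]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head


def composite(name_vec, role_vec=None, org_vec=None):
    """Concatenate weighted 256-dim segments into one 768-dim vector."""
    parts = []
    for field, vec in (("name", name_vec), ("role", role_vec), ("org", org_vec)):
        if vec is None:
            # Assumption: an absent field contributes an all-zero segment.
            parts.append(np.zeros(SEGMENT_DIM, dtype=np.float32))
        else:
            parts.append(WEIGHTS[field] * truncate_and_normalize(vec))
    return np.concatenate(parts)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random 768-dim vectors stand in for real embedding-model outputs.
rng = np.random.default_rng(0)
name, role, org, other_org = (rng.standard_normal(768) for _ in range(4))

record = composite(name, role, org)
query = composite(name, role, other_org)  # same name and role, different org

# Per-field cosine for the one field that differs.
org_cos = cosine(truncate_and_normalize(other_org), truncate_and_normalize(org))
# With every segment unit-normalized before weighting, the composite cosine is
# the average of per-field cosines weighted by the squared weights (64, 1, 16).
expected = (64 * 1.0 + 1 * 1.0 + 16 * org_cos) / 81
score = cosine(query, record)
```

Because each segment is normalized before weighting, each field's influence on the final cosine score is fixed by its squared weight, independent of how large the raw model outputs happen to be.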

Search accuracy: 82.5% acc@1 and 91.4% acc@20 on 280 queries across 14 person types, at 60-80 ms per query after model warmup. The identity fallback improves accuracy for identity-defined types.
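The threshold-based fallback between the two indexes can be sketched in plain Python. The source gives only the threshold (0.75) and the fact that the identity index is consulted when composite scores fall below it. Everything else here is an assumption made for illustration: the function name, the callable-based interface, the `(id, score)` hit shape, and the merge policy of keeping the best score per person. Scores from two different indexes may not be directly comparable, so a real implementation may merge differently.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (person id, similarity score), best first
COMPOSITE_THRESHOLD = 0.75  # threshold stated in the text


def search_with_fallback(
    composite_search: Callable[[int], List[Hit]],
    identity_search: Callable[[int], List[Hit]],
    top_k: int = 5,
    threshold: float = COMPOSITE_THRESHOLD,
) -> List[Hit]:
    """Query the composite index first; consult the identity index only
    when the best composite score is below `threshold`."""
    hits = composite_search(top_k)
    if hits and hits[0][1] >= threshold:
        return hits[:top_k]

    # Hypothetical merge policy: keep the highest score seen for each person.
    best: Dict[str, float] = {}
    for person_id, score in hits + identity_search(top_k):
        if score > best.get(person_id, float("-inf")):
            best[person_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Stub searchers standing in for the two USearch indexes.
weak_composite = lambda k: [("p1", 0.62), ("p2", 0.55)][:k]
identity = lambda k: [("p3", 0.88), ("p1", 0.70)][:k]
results = search_with_fallback(weak_composite, identity, top_k=3)
```

Passing the searchers as callables keeps the second index lookup lazy: when the composite match is already confident, the identity index is never queried.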

Database Variants

  • Lite (default download): The organization embedding column is dropped; search relies on the pre-built USearch HNSW indexes
  • Full: Includes float32 embedding BLOBs in the organizations table

In both variants, people embeddings exist only in USearch index files (people_usearch.bin and people_identity_usearch.bin), never in SQLite.
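One way to tell the two variants apart is to check whether the organizations table still has an embedding column. The sketch below uses only the standard-library sqlite3 module and makes two assumptions that the text does not confirm: that the column is literally named `embedding`, and that the helper name `has_org_embeddings` is free to invent for this example. The library may offer its own way to report the variant.

```python
import sqlite3


def has_org_embeddings(
    conn: sqlite3.Connection,
    table: str = "organizations",
    column: str = "embedding",  # assumed column name
) -> bool:
    """Return True if `table` has a column named `column` (full variant)."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # The table name is interpolated directly, so pass trusted values only.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return column in columns


# Two in-memory databases with hypothetical schemas stand in for the variants.
full_db = sqlite3.connect(":memory:")
full_db.execute(
    "CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT, embedding BLOB NOT NULL)"
)

lite_db = sqlite3.connect(":memory:")
lite_db.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT)")

is_full = has_org_embeddings(full_db)
is_lite = not has_org_embeddings(lite_db)
```

A check like this is useful before running code that reads raw float32 vectors from SQLite, since the lite variant can only be searched through the USearch index files.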

Database Management

corp-entity-db post-import             # Generate embeddings + build USearch indexes + VACUUM
corp-entity-db build-index             # Rebuild all USearch HNSW indexes
corp-entity-db build-index --identity-only  # Rebuild only the people identity index
corp-entity-db repair-embeddings       # Generate missing embeddings + rebuild indexes
corp-entity-db migrate-embeddings      # Migrate from legacy vec0 tables to embedding column

Hugging Face dataset: Corp-o-Rate-Community/entity-references
