
Typologist

Extract a categorical schema from a corpus of documents. Built on Toponymy and EVoC.

Status

Pre-alpha. The public API will change as we figure things out. See docs/design.md for the current contract.

What it does

You give it documents and their embeddings. It gives you back a handful of categorical facets and a per-document label for each. For example, run it on ~1000 arXiv ML papers and you'll typically get three facets (say contribution_type, primary_data_modality, and application_domain), each with 6-10 values, plus a DataFrame of per-doc labels you can join straight back onto the original corpus.

If you already have known metadata that you don't want rediscovered (existing category tags, publication year, source, whatever), you pass that in too and Typologist concept-erases it first via LEACE, so the facets it finds are orthogonal to what you already had.
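The core idea of concept erasure can be illustrated without Typologist at all. The sketch below is a simplified linear version (residualizing embeddings against one-hot metadata via least squares), not LEACE proper and not Typologist's internal code; all names here are illustrative. After erasure, no linear probe can recover the metadata from the embeddings, which is why discovery can't just rediscover that axis:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))     # document embeddings (n_docs, d)
z = rng.integers(0, 3, size=100)  # known categorical metadata, 3 classes
Z = np.eye(3)[z]                  # one-hot encode
Z = Z - Z.mean(axis=0)            # center
Xc = X - X.mean(axis=0)

# Project embeddings onto the metadata directions, then subtract that
# projection: the residual is linearly unpredictable from the metadata.
beta, *_ = np.linalg.lstsq(Z, Xc, rcond=None)
X_erased = Xc - Z @ beta

# A linear probe fit on the erased embeddings finds nothing:
probe, *_ = np.linalg.lstsq(Z, X_erased, rcond=None)
print(np.allclose(probe, 0, atol=1e-8))  # True
```

LEACE itself is stronger (it guarantees erasure with a minimal-norm oblique projection), but the intuition is the same: remove the component of the embedding space that predicts the known metadata, then discover facets in what remains.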

Install

Requires Python 3.11+.

uv add git+https://github.com/stevenfazzio/typologist.git
# or: pip install git+https://github.com/stevenfazzio/typologist.git

You'll also want:

  • ANTHROPIC_API_KEY in the environment (or your own LLM callable for each of the three roles; see below).

  • A sentence-embedding model that Toponymy can use internally for keyphrases and topic names. sentence-transformers with MiniLM is cheap and good enough for most use cases:

    uv pip install sentence-transformers
    

Quick start

import numpy as np
from sentence_transformers import SentenceTransformer
from typologist import Typologist

documents = [...]              # list[str], one per document
embeddings = np.array(...)     # shape (n_docs, d), float

t = Typologist(
    n_facets=3,
    topic_embedder=SentenceTransformer("all-MiniLM-L6-v2"),
).fit(documents, embeddings)

print(t.schema_)               # list[dict]: discovered facet definitions
print(t.labels_df_)            # (n_docs, n_facets) DataFrame of categorical labels

Discovery with metadata erasure

If your documents come with known metadata (source, category, year), you usually don't want Typologist to rediscover those axes. You want the facets it finds to be orthogonal to what you already have. Pass a metadata DataFrame and Typologist concept-erases those axes before running discovery.

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from typologist import Typologist

df = pd.read_parquet("arxiv_sample.parquet")   # title, abstract, primary_category, ...
embeddings = np.load("arxiv_cohere_v4.npy")    # shape (len(df), 1536)
docs = (df["title"] + "\n\n" + df["abstract"]).rename("document")

t = Typologist(
    n_facets=3,
    topic_embedder=SentenceTransformer("all-MiniLM-L6-v2"),
    object_description="scientific paper",
    corpus_description="machine-learning arxiv papers",
    random_state=0,
    verbose=True,
).fit(
    docs,
    embeddings,
    metadata=df[["primary_category"]],
)

# Join labels back onto the original DataFrame
df_labeled = df.join(t.labels_df_)

# Inspect the schema
for facet in t.schema_:
    print(f"{facet['name']} ({len(facet['values'])} values): {facet['definition']}")
    for value in facet["values"]:
        print(f"  - {value}")

# Cross-tab a discovered facet against held-out metadata
print(pd.crosstab(df_labeled[t.schema_[0]["name"]], df_labeled["primary_category"]))

Per-facet diagnostics (cluster counts, label entropy, exemplar documents) live on t.facet_diagnostics_.
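Label entropy is easy to sanity-check yourself from t.labels_df_ with plain pandas. The sketch below computes per-facet entropy independently of facet_diagnostics_; the facet name and label values are hypothetical. A facet whose entropy is near zero (one value dominates) is usually not pulling its weight:

```python
import numpy as np
import pandas as pd

# Stand-in for t.labels_df_: one column per discovered facet.
labels = pd.DataFrame(
    {"contribution_type": ["method", "method", "survey", "benchmark"]}
)

def label_entropy(s: pd.Series) -> float:
    """Shannon entropy (bits) of a categorical label column."""
    p = s.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

print(labels.apply(label_entropy))  # contribution_type -> 1.5 bits
```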

Reusing a discovered schema

Every facet entry stores its own labeling_prompt_template and labeling_model, so you can apply a schema to new documents without re-running discovery:

from typologist import apply_schema

new_labels = apply_schema(schema=t.schema_, documents=new_docs)

See docs/design.md for the full schema entry shape and apply_schema contract.

Performance

Per-document labeling runs through a threadpool (max_concurrency=10 by default). On 1000 docs with n_facets=3 you should see roughly 6-8 minutes end to end. Toponymy's cluster naming and the schema-synthesis LLM calls are still serial; full async is a 0.2 item.
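The threadpool pattern described above is just stdlib concurrent.futures; this hedged sketch shows the shape of it with a trivial stand-in for the LLM labeling call (label_document is hypothetical, not Typologist's API). Because LLM labeling is I/O-bound, ten threads give close to a 10x wall-clock speedup over serial calls:

```python
from concurrent.futures import ThreadPoolExecutor

def label_document(doc: str) -> str:
    # Stand-in for a per-document LLM labeling call.
    return "method" if "we propose" in doc else "other"

docs = ["we propose a new optimizer", "a survey of transformers"]

# pool.map preserves input order, so labels line up with docs.
with ThreadPoolExecutor(max_workers=10) as pool:
    labels = list(pool.map(label_document, docs))

print(labels)  # ['method', 'other']
```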

Related

Typologist is an independent project with no affiliation to the authors of the libraries it builds on (Toponymy, EVoC, LEACE).

If you want a 2D embedding projection with your Typologist labels on top, DataMapPlot is a natural match.

License

BSD-3-Clause. See LICENSE.
