# Typologist
Extract a categorical schema from a corpus of documents. Built on Toponymy and EVoC.
## Status
Pre-alpha. The public API will change as we figure things out. See docs/design.md for the current contract.
## What it does
You give it documents and their embeddings. It gives you back a handful of categorical facets and a per-document label for each. For example, run it on ~1000 arXiv ML papers and you'll typically get three facets (say `contribution_type`, `primary_data_modality`, and `application_domain`), each with 6-10 values, plus a DataFrame of per-doc labels you can join straight back onto the original corpus.

If you already have known metadata that you don't want rediscovered (existing category tags, publication year, source, whatever), you pass that in too and Typologist concept-erases it first via LEACE, so the facets it finds are orthogonal to what you already had.
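To build intuition for what "concept-erased" means here, the sketch below demonstrates the guarantee that linear erasure methods like LEACE provide: after erasure, no linear probe can recover the metadata from the embeddings. This is a minimal least-squares illustration with synthetic data, not Typologist's actual implementation (which uses the concept-erasure package):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 16, 3

# Embeddings with a planted, linearly decodable metadata signal.
Z = np.eye(k)[rng.integers(0, k, size=n)]               # one-hot metadata, (n, k)
X = rng.normal(size=(n, d)) + Z @ rng.normal(size=(k, d)) * 2.0

# Center both, then subtract the part of X predictable from Z by least squares.
Xc = X - X.mean(axis=0)
Zc = Z - Z.mean(axis=0)
beta, *_ = np.linalg.lstsq(Zc, Xc, rcond=None)          # (k, d) regression weights
X_erased = Xc - Zc @ beta

# After erasure the cross-covariance between metadata and embeddings is zero,
# so any downstream clustering can't rediscover the erased axis linearly.
cross_cov = Zc.T @ X_erased / n
assert np.abs(cross_cov).max() < 1e-8
```

LEACE itself computes the least-norm projection with this same zero-cross-covariance guarantee; the least-squares residual above is the simplest version of the idea.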
## Install
Requires Python 3.11+.
```bash
uv add git+https://github.com/stevenfazzio/typologist.git
# or: pip install git+https://github.com/stevenfazzio/typologist.git
```
You'll also want:

- `ANTHROPIC_API_KEY` in the environment (or your own LLM callable for each of the three roles; see below).
- A sentence-embedding model that Toponymy can use internally for keyphrases and topic names. `sentence-transformers` with MiniLM is cheap and good enough for most use cases: `uv pip install sentence-transformers`
## Quick start
```python
import numpy as np
from sentence_transformers import SentenceTransformer

from typologist import Typologist

documents = [...]           # list[str], one per document
embeddings = np.array(...)  # shape (n_docs, d), float

t = Typologist(
    n_facets=3,
    topic_embedder=SentenceTransformer("all-MiniLM-L6-v2"),
).fit(documents, embeddings)

print(t.schema_)     # list[dict]: discovered facet definitions
print(t.labels_df_)  # (n_docs, n_facets) DataFrame of categorical labels
```
## Discovery with metadata erasure
If your documents come with known metadata (source, category, year), you usually don't want Typologist to rediscover those axes. You want the facets it finds to be orthogonal to what you already have. Pass a metadata DataFrame and Typologist concept-erases those axes before running discovery.
```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

from typologist import Typologist

df = pd.read_parquet("arxiv_sample.parquet")  # title, abstract, primary_category, ...
embeddings = np.load("arxiv_cohere_v4.npy")   # shape (len(df), 1536)
docs = (df["title"] + "\n\n" + df["abstract"]).rename("document")

t = Typologist(
    n_facets=3,
    topic_embedder=SentenceTransformer("all-MiniLM-L6-v2"),
    object_description="scientific paper",
    corpus_description="machine-learning arxiv papers",
    random_state=0,
    verbose=True,
).fit(
    docs,
    embeddings,
    metadata=df[["primary_category"]],
)

# Join labels back onto the original DataFrame
df_labeled = df.join(t.labels_df_)

# Inspect the schema
for facet in t.schema_:
    print(f"{facet['name']} ({len(facet['values'])} values): {facet['definition']}")
    for value in facet["values"]:
        print(f"  - {value}")

# Cross-tab a discovered facet against held-out metadata
pd.crosstab(df_labeled[t.schema_[0]["name"]], df_labeled["primary_category"])
```
Per-facet diagnostics (cluster counts, label entropy, exemplar documents) live on `t.facet_diagnostics_`.
## Reusing a discovered schema

Every facet entry stores its own `labeling_prompt_template` and `labeling_model`, so you can apply a schema to new documents without re-running discovery:

```python
from typologist import apply_schema

new_labels = apply_schema(schema=t.schema_, documents=new_docs)
```

See `docs/design.md` for the full schema entry shape and the `apply_schema` contract.
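For orientation, here is a hypothetical schema entry built only from the keys this README mentions (`name`, `definition`, `values`, `labeling_prompt_template`, `labeling_model`); the field values, prompt template, and model name are illustrative placeholders, and `docs/design.md` remains the authoritative reference for the real shape:

```python
# Illustrative only: a plausible single facet entry. All values below are
# placeholders, not output from a real Typologist run.
facet = {
    "name": "contribution_type",
    "definition": "What kind of contribution the paper primarily makes.",
    "values": ["new method", "benchmark", "survey", "theory", "application"],
    "labeling_prompt_template": "Classify this {object_description}: {document}",  # placeholder
    "labeling_model": "some-model-id",  # placeholder
}

# A full schema is just a list of such entries, one per discovered facet.
schema = [facet]
assert all({"name", "definition", "values"} <= set(f) for f in schema)
```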
## Performance

Per-document labeling runs through a thread pool (`max_concurrency=10` by default). On 1000 docs with `n_facets=3` you should see roughly 6-8 minutes end to end. Toponymy's cluster naming and the schema-synthesis LLM calls are still serial; full async support is planned for 0.2.
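The concurrency pattern is the standard one for network-bound work; a generic sketch with a stubbed `label_one` (a hypothetical stand-in for the per-document LLM call, not Typologist's internals):

```python
from concurrent.futures import ThreadPoolExecutor

def label_one(doc: str) -> str:
    # Stand-in for a per-document LLM labeling request. Because the real call
    # is network-bound, threads spend most of their time waiting, so ten
    # workers overlap ten requests at a time.
    return "some_label"

docs = [f"document {i}" for i in range(100)]

# executor.map preserves input order, so labels line up with docs.
with ThreadPoolExecutor(max_workers=10) as executor:
    labels = list(executor.map(label_one, docs))

assert len(labels) == len(docs)
```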
## Related

Typologist is an independent project with no affiliation to the authors of the libraries it builds on:

- Toponymy: cluster naming and hierarchy
- EVoC: hierarchical clustering
- concept-erasure: LEACE implementation

If you want a 2D embedding projection with your Typologist labels on top, DataMapPlot is a natural match.
## License

BSD-3-Clause. See LICENSE.