A robust tool for the automated building of custom, domain-specific taxonomies using LLMs and GPU-accelerated clustering.

TaxonomyBuilder: Building Domain-specific Taxonomies from the Ground Up

A robust, high-performance Python framework for transforming massive, unstructured text datasets into structured, hierarchical taxonomies. Originally published as part of the CustomNLP4U 2026 paper "Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings".

🚀 Overview

TaxonomyBuilder bridges the gap between raw data and structured knowledge. It uses Sentence Transformers for semantic representation, RAPIDS (cuML) for GPU-accelerated clustering when a compatible GPU is available, and a user-selected LLM provider (OpenAI, Anthropic, etc.) for natural-language categorization and recursive hierarchy building. The end product is a semantically meaningful taxonomy built from the ground up!

🛠 Installation

# Core installation
pip install taxonomybuilder

# For GPU acceleration (requires CUDA); the quotes keep shells such as zsh from expanding the brackets
pip install "taxonomybuilder[gpu]"

📖 Quick Start: The Full Pipeline

TaxonomyBuilder is highly automated, yet every step still gives you room to inject your domain expertise.

Here is how to go from a list of raw strings to a multi-leveled hierarchy in minutes.

from TaxonomyBuilder import TaxonomyBuilder

# Step 1: Initialize (with GPU support) - make sure to specify your preferred sentence embedding model!
tb = TaxonomyBuilder(embedding_model_name="all-MiniLM-L6-v2", use_gpu=True)

# Step 2: Setup your preferred LLM Provider (currently supported: OpenAI, Google, Anthropic)
tb.set_llm(provider_name="openai", api_key="your-api-key", model_endpoint="gpt-4o-mini")

# Step 3: Ingest and Filter your Data. You are also encouraged to provide keywords to "anchor" the domain and filter out noise.
texts = ["Automate cloud backups to...", "Debug python scripts for...", "Develop machine learning solutions...", "Fix the broken coffee machine..."]  # ...plus the rest of your texts
keywords = ["Software Engineering", "DevOps", "Programming"]

(tb.ingest_data(texts, keywords=keywords)
   .encode(batch_size=16)
   .filter_by_domain(percentile=25)) # Drop 25% least relevant texts, according to your defined domain

# Step 4: Build the Bottom Level of your Taxonomy via Clustering (soft cluster = include "noise" points)
tb.fit_clusters(n_components=10, min_cluster_size=5, soft_cluster=True)

# Step 5: Configure Labeling and Add Examples
(tb.configure_labeling(name="Technical Task", definition="A specific action performed by an engineer.")
   .add_label_example(["Fixing a syntax error", "Refactoring a loop", "Making sense of spaghetti code"], "Code Debugging"))

# Step 6: Generate Labels & Consolidate
tb.label_clusters()
tb.consolidate_labels(similarity_threshold=0.95) # this removes redundant cluster labels - optional!

# Step 7: Build the Hierarchical Taxonomy
tb.build_hierarchy(stop_at=10, max_levels=5) # Stops when the top level has 10 or fewer categories, OR when five levels have been built

# Step 8: Export Results (also check get_report and to_dataframe for exporting base level results)
df = tb.to_hierarchy_dataframe()
df.to_csv("taxonomy_results.csv")

🧠 Key Features

⚡ GPU-Accelerated Clustering

If a compatible GPU is detected, TaxonomyBuilder automatically uses cuML for UMAP and HDBSCAN, allowing you to cluster millions of documents in seconds rather than hours.

🎯 Two-Fold Domain Filtering

Our relevance scoring keeps "off-topic" data from polluting your taxonomy. Every text is scored on two measures:

Mean Similarity: Average similarity to all keywords.
Max Similarity: Highest similarity to any single keyword.
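
The two scores above can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: it assumes cosine similarity between embeddings, combines the mean and max scores with equal weight, and uses hypothetical helper names (`relevance_scores`, `drop_lowest`) and toy 2-D vectors in place of real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance_scores(text_vecs, keyword_vecs):
    """Score each text by its mean and max similarity to the keywords."""
    scores = []
    for t in text_vecs:
        sims = [cosine(t, k) for k in keyword_vecs]
        mean_sim = sum(sims) / len(sims)   # average similarity to all keywords
        max_sim = max(sims)                # best match to any single keyword
        # Assumption: equal weighting; the real combination rule is not documented here.
        scores.append((mean_sim + max_sim) / 2)
    return scores


def drop_lowest(items, scores, percentile):
    """Drop the `percentile` percent of items with the lowest scores."""
    n_drop = int(len(items) * percentile / 100)
    ranked = sorted(range(len(items)), key=lambda i: scores[i])
    dropped = set(ranked[:n_drop])
    return [item for i, item in enumerate(items) if i not in dropped]


# Toy data: the last text points away from both keyword vectors.
texts = ["debug python", "deploy backups", "train model", "fix coffee machine"]
text_vecs = [[1.0, 0.1], [0.9, 0.3], [0.8, 0.2], [0.0, 1.0]]
keyword_vecs = [[1.0, 0.0], [0.9, 0.1]]

scores = relevance_scores(text_vecs, keyword_vecs)
kept = drop_lowest(texts, scores, percentile=25)
```

With four texts and `percentile=25`, exactly one text is dropped: the one least similar to the keywords.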

🌲 Recursive Hierarchical Logic

Unlike flat clustering, TaxonomyBuilder re-clusters the labels of the previous level to create a parent-child tree. It automatically switches prompts at the "Top Level" to ensure broad categories (e.g., "Operations") aren't labeled as granular tasks (e.g., "Password Reset").
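
The recursion described above can be sketched as a loop that keeps re-grouping the previous level's labels. This is a minimal sketch, not the library's code: `build_levels`, `group_fn`, and `label_fn` are hypothetical names, the grouping and labeling functions are toy stand-ins for clustering and LLM labeling, and it assumes that `max_levels` counts the base level.

```python
def build_levels(base_labels, group_fn, label_fn, stop_at=10, max_levels=5):
    """Re-group each level's labels into parents until a stop rule triggers.

    group_fn(labels) -> list of groups (each a list of child labels)
    label_fn(group, is_top) -> parent label; is_top signals the top-level prompt
    """
    levels = [list(base_labels)]
    parents = {}  # child label -> parent label
    while len(levels) < max_levels and len(levels[-1]) > stop_at:
        groups = group_fn(levels[-1])
        if len(groups) >= len(levels[-1]):
            break  # nothing was merged; stop instead of looping forever
        # The new level is the top level if it satisfies either stop rule.
        is_top = len(groups) <= stop_at or len(levels) + 1 == max_levels
        new_level = []
        for group in groups:
            parent = label_fn(group, is_top)
            new_level.append(parent)
            for child in group:
                parents[child] = parent
        levels.append(new_level)
    return levels, parents


# Toy stand-ins: pair up neighbours instead of clustering, join names instead of calling an LLM.
def pair_up(labels):
    return [labels[i:i + 2] for i in range(0, len(labels), 2)]


def join_label(group, is_top):
    return ("TOP:" if is_top else "") + "+".join(group)


levels, parents = build_levels(list("abcdefgh"), pair_up, join_label, stop_at=2, max_levels=5)
```

Here eight base labels collapse to four and then to two, at which point the `stop_at=2` rule ends the loop and the last level is labeled with the top-level variant.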

📁 Project Structure

TaxonomyBuilder/
└── src/TaxonomyBuilder/
    ├── core.py           # Main Logic
    ├── clustering.py     # GPU/CPU Dispatcher (UMAP/HDBSCAN)
    ├── data.py           # PyTorch Dataset & Dataloaders
    ├── llm.py            # LLM Provider Interface
    └── prompt_utils.py   # Dynamic Template & Few-shot Logic

💡 Tips and Hints

Domain Context: We highly recommend seeding the process with domain-specific keywords. Additionally, make sure to add examples for the labeling process (max. 3)!
Memory Management: If you have a massive dataset, set a lower batch_size in .encode() to avoid out-of-memory issues.
Consolidation: If your taxonomy has too many similar-sounding categories, lower the similarity_threshold in consolidate_labels to group more labels together.
The "Noise": Any text marked as -1 by HDBSCAN will be labeled as noise (-1) unless soft_cluster=True is used. It is up to you whether you want to include these points or not!
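
To see why lowering similarity_threshold merges more labels, here is a hedged sketch of threshold-based consolidation. It is not how consolidate_labels is implemented: the greedy strategy, the `consolidate` helper, and the toy 2-D vectors standing in for label embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consolidate(label_vecs, similarity_threshold):
    """Greedily map each label to the first kept label that is similar enough."""
    kept = []      # representative labels, in first-seen order
    mapping = {}   # label -> representative label
    for label, vec in label_vecs.items():
        for rep in kept:
            if cosine(vec, label_vecs[rep]) >= similarity_threshold:
                mapping[label] = rep
                break
        else:
            # No representative is close enough, so this label stays as its own category.
            kept.append(label)
            mapping[label] = label
    return mapping


# Toy vectors standing in for label embeddings.
label_vecs = {
    "Code Debugging": [1.0, 0.0],
    "Debugging Code": [0.99, 0.14],
    "Cloud Backups": [0.0, 1.0],
    "Bug Fixing": [0.9, 0.44],
}

strict = consolidate(label_vecs, similarity_threshold=0.95)
loose = consolidate(label_vecs, similarity_threshold=0.85)
```

At 0.95 only the near-duplicate "Debugging Code" is merged, leaving three categories; at 0.85 "Bug Fixing" is also folded into "Code Debugging", leaving two.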

If you use or build upon TaxonomyBuilder, we would appreciate it if you cited the original work:

Bib entry coming soon!

Version 1.0.0, released May 20, 2026

Download files

Source Distribution

taxonomybuilder-1.0.0.tar.gz (20.1 kB), uploaded May 20, 2026, Source

Built Distribution

taxonomybuilder-1.0.0-py3-none-any.whl (16.8 kB), uploaded May 20, 2026, Python 3

File details

Details for the file taxonomybuilder-1.0.0.tar.gz.

File metadata

Download URL: taxonomybuilder-1.0.0.tar.gz
Upload date: May 20, 2026
Size: 20.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for taxonomybuilder-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`f647a191b70d18b202b4e0206fe843b506576e77ce59431af47a9e5e2d41b356`
MD5	`e4ab5df9c1d63f084d33333402b9c59f`
BLAKE2b-256	`3d6c507e0dcfb71f2089029a445497598866c3a4b1a8969a07f63a2eb174c806`

File details

Details for the file taxonomybuilder-1.0.0-py3-none-any.whl.

File metadata

Download URL: taxonomybuilder-1.0.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 16.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for taxonomybuilder-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cd9dbf4a8ec33645c94bbdb89effee1ad025b9e4e87af372265f2022bd915163`
MD5	`7eefea5a61c60dccce971e9c13ad7f8f`
BLAKE2b-256	`c90b94f7617afcbaf2bc444d70c8bcc8beb37527fa1b399e980a89a27a4c93c4`
