Thematic Search
Thematic Search is a Python package for thematic search on document collections with a hierarchical topic model. It lets you find the most specific topic covering a set of documents, navigate a topic hierarchy, and chain semantic and thematic queries together.
Full documentation is available on ReadTheDocs.
Installation
This is an alpha release; install from source:
git clone git@github.com:kalebruscitti/thematic-search.git
pip install ./thematic-search
Basic Usage
What you need
To initialize a TopicDatabase you need:
- embedding_vectors: an (n_docs, d) float array of document embeddings
- cluster_tree: a dictionary {node: [children]} representing your topic hierarchy, where nodes can be any hashable labels (strings, ints, etc.)
- cluster_layers: a list of (n_docs, n_clusters) float arrays with values in [0, 1], one per layer, where cluster_layers[l][j, i] is the inclusion strength of document j in the i-th cluster at layer l
Optionally:
- topic_metadata: a DataFrame with a row for each node in cluster_tree, indexed by the same node labels
- document_metadata: a DataFrame with a row for each document
- reduced_vectors: an (n_docs, 2) array of low-dimensional vectors for visualisation
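As a rough illustration of the shapes listed above, the following sketch builds toy inputs with NumPy. The sizes, node labels and random values are invented for the example. Whether leaves need explicit empty entries in the tree, and how cluster columns map to nodes, are assumptions not confirmed by this README.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, d = 6, 8  # toy sizes, chosen arbitrarily

# (n_docs, d) float array of document embeddings
embedding_vectors = rng.normal(size=(n_docs, d))

# Topic hierarchy as {node: [children]}; the labels are hypothetical.
# Listing leaves with empty child lists is an assumption.
my_tree = {
    "science": ["physics", "biology"],
    "physics": [],
    "biology": [],
}

# One (n_docs, n_clusters) array per layer, values in [0, 1].
# Here layer 0 holds the two leaf topics and layer 1 the single root;
# the column-to-node correspondence is assumed for illustration.
layer0 = rng.uniform(size=(n_docs, 2))
layer1 = rng.uniform(size=(n_docs, 1))
cluster_layers = [layer0, layer1]

# Optional (n_docs, 2) coordinates for visualisation
reduced_vectors = rng.normal(size=(n_docs, 2))
```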
Converting your cluster tree
The convert_tree utility converts your tree from arbitrary node labels into the internal format required by SoftClusterTree, and returns a cluster_labels mapping that allows TopicDatabase to automatically align your topic_metadata:
from thematic_search.utilities import convert_tree
cluster_tree, cluster_labels = convert_tree(my_tree)
Layers are assigned automatically (leaves at layer 0, each internal node one layer above its deepest child), or you can supply a custom layers dictionary.
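The automatic layer rule can be sketched in a few lines of plain Python. This is not the package's implementation: assign_layers is a hypothetical helper, and it reads "deepest child" as the child with the highest layer.

```python
def assign_layers(tree, root):
    """Return {node: layer}: leaves at layer 0, each internal node
    one layer above the highest layer among its children."""
    layers = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            layers[node] = 0
        else:
            layers[node] = 1 + max(visit(child) for child in children)
        return layers[node]

    visit(root)
    return layers


# Hypothetical tree with branches of unequal depth
my_tree = {
    "science": ["physics", "biology"],
    "physics": ["optics"],
}
layers = assign_layers(my_tree, "science")
# optics and biology are leaves (layer 0), physics is at layer 1,
# and science is at layer 2 because its deepest child is physics.
```

Under this reading, a node's layer depends only on its deepest branch, so biology stays at layer 0 even though its sibling physics is at layer 1.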
Initializing a TopicDatabase
from thematic_search import TopicDatabase, SoftClusterTree
topicdb = TopicDatabase(
    SoftClusterTree(cluster_layers, cluster_tree),
    embedding_vectors=embedding_vectors,
    reduced_vectors=reduced_vectors,  # optional
    sample_df=document_metadata,      # optional
    topic_df=topic_metadata,          # indexed by your node labels
    cluster_labels=cluster_labels,    # from convert_tree
)
If you want to use topicdb.q.search(), you will also need to provide an embedding_model — a SentenceTransformer model matching the one used to produce embedding_vectors — either at construction time or by setting topicdb.embedding_model before calling search().
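To see why the model has to match, note that a query vector is only comparable to the rows of embedding_vectors if both live in the same embedding space. The sketch below is a minimal illustration using NumPy only; nearest_documents is a hypothetical helper, the vectors are hand-written stand-ins for model output, and the use of cosine similarity is an assumption rather than a statement about how the package ranks documents.

```python
import numpy as np


def nearest_documents(query_vector, embedding_vectors, k=3):
    """Indices of the k rows most cosine-similar to query_vector."""
    if query_vector.shape[-1] != embedding_vectors.shape[1]:
        # A different model would typically show up here first
        raise ValueError("query and document embeddings differ in dimension")
    norms = np.linalg.norm(embedding_vectors, axis=1, keepdims=True)
    docs = embedding_vectors / norms
    query = query_vector / np.linalg.norm(query_vector)
    similarity = docs @ query
    return np.argsort(-similarity)[:k]


# Toy stand-ins for document and query embeddings
embedding_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
query_vector = np.array([0.9, 0.1, 0.0])
top = nearest_documents(query_vector, embedding_vectors, k=2)
```

A dimension mismatch is the easy failure to catch. Two different models with the same output dimension would pass this check and still return meaningless neighbours.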
Querying
Queries are accessed via topicdb.q and are chainable. The full set of composable queries is given by the arrows in the schema diagram:
[Figure: schema diagram of composable queries]
Some examples:
# Documents nearest to a query string in embedding space
topicdb.q.neighbours("Advancements in space technology").metadata()
# Most specific topic covering those nearest neighbours
topicdb.q.neighbours("Advancements in space technology").theme().metadata()
# Documents inside a named topic with at least 75% inclusion strength
topicdb.q.topic_name("science").samples(min_strength=0.75).metadata()
# Chain queries: theme of documents inside the parent of a known topic
topicdb.q.topic_name("physics").parents().samples().theme().metadata()
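For intuition about what the inclusion-strength filter and the "most specific topic" query operate on, here is a rough NumPy sketch over the cluster_layers structure described earlier. Both functions are hypothetical and are not the package's implementation. In particular, the covering rule used here (every selected document must reach a threshold in a single cluster, searching from layer 0 upwards) is an assumption.

```python
import numpy as np


def samples_in_cluster(cluster_layers, layer, cluster, min_strength=0.75):
    """Indices of documents whose inclusion strength in the given
    cluster is at least min_strength."""
    strengths = cluster_layers[layer][:, cluster]
    return np.flatnonzero(strengths >= min_strength)


def most_specific_cover(cluster_layers, doc_indices, min_strength=0.5):
    """Lowest (layer, cluster) in which every selected document has
    inclusion strength >= min_strength, or None if no cluster qualifies."""
    for layer, strengths in enumerate(cluster_layers):  # layer 0 first
        covered = (strengths[doc_indices] >= min_strength).all(axis=0)
        hits = np.flatnonzero(covered)
        if hits.size:
            return layer, int(hits[0])
    return None


# Toy data: 4 documents, two leaf clusters at layer 0, one root at layer 1
cluster_layers = [
    np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.6, 0.4]]),
    np.array([[1.0], [1.0], [1.0], [1.0]]),
]

strong = samples_in_cluster(cluster_layers, layer=0, cluster=0)
narrow = most_specific_cover(cluster_layers, [0, 1])
broad = most_specific_cover(cluster_layers, [0, 2])
```

In this toy data, documents 0 and 1 share a leaf cluster, so the cover stays at layer 0. Documents 0 and 2 sit in different leaves, so the search falls back to the root at layer 1.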
Toponymy Integration
Thematic Search is designed to work out-of-the-box with topic models generated by Toponymy. Given a fitted Toponymy object, the from_topic_model class method handles the conversion directly:
from toponymy.serialization import TopicModel
from thematic_search import TopicDatabase
topic_model = TopicModel.from_toponymy(toponymy, sample_df=my_document_metadata)
topicdb = TopicDatabase.from_topic_model(topic_model)
Download files
File details
Details for the file thematic_search-0.1.0.tar.gz.
File metadata
- Download URL: thematic_search-0.1.0.tar.gz
- Upload date:
- Size: 24.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3651efec7ad2ab0ab88b1ab9c4effd88f904ac03346fe208da5ededf3482b8dc |
| MD5 | 7cb0aca9af6e79613efb58c32389eabc |
| BLAKE2b-256 | e6a20649e7799d94ba467f8c8015a4446dfaa154566c271b594662b89966c67f |
File details
Details for the file thematic_search-0.1.0-py3-none-any.whl.
File metadata
- Download URL: thematic_search-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c95655af3f1d52acb49fa00aea0bfe3b5169e228a71578836b37612e897ec8bb |
| MD5 | 7eb26c0b39d5aced9ddc13c1de16cdc0 |
| BLAKE2b-256 | 268a0311f0a8d9f974ee1e42fc652430e457f39004ebcb598a954b11f198ffce |