A topic modelling Python package designed for analysing one-to-many question-answer data, e.g. free-text survey responses.

ThemeFinder

ThemeFinder is a topic modelling Python package designed for analysing one-to-many question-answer data (e.g. survey responses and public consultations). See the docs for more info.

Important: This is an incubation project, so we don't recommend using it for critical use cases yet. We are currently in a research stage, trialling the tool for case studies across the Civil Service. Find out more about our projects at https://ai.gov.uk/.

Quickstart

Install using your package manager of choice

For example, pip install themefinder or uv add themefinder.

Usage

ThemeFinder takes as input a pandas DataFrame with two columns:

  • response_id: A unique identifier for each response
  • response: The free text survey response
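
As a minimal sketch of preparing data in this shape, the helper below turns a list of raw answers into rows with the two expected keys. The function name build_response_records and the choice of sequential string IDs are assumptions made for illustration; they are not part of ThemeFinder.

```python
def build_response_records(texts):
    """Turn raw answer strings into rows with response_id and response keys.

    Blank answers are skipped; IDs are sequential strings starting at "1".
    (Hypothetical helper for illustration, not a ThemeFinder function.)
    """
    records = []
    for text in texts:
        cleaned = text.strip()
        if not cleaned:
            continue  # skip empty answers so every row has text to analyse
        records.append({"response_id": str(len(records) + 1), "response": cleaned})
    return records


records = build_response_records(["It's great.", "   ", "I need to trial it more."])
print(records)
```

With pandas installed, pd.DataFrame(records) would give a DataFrame with the response_id and response columns described above.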

ThemeFinder now supports a range of language models through structured outputs.

The function find_themes identifies common themes in responses and labels them. It also outputs results from intermediate steps in the theme-finding pipeline.

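For this example, install the following Python packages in your virtual environment: pandas and langchain (the example imports AzureChatOpenAI from the langchain-openai integration package). asyncio is part of the Python standard library. Install themefinder as described above.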

If you are using environment variables (e.g. for API keys), you can use python-dotenv to read variables from a .env file.

If you are using an Azure OpenAI endpoint, you will need the following variables:

  • AZURE_OPENAI_API_KEY
  • AZURE_OPENAI_ENDPOINT
  • OPENAI_API_VERSION
  • DEPLOYMENT_NAME
  • AZURE_OPENAI_BASE_URL

Otherwise you will need whichever variables LangChain requires for your LLM of choice.
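
Before running the example, it can help to confirm that the variables are actually set. The following is a minimal sketch using only the standard library; the helper name missing_env_vars is an assumption for illustration, and the variable names are the Azure OpenAI ones listed above.

```python
import os

# Variable names taken from the Azure OpenAI list above
REQUIRED_AZURE_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_VERSION",
    "DEPLOYMENT_NAME",
    "AZURE_OPENAI_BASE_URL",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_AZURE_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

If you load settings with python-dotenv, call load_dotenv() before running this check so that values from the .env file are visible in os.environ.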

import asyncio
from dotenv import load_dotenv
import pandas as pd
from langchain_openai import AzureChatOpenAI
from themefinder import find_themes

# If needed, load LLM API settings from .env file
load_dotenv()

# Initialise your LLM of choice using langchain
llm = AzureChatOpenAI(
    model="gpt-4o",
    temperature=0,
)

# Set up your data
responses_df = pd.DataFrame({
    "response_id": ["1", "2", "3", "4", "5"],
    "response": [
        "I think it's awesome, I can use it for consultation analysis.",
        "It's great.",
        "It's a good approach to topic modelling.",
        "I'm not sure, I need to trial it more.",
        "I don't like it so much.",
    ],
})

# Add your question
question = "What do you think of ThemeFinder?"

# Make the system prompt specific to your use case 
system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."

# Run the function to find themes. LLM endpoints are queried asynchronously with asyncio, so find_themes must be awaited.
async def main():
    result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
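
To illustrate why the call is awaited, here is a small standard-library sketch of running several slow requests concurrently with asyncio. The coroutine fake_llm_call is a hypothetical stand-in for a network-bound LLM request; it does not show how ThemeFinder batches or sends its requests.

```python
import asyncio


async def fake_llm_call(batch_id: int) -> str:
    # Stand-in for a network-bound LLM request (hypothetical, not a ThemeFinder function)
    await asyncio.sleep(0.01)
    return f"batch {batch_id} done"


async def run_batches(n: int) -> list:
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_llm_call(i) for i in range(n)))


results = asyncio.run(run_batches(3))
print(results)
```

Because the waits overlap, the three calls finish in roughly the time of one, which is the benefit of awaiting an asynchronous function rather than calling endpoints one after another.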

ThemeFinder pipeline

ThemeFinder's pipeline consists of five core stages and one optional stage, each using a specialised LLM prompt:

Sentiment analysis

  • Analyses the emotional tone and position of each response using sentiment-focused prompts
  • Provides structured sentiment categorisation based on LLM analysis

Theme generation

  • Uses exploratory prompts to identify initial themes from response batches
  • Groups related responses for better context through guided theme extraction

Theme condensation

  • Employs comparative prompts to combine similar or overlapping themes
  • Reduces redundancy in identified topics through systematic theme evaluation

Theme refinement

  • Leverages standardisation prompts to normalise theme descriptions
  • Creates clear, consistent theme definitions through structured refinement

Theme target alignment

  • Optional step to consolidate themes down to a target number

Theme mapping

  • Uses classification prompts to map individual responses to refined themes
  • Supports multiple theme assignments per response through detailed analysis
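
Because a response can be mapped to more than one theme, theme counts need not sum to the number of responses. The sketch below illustrates this with a hypothetical mapping from response_id to theme labels; the data, labels, and the helper count_themes are assumptions for illustration and do not reflect ThemeFinder's actual output format.

```python
from collections import Counter

# Hypothetical mapping output: response_id -> assigned theme labels (illustrative only)
response_themes = {
    "1": ["Useful for consultations", "Positive sentiment"],
    "2": ["Positive sentiment"],
    "3": ["Positive sentiment"],
    "4": ["Needs more trialling"],
    "5": ["Negative sentiment"],
}


def count_themes(mapping):
    """Count how many responses were assigned each theme."""
    counts = Counter()
    for labels in mapping.values():
        counts.update(set(labels))  # set() guards against a label repeated on one response
    return counts


theme_counts = count_themes(response_themes)
print(theme_counts.most_common(1))
```

Here five responses produce six theme assignments, because response "1" carries two labels.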

The prompts used at each stage can be found in src/themefinder/prompts/.

The file src/themefinder/core.py contains the function find_themes, which runs the pipeline. It also contains functions for each individual stage.

For more detail, see the docs: https://i-dot-ai.github.io/themefinder/.

Model Compatibility

ThemeFinder's structured output approach makes it compatible with a wide range of language models from various providers. This list is non-exhaustive, and other models may also work effectively:

OpenAI Models

  • GPT-4, GPT-4o, GPT-4.1
  • All Azure OpenAI deployments

Google Models

  • Gemini series (1.5 Pro, 2.0 Pro, etc.)

Anthropic Models

  • Claude series (Claude 3 Opus, Sonnet, Haiku, etc.)

Open Source Models

  • Llama 2, Llama 3
  • Mistral models (e.g., Mistral 7B, Mixtral)

Development

This project uses uv for dependency management.

# Clone and set up development environment
git clone https://github.com/i-dot-ai/themefinder.git
cd themefinder
uv venv
source .venv/bin/activate
uv pip sync requirements-dev.txt
uv pip install -e .

# Run tests
pytest tests/

# Run linting
pre-commit run --all-files

License

This project is licensed under the MIT License - see the LICENSE file for details.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

Feedback

Contact us with questions or feedback at packages@cabinetoffice.gov.uk.
