A theme extraction tool using LLMs

themex

⚠️ Caution: This package is under active development and is currently not stable. Interfaces, file structure, and behaviour may change without notice.

themex is a flexible, modular framework for running large language model (LLM) tasks, such as thematic extraction and sentiment analysis, in social care, health, and research contexts.

It supports both local HuggingFace models and remote APIs (such as Azure OpenAI), with configurable prompts, structured outputs, and logging.


📦 Installation

Install with Poetry from the repository root:

poetry install

Or install with pip in editable mode:

pip install -e .

📁 Project Structure

llm-theme-miner/
├── poetry.lock
├── pyproject.toml
├── README.md
└── themex/
    ├── __init__.py
    ├── llm_runner/          # Core logic for calling LLMs
    ├── logger.py            # Logging utilities
    ├── paths.py             # Default paths and file naming logic
    ├── prompts/             # Prompt template files
    └── utils.py             # General utility functions

🚀 Quick Start

This framework is designed for flexible and extensible usage. Below are two minimal working examples.

Example 1 - Using a local HuggingFace model

from multiprocessing import Process
from pathlib import Path

from themex.llm_runner import run_llm

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
sys_tmpl = Path("./prompts/system_prompt.txt")
user_tmpl = Path("./prompts/theming_sentiment.txt")

# The __main__ guard is required on platforms that spawn subprocesses (macOS, Windows).
if __name__ == "__main__":
    p = Process(target=run_llm, kwargs={
        "execution_mode": "local",
        "provider": "huggingface",
        "model_id": model_id,
        "inputs": ["This is an example comment."],
        "sys_tmpl": sys_tmpl,
        "user_tmpl": user_tmpl,
        "gen_args": {
            "temperature": 0.7,
            "max_new_tokens": 300
        },
        "output_filename": "output.csv",
        "csv_logger_filepath": "log.csv",
        # Extra fields made available to the prompt templates
        "extra_inputs": {
            "question": "What are the strengths and weaknesses in this case?",
            "domain": "Strength"
        }
    })
    p.start()
    p.join()  # wait for the run to finish

Example 2 - Using Azure OpenAI remotely

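from multiprocessing import Process
from pathlib import Path

from themex.llm_runner import run_llm

# The __main__ guard is required on platforms that spawn subprocesses (macOS, Windows).
if __name__ == "__main__":
    p = Process(target=run_llm, kwargs={
        "execution_mode": "remote",
        "provider": "azure",
        "model_id": "gpt-4.1",
        "api_version": "2025-01-01-preview",
        "inputs": ["Another example comment."],
        "sys_tmpl": Path("./prompts/system_prompt.txt"),
        "user_tmpl": Path("./prompts/theming_sentiment.txt"),
        "gen_args": {
            "temperature": 0.4,
        },
        "output_filename": "azure_output.csv",
        "csv_logger_filepath": "azure_log.csv",
        # Extra fields made available to the prompt templates
        "extra_inputs": {
            "question": "What are the strengths and weaknesses in this case?",
            "domain": "Strength"
        }
    })
    p.start()
    p.join()  # wait for the run to finish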

💡 Note on Multi-Process Execution

The examples use Python's multiprocessing.Process to run each task in a separate subprocess.

This is not mandatory, but can be helpful, particularly when using local models (e.g. with execution_mode="local").

Running in a subprocess ensures that memory (especially GPU memory) is fully released after the task completes, helping prevent memory leaks or out-of-memory errors during batch processing.

Feel free to adapt the structure for your own scheduling or orchestration needs.
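
As a minimal sketch of the batch pattern described above, the helper below runs each job in its own subprocess, one after another, and reports the exit code. The names run_isolated and fake_job are hypothetical and not part of themex; fake_job stands in for run_llm so that the sketch runs without a model.

```python
import multiprocessing as mp


def run_isolated(target, kwargs, start_method=None):
    """Run target(**kwargs) in a fresh subprocess and return its exit code."""
    # start_method=None selects the platform's default start method.
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=target, kwargs=kwargs)
    proc.start()
    proc.join()  # block until the child exits, so its memory is released
    return proc.exitcode


def fake_job(inputs, output_filename):
    """Hypothetical stand-in for run_llm; fails when given no inputs."""
    if not inputs:
        raise ValueError("no inputs given")


if __name__ == "__main__":
    batches = [["first comment"], ["second comment"]]
    for i, batch in enumerate(batches):
        code = run_isolated(
            fake_job,
            {"inputs": batch, "output_filename": f"out_{i}.csv"},
        )
        status = "ok" if code == 0 else f"failed (exit code {code})"
        print(f"batch {i}: {status}")
```

Running the jobs sequentially keeps at most one model in memory at a time. A non-zero exit code means the child raised an exception or was killed, so the caller can decide whether to retry or skip that batch.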


📄 Output Format (Example)

The framework saves structured outputs to CSV. Fields depend on prompt structure, but may include:

| comment_id | model_id | domain | topic | evidence | impact | root_cause | sentiment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | gpt-4.1 | Strength | Family Contact Support | ... | positive | | positive |

🧠 Field Definitions

  • evidence: A verbatim quote from the original input text that supports or illustrates the identified topic. It serves as direct justification for the theme.
  • root_cause: If the impact is "negative", this field provides a short explanatory phrase reflecting the likely underlying structural, procedural, or systemic cause of the issue. It is not a restatement of the evidence, but an inferred explanation.
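
Once a run has finished, the output CSV can be inspected with the standard library. The sketch below assumes a comma-delimited file with a header row and the columns from the example; the second row, including its topic and root cause, is invented purely for illustration, and load_rows and negative_root_causes are hypothetical helper names.

```python
import csv
import io
from collections import Counter

# Illustrative rows using the example columns; the values are made up.
sample_csv = """comment_id,model_id,domain,topic,evidence,impact,root_cause,sentiment
1,gpt-4.1,Strength,Family Contact Support,"...",positive,,positive
2,gpt-4.1,Strength,Staff Availability,"...",negative,Understaffing,negative
"""


def load_rows(handle):
    """Read every CSV row into a dict keyed by column name."""
    return list(csv.DictReader(handle))


def negative_root_causes(rows):
    """Count root causes across rows whose impact is negative."""
    return Counter(
        row["root_cause"]
        for row in rows
        if row["impact"] == "negative" and row["root_cause"]
    )


rows = load_rows(io.StringIO(sample_csv))
print(negative_root_causes(rows))
```

For a real run, replace the in-memory sample with open("output.csv", newline="", encoding="utf-8"), adjusting the filename to whatever was passed as output_filename.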

🧾 CSV Logger Output (Optional)

If csv_logger_filepath is specified, the framework will save an additional per-call log file capturing key runtime statistics, LLM behaviour, and inputs/outputs.

✅ When is it created?

  • Only when csv_logger_filepath is explicitly set in run_llm parameters
  • If omitted, no logger file is generated

📋 Example fields in the logger:

| Field | Example value |
| --- | --- |
| comment_id | id |
| context_len | |
| current_mem_MB | 1.57 |
| do_sample | |
| extra_fields | {"domain": "Strength"} |
| generated_token_len | 55 |
| increment_MB | 1.57 |
| input_len | 1 |
| input_token_len | 991 |
| max_new_tokens | |
| model_id | gpt-4.1 |
| output | |
| peak_mem_MB | 1.63 |
| raw_output | |
| system_prompt | |
| temperature | 0.2 |
| tokens_per_sec | 40.86 |
| torch_dtype | None |
| total_time_sec | 1.35 |
| user_prompt | |
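
The per-call log lends itself to quick run summaries. The following sketch assumes the log is a comma-delimited CSV with a header row and uses only three of the logger columns; the sample rows are invented, and summarise_log is a hypothetical helper rather than part of themex.

```python
import csv
import io
from statistics import mean

# Made-up log rows limited to three of the logger columns.
sample_log = """comment_id,tokens_per_sec,peak_mem_MB
a,40.0,1.5
b,50.0,2.5
"""


def summarise_log(handle):
    """Return call count, mean throughput, and highest peak memory."""
    rows = list(csv.DictReader(handle))
    return {
        "calls": len(rows),
        "mean_tokens_per_sec": mean(float(r["tokens_per_sec"]) for r in rows),
        "max_peak_mem_MB": max(float(r["peak_mem_MB"]) for r in rows),
    }


print(summarise_log(io.StringIO(sample_log)))
```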

⚙️ Key Parameters

| Parameter | Description |
| --- | --- |
| execution_mode | "local" or "remote" |
| provider | "huggingface" / "azure" |
| model_id | Model name or deployment ID |
| api_version | Azure API version if applicable |
| inputs | List of input strings |
| sys_tmpl | Path to system prompt |
| user_tmpl | Path to user prompt |
| gen_args | Dict of generation parameters (e.g. temperature, max_tokens) |
| output_filename | Where to save the result |
| csv_logger_filepath | Filepath for detailed logs |
| extra_inputs | Additional template fields (e.g. domain, question) |
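
To illustrate how extra_inputs might feed a prompt template, here is a rough sketch that fills a template with one input string plus the extra fields. It assumes str.format-style placeholders, which is an assumption only: the placeholder syntax themex actually uses is not stated here, and USER_TEMPLATE and build_user_prompt are hypothetical.

```python
# Hypothetical template text; themex's real templates and placeholder
# syntax may differ.
USER_TEMPLATE = (
    "Domain: {domain}\n"
    "Question: {question}\n"
    "Comment: {comment}"
)


def build_user_prompt(template, comment, extra_inputs):
    """Fill a template with one input string plus the extra_inputs fields."""
    return template.format(comment=comment, **extra_inputs)


extra_inputs = {
    "question": "What are the strengths and weaknesses in this case?",
    "domain": "Strength",
}
prompt = build_user_prompt(
    USER_TEMPLATE, "This is an example comment.", extra_inputs
)
print(prompt)
```

Under this assumption, every placeholder in the template must have a matching key in extra_inputs, otherwise str.format raises a KeyError.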

🧪 Development Status

This project is still in development. Breaking changes are likely.
Use with caution in production environments.


📬 Contact

To report bugs, request features, or contribute ideas, please open an issue on GitHub or contact the maintainer.

