A theme extraction tool using LLMs
themex
⚠️ Caution: This package is under active development and is currently not stable. Interfaces, file structure, and behaviour may change without notice.
themex is a flexible, modular framework designed to support large language model (LLM) tasks across social care, health, and research contexts — including thematic extraction, sentiment analysis, and more.
It supports both local HuggingFace models and remote APIs (such as Azure OpenAI), with configurable prompts, structured outputs, and logging.
📦 Installation
Install with Poetry:
poetry install
Or in editable mode:
pip install -e .
📁 Project Structure
llm-theme-miner/
├── poetry.lock
├── pyproject.toml
├── README.md
└── themex/
    ├── __init__.py
    ├── llm_runner/   # Core logic for calling LLMs
    ├── logger.py     # Logging utilities
    ├── paths.py      # Default paths and file naming logic
    ├── prompts/      # Prompt template files
    └── utils.py      # General utility functions
🚀 Quick Start
This framework is designed for flexible and extensible usage. Below are two minimal working examples.
Example 1 - Using a local HuggingFace model
from multiprocessing import Process
from pathlib import Path

from themex.llm_runner import run_llm

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
sys_tmpl = Path("./prompts/system_prompt.txt")
user_tmpl = Path("./prompts/theming_sentiment.txt")

# Guard the entry point: required where multiprocessing uses the
# "spawn" start method (the default on macOS and Windows).
if __name__ == "__main__":
    p = Process(target=run_llm, kwargs={
        "execution_mode": "local",
        "provider": "huggingface",
        "model_id": model_id,
        "inputs": ["This is an example comment."],
        "sys_tmpl": sys_tmpl,
        "user_tmpl": user_tmpl,
        "gen_args": {
            "temperature": 0.7,
            "max_new_tokens": 300,
        },
        "output_filename": "output.csv",
        "csv_logger_filepath": "log.csv",
        # Extra fields made available to the prompt templates
        "extra_inputs": {
            "question": "What are the strengths and weaknesses in this case?",
            "domain": "Strength",
        },
    })
    p.start()
    p.join()  # wait for the subprocess to finish
Example 2 - Using Azure OpenAI remotely
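from multiprocessing import Process
from pathlib import Path

from themex.llm_runner import run_llm

# Guard the entry point: required where multiprocessing uses the
# "spawn" start method (the default on macOS and Windows).
if __name__ == "__main__":
    p = Process(target=run_llm, kwargs={
        "execution_mode": "remote",
        "provider": "azure",
        "model_id": "gpt-4.1",
        "api_version": "2025-01-01-preview",
        "inputs": ["Another example comment."],
        "sys_tmpl": Path("./prompts/system_prompt.txt"),
        "user_tmpl": Path("./prompts/theming_sentiment.txt"),
        "gen_args": {
            "temperature": 0.4,
        },
        "output_filename": "azure_output.csv",
        "csv_logger_filepath": "azure_log.csv",
        # Extra fields made available to the prompt templates
        "extra_inputs": {
            "question": "What are the strengths and weaknesses in this case?",
            "domain": "Strength",
        },
    })
    p.start()
    p.join()  # wait for the subprocess to finish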
💡 Note on Multi-Process Execution
The examples use Python's multiprocessing.Process to run each task in a separate subprocess.
This is not mandatory, but can be helpful, particularly when using local models (e.g. with execution_mode="local").
Running in a subprocess ensures that memory (especially GPU memory) is fully released after the task completes, helping prevent memory leaks or out-of-memory errors during batch processing.
Feel free to adapt the structure for your own scheduling or orchestration needs.
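The subprocess pattern described above extends to batches of models. The following is a minimal sketch and not part of themex: make_jobs and run_isolated are hypothetical helper names, and the per-model file naming scheme is an assumption chosen for illustration. The only thing taken from the Quick Start is the set of keyword arguments passed to run_llm.

```python
from multiprocessing import Process
from typing import Any, Callable, Dict, List, Optional


def make_jobs(model_ids: List[str], base_kwargs: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one kwargs dict per model, each with its own output and log file."""
    jobs = []
    for model_id in model_ids:
        # Model IDs may contain "/", which is not safe inside a file name.
        safe_name = model_id.replace("/", "__")
        job = dict(base_kwargs)  # shallow copy so base_kwargs is left untouched
        job.update(
            model_id=model_id,
            output_filename=f"{safe_name}_output.csv",
            csv_logger_filepath=f"{safe_name}_log.csv",
        )
        jobs.append(job)
    return jobs


def run_isolated(target: Callable[..., Any], kwargs: Dict[str, Any]) -> Optional[int]:
    """Run target(**kwargs) in its own process and return the exit code."""
    p = Process(target=target, kwargs=kwargs)
    p.start()
    p.join()  # wait, so that only one job holds memory at a time
    return p.exitcode
```

With run_llm imported from themex.llm_runner, a batch could then be run as a loop of run_isolated(run_llm, job) calls over make_jobs(...), placed inside an `if __name__ == "__main__":` guard. A non-zero exit code indicates that the subprocess did not finish cleanly.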
📄 Output Format (Example)
The framework saves structured outputs to CSV. Fields depend on the prompt structure, but may include:
| comment_id | model_id | domain | topic | evidence | impact | root_cause | sentiment |
|---|---|---|---|---|---|---|---|
| 1 | gpt-4.1 | Strength | Family Contact Support | ... | positive | | positive |
🧠 Field Definitions
- evidence: A verbatim quote from the original input text that supports or illustrates the identified topic. It serves as direct justification for the theme.
- root_cause: If the impact is "negative", this field provides a short explanatory phrase reflecting the likely underlying structural, procedural, or systemic cause of the issue. It is not a restatement of the evidence, but an inferred explanation.
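Once written, the output CSV can be summarised with the standard library alone. The sketch below counts how often each topic appears with each impact value. It assumes a comma-delimited file with a header row matching the example columns in this section; the sample rows are invented for illustration, and count_topics_by_impact is a hypothetical helper, not a themex function.

```python
import csv
import io
from collections import Counter

# Invented rows for illustration only; real files are written to output_filename.
SAMPLE = """comment_id,model_id,domain,topic,evidence,impact,root_cause,sentiment
1,gpt-4.1,Strength,Family Contact Support,...,positive,,positive
2,gpt-4.1,Strength,Staff Turnover,...,negative,...,negative
3,gpt-4.1,Strength,Family Contact Support,...,positive,,positive
"""


def count_topics_by_impact(csv_file) -> Counter:
    """Count (topic, impact) pairs in an output CSV opened as a text stream."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Fields depend on the prompt, so fall back to "" if a column is absent.
        counts[(row.get("topic", ""), row.get("impact", ""))] += 1
    return counts


counts = count_topics_by_impact(io.StringIO(SAMPLE))
for (topic, impact), n in counts.most_common():
    print(f"{topic} [{impact}]: {n}")
```

For a real run, replace the in-memory sample with `open("output.csv", newline="", encoding="utf-8")`.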
🧾 CSV Logger Output (Optional)
If csv_logger_filepath is specified, the framework will save an additional per-call log file capturing key runtime statistics, LLM behaviour, and inputs/outputs.
✅ When is it created?
- Only when csv_logger_filepath is explicitly set in the run_llm parameters
- If omitted, no logger file is generated
📋 Example fields in the logger:
| comment_id | context_len | current_mem_MB | do_sample | extra_fields | generated_token_len | increment_MB | input_len | input_token_len | max_new_tokens | model_id | output | peak_mem_MB | raw_output | system_prompt | temperature | tokens_per_sec | torch_dtype | total_time_sec | user_prompt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | | 1.57 | | {"domain": "Strength"} | 55 | 1.57 | 1 | 991 | | gpt-4.1 | … | 1.63 | … | … | 0.2 | 40.86 | None | 1.35 | … |
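The logger file lends itself to simple throughput and memory summaries. The following sketch aggregates a few of the numeric columns named above while skipping empty cells, since some columns may be blank for a given call. The values are invented for illustration, the comma-delimited layout is an assumption, and to_floats is a hypothetical helper.

```python
import csv
import io
from statistics import mean

# Invented values for illustration; column names come from the logger table.
SAMPLE_LOG = """comment_id,tokens_per_sec,total_time_sec,peak_mem_MB
a,40.0,1.5,1.5
b,20.0,2.5,2.0
c,,1.0,
"""


def to_floats(rows, column):
    """Collect a numeric column, skipping cells that are empty or absent."""
    values = []
    for row in rows:
        cell = (row.get(column) or "").strip()
        if cell:
            values.append(float(cell))
    return values


rows = list(csv.DictReader(io.StringIO(SAMPLE_LOG)))
summary = {
    "calls": len(rows),
    "mean_tokens_per_sec": mean(to_floats(rows, "tokens_per_sec")),
    "total_time_sec": sum(to_floats(rows, "total_time_sec")),
    "max_peak_mem_MB": max(to_floats(rows, "peak_mem_MB")),
}
print(summary)
```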
⚙️ Key Parameters
| Parameter | Description |
|---|---|
| execution_mode | "local" or "remote" |
| provider | "huggingface" / "azure" |
| model_id | Model name or deployment ID |
| api_version | Azure API version if applicable |
| inputs | List of input strings |
| sys_tmpl | Path to system prompt |
| user_tmpl | Path to user prompt |
| gen_args | Dict of generation parameters (e.g. temperature, max_tokens) |
| output_filename | Where to save the result |
| csv_logger_filepath | Filepath for detailed logs |
| extra_inputs | Additional template fields (e.g. domain, question) |
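When the same settings are reused across many calls, it can help to gather the parameters in one place and catch typos before a model is loaded. The sketch below is one possible way to do this and is not part of themex: RunConfig and to_kwargs are hypothetical names. It assumes that the two execution modes and two providers listed in the table are the only accepted values, and that api_version and csv_logger_filepath may simply be left out when unused.

```python
from dataclasses import dataclass, field, fields
from pathlib import Path
from typing import Optional


@dataclass
class RunConfig:
    """Hypothetical container mirroring the parameters in the table above."""
    execution_mode: str
    provider: str
    model_id: str
    inputs: list
    sys_tmpl: Path
    user_tmpl: Path
    output_filename: str
    api_version: Optional[str] = None
    csv_logger_filepath: Optional[str] = None
    gen_args: dict = field(default_factory=dict)
    extra_inputs: dict = field(default_factory=dict)

    def to_kwargs(self) -> dict:
        """Check the enumerated options, then return kwargs without None values."""
        if self.execution_mode not in ("local", "remote"):
            raise ValueError(f"unknown execution_mode: {self.execution_mode!r}")
        if self.provider not in ("huggingface", "azure"):
            raise ValueError(f"unknown provider: {self.provider!r}")
        kwargs = {f.name: getattr(self, f.name) for f in fields(self)}
        return {k: v for k, v in kwargs.items() if v is not None}


config = RunConfig(
    execution_mode="local",
    provider="huggingface",
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    inputs=["This is an example comment."],
    sys_tmpl=Path("./prompts/system_prompt.txt"),
    user_tmpl=Path("./prompts/theming_sentiment.txt"),
    output_filename="output.csv",
    gen_args={"temperature": 0.7, "max_new_tokens": 300},
)
kwargs = config.to_kwargs()
```

The resulting dict has the same shape as the kwargs passed to Process in the Quick Start examples.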
🧪 Development Status
This project is still in development. Breaking changes are likely.
Use with caution in production environments.
📬 Contact
To report bugs, request features, or contribute ideas, please open an issue on GitHub or contact the maintainer.