A theme extraction tool using LLMs

themex

⚠️ Caution: This package is under active development and is currently not stable. Interfaces, file structure, and behaviour may change without notice.

themex is a flexible, modular framework designed to support large language model (LLM) tasks across social care, health, and research contexts — including thematic extraction, sentiment analysis, and more.

It supports both local HuggingFace models and remote APIs (such as Azure OpenAI), with configurable prompts, structured outputs, and logging.

📦 Installation

Install with Poetry:

poetry install

Or in editable mode:

pip install -e .

📁 Project Structure

llm-theme-miner/
├── poetry.lock
├── pyproject.toml
├── README.md
└── themex/
    ├── __init__.py
    ├── llm_runner/          # Core logic for calling LLMs
    ├── logger.py            # Logging utilities
    ├── paths.py             # Default paths and file naming logic
    ├── prompts/             # Prompt template files
    └── utils.py             # General utility functions

🚀 Quick Start

This framework is designed for flexible and extensible usage. Below are two minimal working examples.

Example 1 - Using a local HuggingFace model

from multiprocessing import Process
from pathlib import Path

from themex.llm_runner import run_llm

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
sys_tmpl = Path("./prompts/system_prompt.txt")
user_tmpl = Path("./prompts/theming_sentiment.txt")

# Guard the entry point: required when multiprocessing uses the "spawn"
# start method (the default on Windows and macOS).
if __name__ == "__main__":
    p = Process(target=run_llm, kwargs={
        "execution_mode": "local",
        "provider": "huggingface",
        "model_id": model_id,
        "inputs": ["This is an example comment."],
        "sys_tmpl": sys_tmpl,
        "user_tmpl": user_tmpl,
        "gen_args": {
            "temperature": 0.7,
            "max_new_tokens": 300
        },
        "output_filename": "output.csv",
        "csv_logger_filepath": "log.csv",
        # Extra fields made available to the prompt templates
        "extra_inputs": {
            "question": "What are the strengths and weaknesses in this case?",
            "domain": "Strength"
        }
    })
    p.start()
    p.join()  # wait for the subprocess to finish

Example 2 - Using Azure OpenAI remotely

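from multiprocessing import Process
from pathlib import Path

from themex.llm_runner import run_llm

# Guard the entry point: required when multiprocessing uses the "spawn"
# start method (the default on Windows and macOS).
if __name__ == "__main__":
    p = Process(target=run_llm, kwargs={
        "execution_mode": "remote",
        "provider": "azure",
        "model_id": "gpt-4.1",
        "api_version": "2025-01-01-preview",
        "inputs": ["Another example comment."],
        "sys_tmpl": Path("./prompts/system_prompt.txt"),
        "user_tmpl": Path("./prompts/theming_sentiment.txt"),
        "gen_args": {
            "temperature": 0.4,
        },
        "output_filename": "azure_output.csv",
        "csv_logger_filepath": "azure_log.csv",
        "extra_inputs": {
            "question": "What are the strengths and weaknesses in this case?",
            "domain": "Strength"
        }
    })
    p.start()
    p.join()  # wait for the subprocess to finish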

💡 Note on Multi-Process Execution

The examples use Python's multiprocessing.Process to run each task in a separate subprocess.

This is not mandatory, but can be helpful, particularly when using local models (e.g. with execution_mode="local").

Running in a subprocess ensures that memory (especially GPU memory) is fully released after the task completes, helping prevent memory leaks or out-of-memory errors during batch processing.

Feel free to adapt the structure for your own scheduling or orchestration needs.
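As one way to adapt this for batch processing, the sketch below runs each batch of comments in its own subprocess, one after another. The helper names (chunk, build_job, run_isolated, main) and the per-batch output file names are hypothetical and not part of themex; only run_llm and its parameters come from the examples above.

```python
from multiprocessing import Process
from pathlib import Path


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_job(batch, index):
    """Build run_llm keyword arguments for one batch.

    Parameter names follow Example 1; the per-batch output file name
    is an illustrative choice, not a themex convention.
    """
    return {
        "execution_mode": "local",
        "provider": "huggingface",
        "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
        "inputs": batch,
        "sys_tmpl": Path("./prompts/system_prompt.txt"),
        "user_tmpl": Path("./prompts/theming_sentiment.txt"),
        "gen_args": {"temperature": 0.7, "max_new_tokens": 300},
        "output_filename": f"output_{index}.csv",
        "extra_inputs": {
            "question": "What are the strengths and weaknesses in this case?",
            "domain": "Strength",
        },
    }


def run_isolated(target, kwargs):
    """Run target(**kwargs) in a child process and return its exit code."""
    p = Process(target=target, kwargs=kwargs)
    p.start()
    p.join()  # memory held by the child is returned to the OS when it exits
    return p.exitcode


def main():
    # Imported here so the helpers above can be used without themex installed.
    from themex.llm_runner import run_llm

    comments = ["First comment.", "Second comment.", "Third comment."]
    for i, batch in enumerate(chunk(comments, 2)):
        code = run_isolated(run_llm, build_job(batch, i))
        if code != 0:
            print(f"batch {i} exited with code {code}")
```

In a script, call main() from inside an `if __name__ == "__main__":` guard so that the "spawn" start method does not re-run it in the child processes. Running batches sequentially keeps only one model in memory at a time.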

📄 Output Format (Example)

The framework saves structured outputs to CSV. The fields depend on the prompt structure, but may include:

comment_id	model_id	domain	topic	evidence	impact	root_cause	sentiment
1	gpt-4.1	Strength	Family Contact Support	...	positive		positive

🧠 Field Definitions

evidence: A verbatim quote from the original input text that supports or illustrates the identified topic. It serves as direct justification for the theme.
root_cause: If the impact is "negative", this field provides a short explanatory phrase reflecting the likely underlying structural, procedural, or systemic cause of the issue. It is not a restatement of the evidence, but an inferred explanation.
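The definitions of evidence and root_cause can be turned into simple sanity checks on each output row. The sketch below is not part of themex: check_row is a hypothetical helper, the sample comment and rows are made up for illustration, and the third check assumes that root_cause stays empty whenever the impact is not "negative" (as in the example row).

```python
def check_row(row, comment):
    """Return a list of problems found in one output row (empty if none).

    `row` is a dict with the output columns; `comment` is the original input text.
    """
    problems = []
    # evidence should be a verbatim quote from the input text
    if row["evidence"] not in comment:
        problems.append("evidence is not a verbatim quote from the input")
    # root_cause is expected when the impact is negative
    if row["impact"] == "negative" and not row["root_cause"].strip():
        problems.append("negative impact without a root_cause")
    # Assumption: root_cause is left empty for non-negative impacts
    if row["impact"] != "negative" and row["root_cause"].strip():
        problems.append("root_cause set although impact is not negative")
    return problems


# Illustrative data only
comment = "Staff arranged weekly calls with my family, but the forms were confusing."
row_ok = {
    "evidence": "Staff arranged weekly calls with my family",
    "impact": "positive",
    "root_cause": "",
}
row_bad = {
    "evidence": "forms are confusing",  # paraphrase, not a verbatim quote
    "impact": "negative",
    "root_cause": "",
}

print(check_row(row_ok, comment))
print(check_row(row_bad, comment))
```

Exact substring matching is deliberately strict; if the model normalises whitespace or punctuation in its quotes, a looser comparison may be more appropriate.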

🧾 CSV Logger Output (Optional)

If csv_logger_filepath is specified, the framework will save an additional per-call log file capturing key runtime statistics, LLM behaviour, and inputs/outputs.

✅ When is it created?

Only when csv_logger_filepath is explicitly set in run_llm parameters
If omitted, no logger file is generated

📋 Example fields in the logger:

comment_id	context_len	current_mem_MB	do_sample	extra_fields	generated_token_len	increment_MB	input_len	input_token_len	max_new_tokens	model_id	output	peak_mem_MB	raw_output	system_prompt	temperature	tokens_per_sec	torch_dtype	total_time_sec	user_prompt
id		1.57		{"domain": "Strength"}	55	1.57	1	991		gpt-4.1	…	1.63	…	…	0.2	40.86	None	1.35	…
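The logger file can be summarised with the standard library. The sketch below assumes the log is a comma-delimited CSV with a header row containing the tokens_per_sec and total_time_sec columns listed above; summarise_log is a hypothetical helper, and the in-memory sample reuses a few values from the example row.

```python
import csv
import io
from statistics import mean


def summarise_log(log_file):
    """Aggregate per-call statistics from a themex-style logger CSV."""
    rows = list(csv.DictReader(log_file))
    return {
        "calls": len(rows),
        "total_time_sec": sum(float(r["total_time_sec"]) for r in rows),
        "mean_tokens_per_sec": mean(float(r["tokens_per_sec"]) for r in rows),
    }


# In-memory sample with a subset of the columns; for a real file use
# open("log.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "comment_id,generated_token_len,tokens_per_sec,total_time_sec\n"
    "id,55,40.86,1.35\n"
)
summary = summarise_log(sample)
print(summary)
```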

⚙️ Key Parameters

Parameter	Description
`execution_mode`	`"local"` or `"remote"`
`provider`	`"huggingface"` / `"azure"`
`model_id`	Model name or deployment ID
`api_version`	Azure API version if applicable
`inputs`	List of input strings
`sys_tmpl`	Path to system prompt
`user_tmpl`	Path to user prompt
`gen_args`	Dict of generation parameters (e.g. `temperature`, `max_new_tokens`)
`output_filename`	Where to save the result
`csv_logger_filepath`	Filepath for detailed logs
`extra_inputs`	Additional template fields (e.g. `domain`, `question`)
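Because run_llm takes many keyword arguments, it can help to check them before starting a subprocess. The sketch below is a hypothetical pre-flight check, not part of themex: the allowed values are taken from the table above, while the set of required parameters and the rule that api_version accompanies the azure provider are assumptions, and the library itself may accept more.

```python
# Allowed values as listed in the parameter table
ALLOWED = {
    "execution_mode": {"local", "remote"},
    "provider": {"huggingface", "azure"},
}
# Assumption: these parameters are always needed
REQUIRED = ("execution_mode", "provider", "model_id", "inputs", "sys_tmpl", "user_tmpl")


def check_run_args(kwargs):
    """Return a list of problems with a run_llm kwargs dict (empty if none)."""
    errors = []
    for name in REQUIRED:
        if name not in kwargs:
            errors.append(f"missing parameter: {name}")
    for name, allowed in ALLOWED.items():
        if name in kwargs and kwargs[name] not in allowed:
            errors.append(f"{name} must be one of {sorted(allowed)}")
    inputs = kwargs.get("inputs", [])
    if not isinstance(inputs, list) or not all(isinstance(x, str) for x in inputs):
        errors.append("inputs must be a list of strings")
    # Assumption: the azure provider needs an explicit API version
    if kwargs.get("provider") == "azure" and "api_version" not in kwargs:
        errors.append("api_version is expected for the azure provider")
    return errors


args = {
    "execution_mode": "remote",
    "provider": "azure",
    "model_id": "gpt-4.1",
    "api_version": "2025-01-01-preview",
    "inputs": ["Another example comment."],
    "sys_tmpl": "./prompts/system_prompt.txt",
    "user_tmpl": "./prompts/theming_sentiment.txt",
}
print(check_run_args(args))
print(check_run_args({**args, "execution_mode": "cloud"}))
```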

🧪 Development Status

This project is still in development. Breaking changes are likely.
Use with caution in production environments.

📬 Contact

To report bugs, request features, or contribute ideas, please open an issue on GitHub or contact the maintainer.

Release history

0.1.0a1.post3 (pre-release): Nov 14, 2025
0.1.0a1.post2 (pre-release): Jun 19, 2025
0.1.0a1.post1 (pre-release): Jun 6, 2025
0.1.0a1 (pre-release, this version): Jun 6, 2025

Download files

Source Distribution

themex-0.1.0a1.tar.gz (24.9 kB)

Uploaded Jun 6, 2025 Source

Built Distribution

themex-0.1.0a1-py3-none-any.whl (28.1 kB)

Uploaded Jun 6, 2025 Python 3

File details

Details for the file themex-0.1.0a1.tar.gz.

File metadata

Download URL: themex-0.1.0a1.tar.gz
Upload date: Jun 6, 2025
Size: 24.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.12.7 Darwin/24.5.0

File hashes

Hashes for themex-0.1.0a1.tar.gz
Algorithm	Hash digest
SHA256	`ea24decb8f7ba2409555414612de6b9867f49d3aabf6caa45585fa3425f9d396`
MD5	`609997c1a11a5cab07c25caaa0cb60ee`
BLAKE2b-256	`c54bccbda103e2bd17097591516b48533f798a958c469d0b0bb896ab30483551`

File details

Details for the file themex-0.1.0a1-py3-none-any.whl.

File metadata

Download URL: themex-0.1.0a1-py3-none-any.whl
Upload date: Jun 6, 2025
Size: 28.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.12.7 Darwin/24.5.0

File hashes

Hashes for themex-0.1.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9ec4f683fcb53f44a106e2bf9fe73f697234cede886768de77fb7cdd8a49f3ff`
MD5	`aed6d8af0bbad985882421474e2e2d63`
BLAKE2b-256	`9d92fae8bbedba8e361426799b8ac55186d905b65c93211b2677e3a97493ad7d`
