
A multi-agent LLM topic modeling library.


MALTopic: Multi-Agent LLM Topic Modeling Library

MALTopic is a powerful library designed for topic modeling using a multi-agent approach. It leverages the capabilities of large language models (LLMs) to enhance the analysis of survey responses by integrating structured and unstructured data.

The MALTopic research paper was published at the 2025 World AI-IoT Congress.

Features

  • Multi-Agent Framework: Decomposes topic modeling into specialized tasks executed by individual LLM agents.
  • Data Enrichment: Enhances textual responses using structured and categorical survey data.
  • Latent Theme Extraction: Extracts meaningful topics from enriched responses.
  • Topic Deduplication: Intelligently refines and consolidates identified topics using LLM-powered semantic analysis for better interpretability.
  • Automatic Batching: Handles large datasets by automatically splitting data into manageable batches when token limits are exceeded.
  • Intelligent Error Handling: Detects token limit errors and seamlessly switches to batching mode without user intervention.
  • Comprehensive Statistics Tracking: Automatically tracks LLM usage, token consumption, API performance, and costs with detailed metrics and reporting.

Installation

To install the MALTopic library, you can use pip:

pip install maltopic

GUI (No-Code Interface)

MALTopic includes a built-in web-based GUI for non-technical users. After installing, simply run:

maltopic-gui

This launches a step-by-step wizard in your browser where you can:

  1. Configure your API key, model, and LLM provider
  2. Upload a CSV file and select your free-text and structured data columns
  3. Enrich free-text responses with structured data context
  4. Generate topics from the enriched data
  5. Deduplicate similar topics (optional)
  6. Export results — enriched CSV, topics JSON, and usage stats

All processing happens locally on your machine. The only data sent externally is to your configured LLM provider (e.g. OpenAI) for enrichment and topic mining. You can also run the GUI with python -m maltopic.gui.

Usage

To use the MALTopic library, initialize the main class with your API key and model name. OpenAI is currently the only supported LLM provider; Google Gemini and Llama are not supported yet.

import pandas as pd
from maltopic import MALTopic

# Load your survey data ("survey.csv" is a placeholder file name)
df = pd.read_csv("survey.csv")

# Initialize the MALTopic class
client = MALTopic(
    api_key="your_api_key",
    default_model_name="gpt-4.1-nano",
    llm_type="openai",
    # Optional: override default model parameters (None = automatic handling)
    override_model_params=None,
)

enriched_df = client.enrich_free_text_with_structured_data(
        survey_context="context about survey, why, how of it...",
        free_text_column="column_1",
        structured_data_columns=["column_2", "column_3"],
        df=df,
        examples=["free text response, category 1 -> free text response with additional context", "..."], # optional
    )

topics = client.generate_topics(
        topic_mining_context="context about what kind of topics you want to mine",
        df=enriched_df,
        enriched_column="column_1" + "_enriched", # MALTopic adds _enriched as the suffix.
    )

# Optionally deduplicate and merge similar topics for cleaner results
deduplicated_topics = client.deduplicate_topics(
        topics=topics,
        survey_context="context about survey, why, how of it...",
    )

print(deduplicated_topics)

# Access comprehensive statistics anytime
stats = client.get_stats()
print(f"Total tokens used: {stats['overview']['total_tokens_used']:,}")
print(f"API calls made: {stats['overview']['total_calls_made']}")
print(f"Success rate: {stats['overview']['success_rate_percent']}%")

# Print detailed formatted statistics
client.print_stats()

Automatic Batching for Large Datasets

MALTopic v1.1.0 introduces intelligent automatic batching to handle large datasets that may exceed LLM token limits. This feature works seamlessly in the background:

How It Works

  1. Automatic Detection: When generate_topics encounters a token limit error, it automatically detects this and switches to batching mode.

  2. Smart Splitting: The library uses tiktoken (OpenAI's token counting library) to intelligently split your data into optimally-sized batches based on actual token counts.

  3. Batch Processing: Each batch is processed independently, with progress tracking to keep you informed.

  4. Topic Consolidation: Topics from all batches are automatically merged and deduplicated to provide a clean, comprehensive result.
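
The four steps above can be pictured with a minimal, self-contained sketch. It is not MALTopic's implementation: TokenLimitError, count_tokens, split_into_batches, and generate_with_fallback are hypothetical names, the token counter simply counts whitespace-separated words, and the final consolidation only drops exact name duplicates instead of calling an LLM.

```python
class TokenLimitError(Exception):
    """Hypothetical error standing in for a provider's token-limit failure."""


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def split_into_batches(texts: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily pack texts into batches whose token total stays within max_tokens.

    A single text larger than max_tokens still gets a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        # Start a new batch when adding this text would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches


def generate_with_fallback(texts, generate, max_tokens):
    """Try a single call first; on a token-limit error, go batch by batch."""
    try:
        return generate(texts)
    except TokenLimitError:
        topics = []
        for batch in split_into_batches(texts, max_tokens):
            topics.extend(generate(batch))
        # Consolidate: keep the first topic seen for each case-insensitive name.
        seen = set()
        unique = []
        for topic in topics:
            key = topic["name"].strip().lower()
            if key not in seen:
                seen.add(key)
                unique.append(topic)
        return unique
```

Here generate is any callable that takes a list of texts and returns a list of topic dictionaries; in this sketch it is expected to raise TokenLimitError when the input is too large.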

Key Benefits

  • No Code Changes Required: Existing code works without modification - batching happens automatically when needed.
  • Optimal Performance: Uses actual token counting for precise batch sizing, maximizing efficiency.
  • Robust Fallback: Even works without tiktoken by falling back to simple batch splitting.
  • Progress Visibility: Shows batch processing progress so you know what's happening.
  • Quality Preservation: Maintains topic quality through intelligent consolidation and deduplication.
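
As a rough illustration of the fallback idea, the sketch below tries to load a tiktoken encoding and, if that is not possible, falls back to fixed-size splitting by row count. The helper names load_encoder and simple_split, the "cl100k_base" encoding choice, and the batch size are assumptions made for this example and are not taken from MALTopic's source.

```python
def load_encoder(name: str = "cl100k_base"):
    """Return a tiktoken encoding, or None when tiktoken cannot be used."""
    try:
        import tiktoken  # optional dependency

        return tiktoken.get_encoding(name)
    except Exception:
        # tiktoken is not installed, or its encoding data could not be loaded.
        return None


def simple_split(items: list, rows_per_batch: int) -> list[list]:
    """Fallback: split into fixed-size chunks without counting tokens."""
    return [items[i:i + rows_per_batch] for i in range(0, len(items), rows_per_batch)]


encoder = load_encoder()
responses = ["too slow", "love the new design", "crashes on login"]
if encoder is None:
    # No tokenizer available: fall back to fixed-size batches.
    batches = simple_split(responses, rows_per_batch=2)
else:
    # Tokenizer available: exact token counts can drive the batch sizes.
    token_counts = [len(encoder.encode(text)) for text in responses]
```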

Example Output

When batching is triggered, you'll see output like:

Token limit exceeded, splitting into batches...
Processing 3 batches...
Processing batches: 100%|██████████| 3/3 [00:45<00:00, 15.2s/it]
Batch 1/3: Generated 12 topics
Batch 2/3: Generated 8 topics  
Batch 3/3: Generated 10 topics
Consolidated 30 topics into 25 unique topics

This feature makes MALTopic suitable for processing large-scale survey datasets without worrying about token limitations.

Intelligent Topic Deduplication

MALTopic v1.2.0 introduces intelligent topic deduplication that goes beyond simple string matching to provide semantic analysis and consolidation of similar topics.

How It Works

  1. Semantic Analysis: Uses LLM to analyze topic meanings, descriptions, and context rather than just comparing names.

  2. Smart Merging: Identifies topics with significant semantic overlap (>80% similarity) and intelligently merges them while preserving unique perspectives.

  3. Structure Preservation: Maintains the original topic structure and combines information from merged topics:

    • Names: Chooses the most descriptive and comprehensive name
    • Descriptions: Combines descriptions to capture all relevant aspects
    • Relevance: Merges relevance information from all source topics
    • Representative Words: Combines word lists, removing duplicates

  4. Quality Preservation: Preserves genuinely unique topics that represent distinct concepts with no significant overlap.
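
The merge rules listed under Structure Preservation can be approximated deterministically. In the library the similarity decision and the merge are delegated to an LLM, so the function below (merge_topics, a hypothetical name) is only a sketch of the intended outcome: it uses "longer name" as a stand-in for "most descriptive" and assumes the two topics have already been judged to overlap.

```python
def merge_topics(a: dict, b: dict) -> dict:
    """Merge two topic dicts that were judged to describe the same theme."""
    # Name: keep the longer one as a rough proxy for "most descriptive".
    name = a["name"] if len(a["name"]) >= len(b["name"]) else b["name"]
    # Description and relevance: join the distinct texts so nothing is lost.
    description = " ".join(dict.fromkeys([a["description"], b["description"]]))
    relevance = "; ".join(dict.fromkeys([a["relevance"], b["relevance"]]))
    # Representative words: case-insensitive union, first occurrence wins.
    words, seen = [], set()
    for word in a["representative_words"] + b["representative_words"]:
        key = word.lower()
        if key not in seen:
            seen.add(key)
            words.append(word)
    return {
        "name": name,
        "description": description,
        "relevance": relevance,
        "representative_words": words,
    }
```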

Key Benefits

  • Higher Quality Results: Eliminates redundant or highly similar topics for cleaner analysis
  • Semantic Understanding: Goes beyond keyword matching to understand topic meanings
  • Flexible Control: Can be used optionally - existing workflows continue to work unchanged
  • Robust Fallback: Returns original topics unchanged if deduplication fails
  • Context-Aware: Uses survey context to make better merging decisions

Usage Example

# Generate topics as usual
topics = client.generate_topics(
    topic_mining_context="Extract themes from customer feedback",
    df=enriched_df,
    enriched_column="feedback_enriched"
)

# Apply intelligent deduplication
deduplicated_topics = client.deduplicate_topics(
    topics=topics,
    survey_context="Customer satisfaction survey for mobile app"
)

print(f"Original topics: {len(topics)}")
print(f"After deduplication: {len(deduplicated_topics)}")

Example Output

When deduplication is applied, you'll see output like:

Deduplicated 15 topics into 12 unique topics

This feature is particularly useful when:

  • Working with large datasets that produce many overlapping topics
  • You need cleaner, more consolidated results for reporting
  • Multiple batches have generated similar topics that need consolidation

Comprehensive Statistics Tracking

MALTopic includes built-in statistics tracking that automatically monitors your LLM usage, providing valuable insights into token consumption, API performance, and costs.

Key Metrics Tracked

  • Token Usage: Input, output, and total tokens from all API calls
  • API Performance: Call counts, success/failure rates, and response times
  • Model Breakdown: Statistics separated by each model used
  • Cost Monitoring: Data needed to calculate estimated API costs
  • Real-time Updates: Statistics update automatically as you use the library

Accessing Statistics

MALTopic provides three simple methods to access your usage statistics:

# Get comprehensive statistics as a dictionary
stats = client.get_stats()
print(f"Total tokens used: {stats['overview']['total_tokens_used']:,}")
print(f"Average response time: {stats['averages']['avg_response_time_seconds']:.2f}s")

# Print a formatted summary to console
client.print_stats()

# Reset statistics to start fresh
client.reset_stats()
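
One way to combine these methods is to measure a single pipeline stage in isolation: reset the counters, run the stage, then read the totals. The sketch below relies only on reset_stats() and get_stats() as described above; FakeClient and tokens_for_stage are hypothetical names, and FakeClient exists only so the example runs without an API key.

```python
class FakeClient:
    """Stand-in for a MALTopic client so the sketch runs offline."""

    def __init__(self):
        self._tokens = 0

    def reset_stats(self):
        self._tokens = 0

    def get_stats(self):
        return {"overview": {"total_tokens_used": self._tokens}}

    def work(self, tokens):
        # Pretend an LLM call consumed this many tokens.
        self._tokens += tokens


def tokens_for_stage(client, stage):
    """Reset the counters, run one stage, and report the tokens it used."""
    client.reset_stats()
    stage()
    return client.get_stats()["overview"]["total_tokens_used"]


client = FakeClient()
client.work(100)  # earlier usage that should not be counted
used = tokens_for_stage(client, lambda: client.work(250))
print(f"Tokens used by this stage: {used}")
```

With a real client, stage would be a callable that wraps one step of the pipeline, for example a call to generate_topics.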

Example Statistics Output

When you call client.print_stats(), you'll see output like:

============================================================
MALTopic Library Usage Statistics
============================================================

📊 Overview:
  Total Tokens Used: 2,450
  - Input Tokens: 1,800
  - Output Tokens: 650
  Total API Calls: 8
  - Successful: 8
  - Failed: 0
  Success Rate: 100.0%
  Uptime: 125.3 seconds

📈 Averages:
  Avg Tokens per Call: 306.3
  - Avg Input Tokens: 225.0
  - Avg Output Tokens: 81.3
  Avg Response Time: 2.15s

🤖 Model Breakdown:
  gpt-4:
    Calls: 8 (Success: 8, Failed: 0)
    Tokens: 2,450 (Avg: 306.3)
    Success Rate: 100.0%
============================================================

Cost Estimation Example

Use the statistics to estimate your API costs:

stats = client.get_stats()

# Example with GPT-4 pricing (as of 2024)
input_cost = (stats['overview']['total_input_tokens'] / 1000) * 0.03  # $0.03 per 1K input tokens
output_cost = (stats['overview']['total_output_tokens'] / 1000) * 0.06  # $0.06 per 1K output tokens
total_estimated_cost = input_cost + output_cost

print(f"Estimated API cost: ${total_estimated_cost:.4f}")
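
The same arithmetic can be wrapped in a small helper so the prices are passed in rather than hard-coded. This is a sketch: estimate_cost is a hypothetical helper, not part of MALTopic, and the sample dictionary only mimics the two overview keys used above, with the token counts taken from the example statistics output.

```python
def estimate_cost(stats: dict, input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate API cost in dollars from a get_stats()-style dictionary."""
    overview = stats["overview"]
    input_cost = overview["total_input_tokens"] / 1000 * input_price_per_1k
    output_cost = overview["total_output_tokens"] / 1000 * output_price_per_1k
    return input_cost + output_cost


# Sample numbers: 1,800 input tokens and 650 output tokens.
sample_stats = {"overview": {"total_input_tokens": 1800, "total_output_tokens": 650}}

# 1.8 * 0.03 + 0.65 * 0.06 = 0.054 + 0.039 = 0.093
cost = estimate_cost(sample_stats, input_price_per_1k=0.03, output_price_per_1k=0.06)
print(f"Estimated API cost: ${cost:.4f}")
```

Prices change over time, so check your provider's current price list before relying on the result.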

Benefits

  • Cost Control: Monitor token usage to manage API expenses
  • Performance Optimization: Identify bottlenecks and optimize prompts
  • Error Monitoring: Track success rates to catch issues early
  • Usage Insights: Understand patterns across different models and operations

Statistics tracking is automatic and privacy-focused - no data leaves your environment, and statistics are stored only in memory during your session.

Model Parameter Control

MALTopic provides flexible control over OpenAI API parameters through the override_model_params parameter. This gives you fine-grained control when needed while maintaining smart defaults.

Automatic Parameter Handling (Default)

By default (override_model_params=None), MALTopic automatically handles parameters based on the model type:

  • Regular models (gpt-4, gpt-4o, etc.): Uses temperature=0.2, top_p=0.9, seed=12345
  • Reasoning models (o1, o3, gpt-5 series): Automatically excludes unsupported parameters like temperature, top_p, and seed

# Automatic handling - recommended for most users
client = MALTopic(
    api_key="your_api_key",
    default_model_name="gpt-4",
    llm_type="openai"
    # override_model_params=None is the default
)

Custom Parameters

You can override the default parameters by providing a dictionary. This completely replaces the default parameters:

# Use custom parameters
client = MALTopic(
    api_key="your_api_key",
    default_model_name="gpt-4",
    llm_type="openai",
    override_model_params={
        "temperature": 0.8,
        "max_tokens": 500,
        "top_p": 0.95,
        "frequency_penalty": 0.5,
    }
)

Reasoning Models with Custom Parameters

For reasoning models (o1, o3, gpt-5 series), you can specify allowed parameters:

# Only specify supported parameters for reasoning models
client = MALTopic(
    api_key="your_api_key",
    default_model_name="gpt-5-mini",
    llm_type="openai",
    override_model_params={
        "max_tokens": 1000,
        # Don't include temperature, top_p, or seed for reasoning models
    }
)

Minimal Parameters

Use an empty dictionary to send only the base required parameters:

# Minimal parameters - only model, messages, and store
client = MALTopic(
    api_key="your_api_key",
    default_model_name="gpt-4",
    llm_type="openai",
    override_model_params={}
)

Key Points

  • None (default): Automatic intelligent parameter handling based on model type
  • Dictionary: Your parameters completely replace the defaults
  • Empty dict {}: Only base required parameters are sent
  • Flexibility: Works with all OpenAI models, including future models

When to Use Custom Parameters

  • Experimentation: Testing different temperature or sampling settings
  • Specific Requirements: Your use case requires particular parameter values
  • Token Limits: Need to set max_tokens for cost control
  • Advanced Features: Using OpenAI features like frequency_penalty or presence_penalty

For most users, the default automatic handling (override_model_params=None) is recommended as it ensures compatibility with all model types.

Method Reference

Initialization

MALTopic()

Initializes the MALTopic client with API credentials and configuration.

Parameters:

  • api_key (str): Your OpenAI API key
  • default_model_name (str): Model to use (e.g., "gpt-4", "gpt-5-mini", "o1-preview")
  • llm_type (str): LLM provider type (currently only "openai" is supported)
  • override_model_params (dict | None, optional): Custom parameters to override defaults.
    • None (default): Automatic parameter handling based on model type
    • dict: Your parameters completely replace defaults
    • {}: Only base required parameters

Returns: MALTopic instance

Core Methods

enrich_free_text_with_structured_data()

Enhances free-text survey responses with structured data context.

Parameters:

  • survey_context (str): Context about the survey purpose and methodology
  • free_text_column (str): Name of the column containing free-text responses
  • structured_data_columns (list[str]): List of column names with structured data to use for enrichment
  • df (pandas.DataFrame): DataFrame containing the survey data
  • examples (list[str], optional): Examples of enrichment format

Returns: DataFrame with enriched text in a new column with "_enriched" suffix

generate_topics()

Extracts latent themes and topics from enriched survey responses.

Parameters:

  • topic_mining_context (str): Context about what kind of topics to extract
  • df (pandas.DataFrame): DataFrame containing enriched data
  • enriched_column (str): Name of the column containing enriched text

Returns: List of topic dictionaries with structure:

{
    "name": "Topic Name",
    "description": "Detailed description of the topic",
    "relevance": "Who this topic is relevant to",
    "representative_words": ["word1", "word2", "word3"]
}
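
If you pass topics on to other code, it can help to describe this structure as a type and check it before use. The Topic type and is_valid_topic helper below are hypothetical conveniences written for this example; MALTopic itself is only documented as returning plain dictionaries.

```python
from typing import TypedDict


class Topic(TypedDict):
    name: str
    description: str
    relevance: str
    representative_words: list[str]


def is_valid_topic(obj) -> bool:
    """Check that obj has the four documented fields with the expected types."""
    if not isinstance(obj, dict):
        return False
    text_fields = ("name", "description", "relevance")
    if not all(isinstance(obj.get(field), str) for field in text_fields):
        return False
    words = obj.get("representative_words")
    return isinstance(words, list) and all(isinstance(w, str) for w in words)


example: Topic = {
    "name": "Topic Name",
    "description": "Detailed description of the topic",
    "relevance": "Who this topic is relevant to",
    "representative_words": ["word1", "word2", "word3"],
}
print(is_valid_topic(example))
```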

deduplicate_topics()

Intelligently consolidates similar topics using semantic analysis.

Parameters:

  • topics (list[dict]): List of topic dictionaries to deduplicate
  • survey_context (str): Context about the survey to help with merging decisions

Returns: List of deduplicated topic dictionaries with the same structure as input

get_stats()

Returns comprehensive statistics about LLM usage and performance.

Returns: Dictionary containing:

  • overview: Total tokens, calls, success rates, and uptime
  • averages: Average tokens per call, response times, etc.
  • model_breakdown: Statistics separated by model
  • recent_calls: Details of the most recent API calls

print_stats()

Prints a formatted summary of statistics to the console.

Returns: None (prints to console)

reset_stats()

Resets all statistics to zero and starts tracking fresh.

Returns: None

Agents

  • Enrichment Agent: Enhances free-text responses using structured data.
  • Topic Modeling Agent: Extracts latent themes from enriched responses.
  • Deduplication Agent: Intelligently refines and consolidates the extracted topics using LLM-powered semantic analysis.

Changelog

For release notes and version history, see CHANGELOG.md.

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue for any enhancements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Citation

If you use MALTopic in your research, please cite:

@software{Sharma2025maltopic,
  author = {Sharma, Yash},
  title = {MALTopic: A library for topic modeling},
  year = {2025},
  url = {https://github.com/yash91sharma/MALTopic-py}
}
