Multi-lingual Advanced NMF-based Topic Analysis - A comprehensive NMF topic modeling tool for Turkish and English texts

MANTA (Multi-lingual Advanced NMF-based Topic Analysis)

MANTA is a topic modeling system built on Non-negative Matrix Factorization (NMF) and Non-negative Matrix Tri-Factorization (NMTF) that processes both English and Turkish text. It provides subword tokenization (BPE and WordPiece), several factorization algorithms, including NMTF for analyzing relationships between topics, and visualizations such as word clouds and topic distribution plots.

To cite this work:

@article{KARAYAGIZ2025102386,
title = {Manta: Multi-lingual advanced NMF-based topic analysis},
journal = {SoftwareX},
volume = {32},
pages = {102386},
year = {2025},
issn = {2352-7110},
doi = {https://doi.org/10.1016/j.softx.2025.102386},
url = {https://www.sciencedirect.com/science/article/pii/S2352711025003528},
author = {Emir Karayağız and Tolga Berber},
keywords = {Topic modeling, Non-negative matrix factorization, Python, Natural language processing, Information retrieval},
abstract = {This paper presents MANTA (Multi-lingual Advanced NMF-based Topic Analysis), a novel open-source Python library that provides an integrated pipeline to address key limitations in existing topic modeling workflows. MANTA provides an integrated, easy-to-use pipeline for Non-negative Matrix Factorization (NMF) based topic analysis, uniquely combining corpus-specific subword tokenization (BPE/WordPiece) with advanced term weighting schemes (SMART, BM25) and flexible NMF solver options, including a high-performance Projective NMF method. It offers native support for both English and morphologically complex languages like Turkish. With a simple one-function interface and a command-line utility, MANTA lowers the technical barrier for sophisticated topic analysis, making it a powerful tool for researchers in computational social science and digital humanities.}
}

Quick Start

Installing locally for Development

To build and run MANTA locally for development, first clone the repository:

git clone https://github.com/emirkyz/manta.git

After cloning, navigate to the project directory and create a virtual environment:

cd manta
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Next, install the required dependencies. If you have pip installed, you can run:

pip install -e .

or if you have uv installed, you can use:

uv pip install -e .

Installation from PyPI

pip install manta-topic-modelling

Once installed, the package can be imported in Python or run from the command line.
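A quick way to confirm the installation is to ask Python for the installed distribution's version. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example and is not part of MANTA.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "manta-topic-modelling") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "manta-topic-modelling is not installed in this environment")
```

If the script reports that the package is missing, check that the virtual environment used for `pip install` is the one currently activated.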

Python API Usage

from manta import run_topic_analysis

# Simple topic modeling
results = run_topic_analysis(
    filepath="data.csv",
    column="review_text",
    language="EN",
    topic_count=5,
    lemmatize=True
)

# Turkish text analysis
results = run_topic_analysis(
    filepath="turkish_reviews.csv", 
    column="yorum_metni",
    language="TR",
    topic_count=8,
    tokenizer_type="bpe",
    generate_wordclouds=True
)

# NMTF analysis for topic relationship discovery
results = run_topic_analysis(
    filepath="data.csv",
    column="text_content",
    language="TR", 
    topic_count=6,
    nmf_method="nmtf",
    generate_wordclouds=True
)
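With `nmf_method="nmtf"`, the result dictionary includes a `topic_relationships` entry (see Result Structure). The sketch below shows one way to read such a matrix, assuming it is a square table of non-negative weights whose rows and columns follow the topic order. That layout is an assumption, and `strongest_topic_links` is a hypothetical helper, not a MANTA function.

```python
from typing import List, Sequence, Tuple


def strongest_topic_links(
    matrix: Sequence[Sequence[float]], top_n: int = 3
) -> List[Tuple[int, int, float]]:
    """Return the top_n off-diagonal entries as (row_topic, col_topic, weight)."""
    links = [
        (i, j, float(weight))
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if i != j  # skip each topic's relationship with itself
    ]
    links.sort(key=lambda link: link[2], reverse=True)
    return links[:top_n]


# Made-up 3x3 matrix standing in for results["topic_relationships"].
example = [
    [0.90, 0.40, 0.05],
    [0.35, 0.80, 0.10],
    [0.02, 0.20, 0.70],
]
for source, target, weight in strongest_topic_links(example, top_n=2):
    print(f"topic_{source} -> topic_{target}: {weight:.2f}")
```

Ignoring the diagonal keeps the focus on links between different topics.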

Result Structure

{
"state": State of the analysis, either "success" or "error",
"message": Message about the result of the analysis,
"data_name": Name of the input data file,
"topic_word_scores": JSON object containing topics and their top words with scores,
"topic_doc_scores": JSON object containing topics and their top documents with scores,
"coherence_scores": JSON object containing coherence scores for each topic,
"topic_dist_img": Matplotlib plot object of the topic distribution plot if `topic_distribution` is True,
"topic_document_counts": Count of documents per topic,
"topic_relationships": Topic-to-topic relationship matrix (only for the NMTF method)
}

For example:
{
  "state": "success",
  "message": "Analysis completed successfully",
  "data_name": "reviews.csv",
  "topic_word_scores": {
    "topic_0": {
      "word1": 0.15,
      "word2": 0.12,
      "word3": 0.10
    }
  },
  "topic_doc_scores": {
    "topic_0": [
      {
        "document": "Sample document text...",
        "score": 0.78
      }
    ]
  },
  "coherence_scores": {
    "gensim": {
      "umass_average": -1.4328882390292266,
      "umass_per_topic": {
        "topic_0": -1.4328882390292266,
        "topic_1": -1.1234567890123456,
        "topic_2": -0.9876543210987654
      }
    }
  },
  "topic_dist_img": "<matplotlib plot object>",
  "topic_document_counts": [____]
}
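The following sketch shows one way to consume a result dictionary with the shape described above: it checks `state`, then prints the top words per topic alongside the per-topic UMass score when one is present. The helper `summarize_result` and the trimmed `sample` dictionary are illustrative assumptions, not part of the library.

```python
from typing import Any, Dict, List


def summarize_result(result: Dict[str, Any], top_k: int = 3) -> List[str]:
    """Build one summary line per topic from a result dictionary."""
    if result.get("state") != "success":
        raise RuntimeError(result.get("message", "analysis failed"))

    per_topic = (
        result.get("coherence_scores", {}).get("gensim", {}).get("umass_per_topic", {})
    )
    lines = []
    for topic, word_scores in result.get("topic_word_scores", {}).items():
        # Sort words by score, highest first, and keep the top_k.
        ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
        words = ", ".join(word for word, _ in ranked[:top_k])
        umass = per_topic.get(topic)
        suffix = f" (UMass {umass:.2f})" if umass is not None else ""
        lines.append(f"{topic}: {words}{suffix}")
    return lines


# Trimmed-down stand-in for the dictionary returned by run_topic_analysis.
sample = {
    "state": "success",
    "message": "Analysis completed successfully",
    "topic_word_scores": {"topic_0": {"word1": 0.15, "word2": 0.12, "word3": 0.10}},
    "coherence_scores": {"gensim": {"umass_per_topic": {"topic_0": -1.43}}},
}
print("\n".join(summarize_result(sample)))
```

Checking `state` first means a failed run surfaces its `message` instead of failing later on a missing key.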

Command Line Usage

# Turkish text analysis
manta-topic-modelling analyze data.csv --column text --language TR --topics 5

# English text analysis with lemmatization and visualizations
manta-topic-modelling analyze data.csv --column content --language EN --topics 10 --lemmatize --wordclouds --excel

# Custom tokenizer for Turkish text
manta-topic-modelling analyze reviews.csv --column review_text --language TR --topics 8 --tokenizer bpe --wordclouds

# NMTF analysis for topic relationship discovery
manta-topic-modelling analyze data.csv --column text --language TR --topics 5 --nmf-method nmtf

# Filter by app name and country
manta-topic-modelling analyze reviews.csv --column REVIEW --language TR --topics 5 --filter-app MyApp --filter-country TR

# Custom filtering columns
manta-topic-modelling analyze data.csv --column text --language TR --topics 5 --filter-app-column APP_ID --filter-country-column REGION

# Disable emoji processing to speed up the run
manta-topic-modelling analyze data.csv --column text --language EN --topics 5 --emoji-map False

Package Structure

manta/
├── _functions/
│   ├── common_language/          # Shared functionality across languages
│   │   ├── emoji_processor.py    # Emoji handling utilities
│   │   └── topic_extractor.py    # Cross-language topic analysis and extraction
│   ├── english/                  # English text processing modules
│   │   ├── english_entry.py             # English text processing entry point
│   │   ├── english_preprocessor.py      # Text cleaning and preprocessing
│   │   ├── english_vocabulary.py        # Vocabulary creation
│   │   ├── english_text_encoder.py      # Text-to-numerical conversion
│   │   ├── english_topic_analyzer.py    # Topic extraction utilities
│   │   ├── english_topic_output.py      # Topic visualization and output
│   │   └── english_nmf_core.py          # NMF implementation for English
│   ├── nmf/                      # NMF algorithm implementations
│   │   ├── nmf_orchestrator.py          # Main NMF interface
│   │   ├── nmf_initialization.py        # Matrix initialization strategies
│   │   ├── nmf_basic.py                 # Standard NMF algorithm
│   │   ├── nmf_projective_basic.py      # Basic projective NMF
│   │   ├── nmf_projective_enhanced.py   # Enhanced projective NMF
│   │   └── nmtf/                        # Non-negative Matrix Tri-Factorization
│   │       ├── nmtf.py                  # NMTF implementation with topic relationships
│   │       ├── nmtf_init.py             # NMTF initialization utilities
│   │       ├── nmtf_util.py             # NMTF helper functions
│   │       ├── extract_nmtf_topics.py   # Topic extraction for NMTF results
│   │       └── example_usage.py         # NMTF usage examples
│   ├── tfidf/                    # TF-IDF calculation modules
│   │   ├── tfidf_english_calculator.py  # English TF-IDF implementation
│   │   ├── tfidf_turkish_calculator.py  # Turkish TF-IDF implementation
│   │   ├── tfidf_tf_functions.py        # Term frequency functions
│   │   ├── tfidf_idf_functions.py       # Inverse document frequency functions
│   │   └── tfidf_bm25_turkish.py        # BM25 implementation for Turkish
│   └── turkish/                  # Turkish text processing modules
│       ├── turkish_entry.py             # Turkish text processing entry point
│       ├── turkish_preprocessor.py      # Turkish text cleaning
│       ├── turkish_tokenizer_factory.py # Tokenizer creation and training
│       ├── turkish_text_encoder.py      # Text-to-numerical conversion
│       └── turkish_tfidf_generator.py   # TF-IDF matrix generation
├── utils/                        # Helper utilities (organized into sub-modules)
│   ├── analysis/                       # Analysis utilities
│   │   ├── coherence_score.py              # Topic coherence evaluation
│   │   ├── distance_two_words.py           # Word distance calculation
│   │   ├── umass_test.py                   # UMass coherence testing
│   │   ├── word_cooccurrence.py            # Word co-occurrence analysis
│   │   └── word_cooccurrence_analyzer.py   # Advanced word co-occurrence analysis
│   ├── console/                        # Console management
│   │   └── console_manager.py              # Console and logging management utilities
│   ├── database/                       # Database utilities
│   │   ├── database_manager.py             # Database connection and management utilities
│   │   └── save_topics_db.py               # Topic database saving utilities
│   ├── export/                         # Export functionality
│   │   ├── export_excel.py                 # Excel export functionality
│   │   ├── json_to_excel.py                # JSON to Excel conversion utilities
│   │   ├── save_doc_score_pair.py          # Document-score pair saving utilities
│   │   └── save_word_score_pair.py         # Word-score pair saving utilities
│   ├── preprocess/                     # Preprocessing utilities
│   │   └── combine_number_suffix.py         # Number and suffix combination utilities
│   ├── visualization/                  # Visualization utilities
│   │   ├── gen_cloud.py                    # Word cloud generation
│   │   ├── image_to_base.py                # Image to base64 conversion
│   │   ├── topic_dist.py                   # Topic distribution plotting
│   │   └── visualizer.py                   # General visualization utilities
│   └── agent/                          # AI assistant utilities
│       ├── claude_prompt_generator.py       # Claude AI prompt generation utilities
│       └── claude_prompt_generator.html     # HTML interface for prompt generation
├── cli.py                        # Command-line interface
├── standalone_nmf.py             # Core NMF implementation
└── __init__.py                   # Package initialization and public API

Installation

From PyPI (Recommended)

pip install manta-topic-modelling

From Source (Development)

Clone the repository:

git clone https://github.com/emirkyz/manta.git
cd manta

Create a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Usage

Command Line Interface

The package provides the manta-topic-modelling command with an analyze subcommand:

# Basic usage
manta-topic-modelling analyze data.csv --column text --language TR --topics 5

# Advanced usage with several options combined
manta-topic-modelling analyze reviews.csv \
  --column review_text \
  --language EN \
  --topics 10 \
  --words-per-topic 20 \
  --nmf-method pnmf \
  --lemmatize \
  --wordclouds \
  --excel \
  --topic-distribution \
  --output-name my_analysis

Command Line Options

Required Arguments:

filepath: Path to input CSV or Excel file
--column, -c: Name of column containing text data
--language, -l: Language ("TR" for Turkish, "EN" for English)

Optional Arguments:

--topics, -t: Number of topics to extract (default: 5)
--output-name, -o: Custom name for output files (default: auto-generated)
--tokenizer: Tokenizer type for Turkish ("bpe" or "wordpiece", default: "bpe")
--nmf-method: Factorization algorithm ("nmf", "pnmf", or "nmtf", default: "nmf")
--words-per-topic: Number of top words per topic (default: 15)
--lemmatize: Apply lemmatization for English text
--emoji-map: Enable emoji processing and mapping (default: True). Use --emoji-map False to disable
--wordclouds: Generate word cloud visualizations
--excel: Export results to Excel format
--topic-distribution: Generate topic distribution plots
--separator: CSV separator character (default: "|")
--filter-app: Filter data by specific app name
--filter-app-column: Column name for app filtering (default: "PACKAGE_NAME")
--filter-country: Filter data by country code (e.g., TR, US, GB)
--filter-country-column: Column name for country filtering (default: "COUNTRY")
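When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below builds an argv list from the flags documented above; `build_analyze_command` is a hypothetical helper written for this example, and it only formats arguments without checking them against the tool.

```python
import shlex
from typing import List, Optional


def build_analyze_command(
    filepath: str,
    column: str,
    language: str,
    topics: int = 5,
    nmf_method: Optional[str] = None,
    flags: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an argv list for `manta-topic-modelling analyze`."""
    command = [
        "manta-topic-modelling", "analyze", filepath,
        "--column", column,
        "--language", language,
        "--topics", str(topics),
    ]
    if nmf_method is not None:
        command += ["--nmf-method", nmf_method]
    # Boolean switches such as "wordclouds" or "excel" become "--wordclouds", "--excel".
    for flag in flags or []:
        command.append(f"--{flag}")
    return command


argv = build_analyze_command(
    "reviews.csv", "review_text", "EN", topics=10,
    nmf_method="pnmf", flags=["lemmatize", "wordclouds"],
)
print(shlex.join(argv))
# To execute it: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` rather than a single string avoids shell quoting problems with file names that contain spaces.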

Python API

from manta import run_topic_analysis

# Basic English text analysis
results = run_topic_analysis(
    filepath="data.csv",
    column="review_text",
    language="EN",
    topic_count=5,
    lemmatize=True,
    generate_wordclouds=True,
    export_excel=True
)

# Advanced Turkish text analysis with filtering
results = run_topic_analysis(
    filepath="turkish_reviews.csv",
    column="yorum_metni",
    language="TR",
    topic_count=10,
    words_per_topic=15,
    tokenizer_type="bpe",
    nmf_method="nmf",
    generate_wordclouds=True,
    export_excel=True,
    topic_distribution=True,
    filter_app=True,
    data_filter_options={
        "filter_app_name": "MyApp",
        "filter_app_column": "APP_NAME", 
        "filter_app_country": "TR",
        "filter_app_country_column": "COUNTRY_CODE"
    }
)

API Parameters

Required:

filepath (str): Path to input CSV or Excel file
column (str): Name of column containing text data

Optional:

separator (str): CSV separator character (default: ",")
language (str): "TR" for Turkish, "EN" for English (default: "EN")
topic_count (int): Number of topics to extract (default: 5)
nmf_method (str): "nmf", "pnmf", or "nmtf" algorithm variant (default: "nmf")
lemmatize (bool): Apply lemmatization for English (default: False)
tokenizer_type (str): "bpe" or "wordpiece" for Turkish (default: "bpe")
words_per_topic (int): Top words to show per topic (default: 15)
word_pairs_out (bool): Create word pairs output (default: True)
generate_wordclouds (bool): Create word cloud visualizations (default: True)
export_excel (bool): Export results to Excel (default: True)
topic_distribution (bool): Generate distribution plots (default: True)
filter_app (bool): Enable app filtering (default: False)
data_filter_options (dict): Advanced filtering options with the following keys (empty string unless a default is listed):
- filter_app_name (str): App name for filtering
- filter_app_column (str): Column name for app filtering (default: "PACKAGE_NAME")
- filter_app_country (str): Country code for filtering (case-insensitive)
- filter_app_country_column (str): Column name for country filtering (default: "COUNTRY")
emoji_map (bool): Enable emoji processing and mapping (default: False)
output_name (str): Custom output directory name (default: auto-generated)
save_to_db (bool): Whether to persist data to database (default: False)
output_dir (str): Base directory for outputs (default: current working directory)
n_grams_to_discover (int): Number of n-grams to discover via BPE for English text (default: None, disabled)
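Because several parameters accept only a fixed set of strings, a small pre-flight check can catch typos before a long run starts. This is a sketch based on the choices listed above; `check_options` is a hypothetical helper, and the rule that `topic_count` must be at least 1 is an assumption rather than documented behavior.

```python
from typing import Any, Dict

# Allowed values as listed in the parameter documentation above.
ALLOWED = {
    "language": {"TR", "EN"},
    "nmf_method": {"nmf", "pnmf", "nmtf"},
    "tokenizer_type": {"bpe", "wordpiece"},
}


def check_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError for values outside the documented choices; return options unchanged."""
    for name in ("filepath", "column"):
        if not options.get(name):
            raise ValueError(f"missing required parameter: {name}")
    for name, allowed in ALLOWED.items():
        if name in options and options[name] not in allowed:
            raise ValueError(f"{name}={options[name]!r} is not one of {sorted(allowed)}")
    # Assumption: a topic count below 1 is never meaningful.
    if int(options.get("topic_count", 5)) < 1:
        raise ValueError("topic_count must be at least 1")
    return options


options = check_options({
    "filepath": "data.csv",
    "column": "text",
    "language": "TR",
    "nmf_method": "nmtf",
    "topic_count": 6,
})
# With MANTA installed: results = run_topic_analysis(**options)
```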

Outputs

The analysis generates several outputs in an Output/ directory (created at runtime), organized in a subdirectory named after your analysis:

Topic-Word Excel File: .xlsx file containing top words for each topic and their scores
Word Clouds: PNG images of word clouds for each topic (if generate_wordclouds=True)
Topic Distribution Plot: Plot showing distribution of documents across topics (if topic_distribution=True)
Coherence Scores: JSON file with coherence scores for the topics
Top Documents: JSON file listing most representative documents for each topic
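To see what a run produced, the output directory can be scanned by file type. The sketch below groups files by extension only, since exact file names are not specified here; `collect_outputs` and the `my_analysis` directory name are placeholders for this example.

```python
from pathlib import Path
from typing import Dict, List


def collect_outputs(analysis_dir: Path) -> Dict[str, List[str]]:
    """Group files under an analysis directory by type, based on file extension."""
    groups = {"excel": "*.xlsx", "images": "*.png", "json": "*.json"}
    return {
        label: sorted(
            str(path.relative_to(analysis_dir)) for path in analysis_dir.rglob(pattern)
        )
        for label, pattern in groups.items()
    }


if __name__ == "__main__":
    # "my_analysis" is a placeholder for the output_name used in the run.
    target = Path("Output") / "my_analysis"
    if target.is_dir():
        for label, files in collect_outputs(target).items():
            print(label, files)
    else:
        print(f"{target} does not exist yet; run an analysis first")
```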

Features

Multi-language Support: Optimized processing for both Turkish and English texts
Advanced Tokenization: BPE and WordPiece tokenizers for Turkish, traditional tokenization for English
Multiple Factorization Algorithms: Standard NMF, Orthogonal Projective NMF (PNMF), and Non-negative Matrix Tri-Factorization (NMTF)
Advanced NMF Variants: Hierarchical NMF, Online NMF, and Symmetric NMF implementations
N-gram Discovery: Automatic discovery of meaningful word combinations using BPE for English text
Rich Visualizations: Word clouds and topic distribution plots
Flexible Export: Excel and JSON export formats with organized export utilities
Coherence Evaluation: Built-in topic coherence scoring and advanced analysis tools
Database Management: Comprehensive SQLite database integration with dedicated management utilities
Modular Architecture: Organized utility modules for analysis, visualization, export, and preprocessing
Text Preprocessing: Language-specific text cleaning and preprocessing
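For readers new to the technique, standard NMF approximates a non-negative document-term matrix V as the product of two smaller non-negative matrices, W (documents by topics) and H (topics by terms). The sketch below uses the classic multiplicative update rules for the Frobenius objective in plain Python on a toy matrix. It illustrates the general idea only and is not MANTA's solver; all names in it are made up for this example.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Frobenius norm of v - w @ h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v: Matrix, rank: int, iterations: int = 200, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Factor non-negative v (docs x terms) into w (docs x rank) and h (rank x terms)."""
    rng = random.Random(seed)
    eps = 1e-9  # guards against division by zero
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(len(w))]
    return w, h


# Toy matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
v = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
]
w, h = nmf(v, rank=2)
print(f"reconstruction error: {frobenius_error(v, w, h):.3f}")
```

Each row of H can be read as a topic (weights over terms), and each row of W as a document's mix of topics. The updates only multiply by non-negative ratios, so both factors stay non-negative throughout.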

N-gram Discovery

MANTA supports automatic n-gram discovery using BPE (Byte Pair Encoding) for English text. This feature identifies frequently occurring word combinations and adds them to the vocabulary as new tokens.

To enable n-gram discovery, use the n_grams_to_discover parameter:

results = run_topic_analysis(
    filepath="data.csv",
    column="text",
    language="EN",
    n_grams_to_discover=200  # Discover 200 word combinations
)

This can improve topic quality by capturing meaningful phrases like "machine_learning" or "climate_change" as single tokens.
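The idea behind BPE-style phrase discovery can be illustrated at the word level: count adjacent word pairs, join the most frequent pair into a single token, and repeat. The sketch below is a simplified illustration under that assumption and is not MANTA's implementation; `discover_ngrams` and its stopping rule are made up for this example.

```python
from collections import Counter
from typing import List


def discover_ngrams(documents: List[List[str]], merges: int) -> List[List[str]]:
    """Repeatedly join the most frequent adjacent token pair into one token."""
    docs = [list(doc) for doc in documents]
    for _ in range(merges):
        pairs = Counter(
            (doc[i], doc[i + 1]) for doc in docs for i in range(len(doc) - 1)
        )
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would be noise
        merged_docs = []
        for doc in docs:
            out, i = [], 0
            while i < len(doc):
                if i + 1 < len(doc) and doc[i] == left and doc[i + 1] == right:
                    out.append(f"{left}_{right}")  # join the pair into one token
                    i += 2
                else:
                    out.append(doc[i])
                    i += 1
            merged_docs.append(out)
        docs = merged_docs
    return docs


corpus = [
    "machine learning improves topic models".split(),
    "we study machine learning daily".split(),
    "climate change and machine learning".split(),
]
print(discover_ngrams(corpus, merges=1))
```

In this toy corpus only "machine learning" repeats, so it is the only pair merged. "climate change" occurs once and stays as two tokens.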

Requirements

Python 3.9+
Dependencies are automatically installed with the package

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and questions, please open an issue on the GitHub repository (https://github.com/emirkyz/manta).



