A standalone NMF topic modeling tool for Turkish and English texts

NMF Standalone

A comprehensive topic modeling library for Turkish and English texts using Non-negative Matrix Factorization (NMF).

Features

  • Multi-language Support: Native support for Turkish and English text processing
  • Advanced NMF Algorithms: Standard NMF and Orthogonal Projective NMF (OPNMF) variants
  • Modern Tokenization: BPE and WordPiece tokenizers for Turkish, traditional preprocessing for English
  • Comprehensive Preprocessing: Language-specific text cleaning, emoji processing, and normalization
  • Rich Visualizations: Word clouds, topic distribution plots, and co-occurrence heatmaps
  • Multiple Export Formats: Excel reports, JSON results, and database storage
  • Coherence Evaluation: Built-in topic coherence scoring for model evaluation
  • CLI and Python API: Both command-line interface and programmatic access
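The feature list does not say which coherence metric the library computes. As an illustration of what a coherence score measures, here is a minimal UMass-style sketch in plain Python. The function name `umass_coherence` and the toy corpus are hypothetical and not part of the package; the score is averaged over word pairs, which is one common variant.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic's ranked top words.

    documents is a list of token lists. Scores closer to 0 mean the
    top words co-occur in the same documents more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    pairs = 0
    # For each pair, condition on the higher-ranked (earlier) word.
    for i, j in combinations(range(len(top_words)), 2):
        w_i, w_j = top_words[i], top_words[j]
        d_i = doc_freq(w_i)
        if d_i == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(w_i, w_j) + 1) / d_i)
        pairs += 1
    return score / pairs if pairs else 0.0

# Hypothetical tokenized corpus.
docs = [
    ["app", "crash", "login"],
    ["app", "crash"],
    ["battery", "drain"],
    ["app", "login"],
]
coherent = umass_coherence(["app", "crash"], docs)
incoherent = umass_coherence(["app", "battery"], docs)
```

Words that appear together ("app", "crash") score higher than words that never share a document ("app", "battery").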

Installation

pip install nmf-standalone

For visualization features:

# Quote the extra so shells such as zsh do not expand the brackets
pip install "nmf-standalone[visualization]"

Quick Start

Command Line Interface

# Analyze Turkish app reviews with 5 topics
nmf-standalone analyze reviews.csv --column REVIEW --language TR --topics 5 --wordclouds

# Analyze English documents with lemmatization
nmf-standalone analyze docs.xlsx --column text --language EN --topics 10 --lemmatize --excel

# Use BPE tokenizer for Turkish text
nmf-standalone analyze data.csv --column content --language TR --tokenizer bpe --topics 7

Python API

from nmf_standalone import run_topic_analysis

# Simple analysis
result = run_topic_analysis(
    "data.csv",
    column="text_column", 
    language="TR",
    topics=5,
    generate_wordclouds=True
)

# Access results
topics = result['topic_word_scores']
for topic_id, words in topics.items():
    print(f"Topic {topic_id}: {', '.join([word for word, score in words[:5]])}")

# Advanced configuration
result = run_topic_analysis(
    "reviews.csv",
    column="review_text",
    language="TR", 
    topics=7,
    nmf_method="opnmf",  # Use Orthogonal Projective NMF (OPNMF)
    tokenizer_type="wordpiece",
    words_per_topic=20,
    export_excel=True,
    topic_distribution=True
)
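The loop above suggests that `result['topic_word_scores']` maps each topic id to a list of (word, score) pairs. Assuming that shape, a small helper can rank and trim the words per topic. The helper `top_words_table` and the mock data are hypothetical; they only mimic the structure shown above and do not call the library.

```python
def top_words_table(topic_word_scores, n=5):
    """Return {topic_id: [word, ...]} with the n highest-scoring words."""
    table = {}
    for topic_id, words in topic_word_scores.items():
        # Sort explicitly in case the pairs are not already ranked.
        ranked = sorted(words, key=lambda pair: pair[1], reverse=True)
        table[topic_id] = [word for word, _score in ranked[:n]]
    return table

# Hypothetical data shaped like result['topic_word_scores'].
mock_scores = {
    0: [("kargo", 0.42), ("teslimat", 0.31), ("gecikme", 0.12)],
    1: [("uygulama", 0.55), ("güncelleme", 0.20), ("hata", 0.35)],
}
for topic_id, words in top_words_table(mock_scores, n=2).items():
    print(f"Topic {topic_id}: {', '.join(words)}")
```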

Supported File Formats

  • CSV files: Automatic delimiter detection, UTF-8 encoding
  • Excel files: .xlsx and .xls formats
  • Text columns: Any column containing text data for analysis
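How the library detects delimiters is not documented here. For checking a file before analysis, the standard library's `csv.Sniffer` offers a comparable guess. The sketch below is an assumption-laden stand-in, not the package's own logic; `detect_delimiter` and the sample text are hypothetical.

```python
import csv
import io

def detect_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of a CSV sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","

# Hypothetical semicolon-delimited sample with a text column.
sample = "id;text\n1;hello world\n2;good app\n"
delimiter = detect_delimiter(sample)
rows = list(csv.DictReader(io.StringIO(sample), delimiter=delimiter))
texts = [row["text"] for row in rows]
```

For a real file, read the first few kilobytes with `encoding="utf-8"` and pass that string as the sample.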

Language Support

Turkish

  • Preprocessing: Advanced text cleaning, Turkish-specific normalization
  • Tokenization: BPE (Byte-Pair Encoding) and WordPiece tokenizers
  • Emoji Processing: Intelligent emoji-to-text conversion
  • TF-IDF: BM25 and traditional TF-IDF with Turkish language adaptations
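To make the BM25 option concrete, here is a minimal sketch of Okapi BM25 term weighting over tokenized documents. It is not the library's implementation, and it omits any Turkish-specific adaptation; the function name `bm25_weights` and the parameter defaults `k1=1.5`, `b=0.75` are assumptions chosen as common textbook values.

```python
import math
from collections import Counter

def bm25_weights(documents, k1=1.5, b=0.75):
    """Return one {term: weight} dict per tokenized document (Okapi BM25)."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    weighted = []
    for doc in documents:
        counts = Counter(doc)
        # Longer-than-average documents are penalized through b.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        weights = {}
        for term, tf in counts.items():
            # The +1 inside the log keeps every weight non-negative,
            # which a non-negative factorization needs.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            weights[term] = idf * tf * (k1 + 1) / (tf + length_norm)
        weighted.append(weights)
    return weighted

# Hypothetical tokenized Turkish reviews.
docs = [["kargo", "geç", "geldi"], ["kargo", "hızlı"], ["uygulama", "çöktü"]]
weights = bm25_weights(docs)
```

Within the first document, the rarer term "geç" receives a larger weight than "kargo", which also occurs in the second document.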

English

  • Preprocessing: Standard NLP preprocessing with optional lemmatization
  • Tokenization: Traditional word-based tokenization
  • Lemmatization: NLTK-based lemmatization support
  • TF-IDF: Classical TF-IDF with multiple weighting schemes

Algorithm Options

NMF Methods

  • Standard NMF (nmf): Classical non-negative matrix factorization
  • Orthogonal Projective NMF (opnmf): Variant that adds an orthogonality constraint to the factorization, which tends to produce topics with less overlap
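For readers new to NMF, the standard variant factors a non-negative document-term matrix X into document-topic weights W and topic-term weights H. The sketch below uses the classic multiplicative update rules in NumPy. It illustrates the standard method only (not OPNMF) and is not the library's code; `nmf_multiplicative` and the toy matrix are hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, n_topics, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative matrix X (docs x terms) as W @ H.

    W (docs x topics) holds document-topic weights and
    H (topics x terms) holds topic-term weights.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_topics))
    H = rng.random((n_topics, n_terms))
    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius error;
        # they keep W and H non-negative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix with two obvious word groups.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf_multiplicative(X, n_topics=2)
relative_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Each row of H can be read as a topic: the terms with the largest entries are that topic's top words.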

Tokenization (Turkish)

  • BPE (bpe): Byte-Pair Encoding for subword tokenization
  • WordPiece (wordpiece): Google's WordPiece algorithm
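Subword tokenizers suit Turkish because suffixes attach to a shared stem. The toy sketch below shows the core of BPE training: repeatedly merge the most frequent adjacent symbol pair. It is a teaching example in plain Python, not the tokenizer the library ships; `learn_bpe_merges` and the word list are hypothetical.

```python
from collections import Counter

def learn_bpe_merges(words, n_merges):
    """Learn up to n_merges BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Inflected forms of "ev" (house) share the stem.
merges, vocab = learn_bpe_merges(["evler", "evde", "evden", "eve"], n_merges=1)
```

After one merge, the shared stem "ev" becomes a single symbol in every word, while the suffixes remain split into characters.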

Output Formats

Generated Files

  • Word Clouds: PNG images for each topic showing prominent words
  • Excel Reports: Detailed topic-word matrices with scores
  • Topic Distribution: Plots showing document-topic relationships
  • JSON Results: Machine-readable topic and document data
  • Database Storage: SQLite databases for persistent storage
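The database schema is not described here, so a reasonable first step is to inspect whatever tables a generated SQLite file contains. This sketch uses only the standard library's `sqlite3` module; `list_tables` is a hypothetical helper, and the path you pass depends on where your run wrote its database.

```python
import sqlite3
from contextlib import closing

def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    # closing() ensures the connection is released after the query.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Once the table names are known, `pandas.read_sql_query` or plain `SELECT` statements can load their contents.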

Directory Structure

Output/
└── {dataset_name}/
    ├── {dataset}_topics.xlsx           # Excel report
    ├── {dataset}_coherence_scores.json # Model evaluation
    ├── {dataset}_document_dist.png     # Topic distribution
    ├── top_docs_{dataset}.json         # Top documents per topic
    └── wordclouds/                     # Word cloud images
        ├── Topic_00.png
        ├── Topic_01.png
        └── ...
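The layout above can be turned into a small path helper for post-processing scripts. This sketch assumes that `{dataset_name}` and `{dataset}` refer to the same value and that the root folder is named `Output`, as the tree suggests; `expected_outputs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path

def expected_outputs(dataset, root="Output"):
    """Map a label to each path shown in the tree above for one dataset."""
    base = Path(root) / dataset
    return {
        "excel": base / f"{dataset}_topics.xlsx",
        "coherence": base / f"{dataset}_coherence_scores.json",
        "distribution": base / f"{dataset}_document_dist.png",
        "top_docs": base / f"top_docs_{dataset}.json",
        "wordclouds": base / "wordclouds",
    }

paths = expected_outputs("reviews")
# Labels whose file or folder has not been generated (yet).
missing = [label for label, path in paths.items() if not path.exists()]
```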

Configuration Options

Parameter            Type  Default  Description
topics               int   5        Number of topics to extract
words_per_topic      int   15       Top words to display per topic
language             str   "TR"     Language code (TR/EN)
nmf_method           str   "nmf"    Algorithm variant (nmf/opnmf)
tokenizer_type       str   "bpe"    Tokenizer for Turkish (bpe/wordpiece)
lemmatize            bool  False    Apply lemmatization (English only)
generate_wordclouds  bool  True     Create word cloud visualizations
export_excel         bool  True     Export Excel reports
topic_distribution   bool  True     Generate distribution plots
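The documented defaults and allowed values can be captured in a small validation helper that catches typos before a long run starts. This is a sketch built only from the table above; `build_config`, `DEFAULTS`, and `CHOICES` are hypothetical names, and the library may perform its own validation.

```python
DEFAULTS = {
    "topics": 5,
    "words_per_topic": 15,
    "language": "TR",
    "nmf_method": "nmf",
    "tokenizer_type": "bpe",
    "lemmatize": False,
    "generate_wordclouds": True,
    "export_excel": True,
    "topic_distribution": True,
}
CHOICES = {
    "language": {"TR", "EN"},
    "nmf_method": {"nmf", "opnmf"},
    "tokenizer_type": {"bpe", "wordpiece"},
}

def build_config(**overrides):
    """Merge overrides into the documented defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown options: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    for key, allowed in CHOICES.items():
        if config[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    for key in ("topics", "words_per_topic"):
        if not isinstance(config[key], int) or config[key] < 1:
            raise ValueError(f"{key} must be a positive integer")
    return config

config = build_config(language="EN", topics=10, lemmatize=True)
```

Assuming `run_topic_analysis` accepts these names as keyword arguments, as the examples in this README suggest, the result could be passed on with `**config`.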

Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.24.0
  • Pandas ≥ 2.0.0
  • scikit-learn ≥ 1.3.0
  • NLTK ≥ 3.8.0
  • gensim ≥ 4.3.0
  • tokenizers ≥ 0.19.0

Examples

Turkish App Review Analysis

from nmf_standalone import run_topic_analysis

result = run_topic_analysis(
    "app_reviews.csv",
    column="review_text",
    language="TR",
    topics=8,
    tokenizer_type="bpe",
    generate_wordclouds=True,
    export_excel=True
)

English Document Topic Analysis

result = run_topic_analysis(
    "documents.xlsx", 
    column="content",
    language="EN",
    topics=12,
    lemmatize=True,
    nmf_method="opnmf",
    words_per_topic=25
)

Medical Text Analysis

result = run_topic_analysis(
    "medical_notes.csv",
    column="impression", 
    language="EN",
    topics=15,
    lemmatize=True,
    generate_wordclouds=True,
    topic_distribution=True
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use this library in your research, please cite:

@software{nmf_standalone,
  author = {Emir Karayagiz},
  title = {NMF Standalone: Topic Modeling for Turkish and English Texts},
  url = {https://github.com/emirkarayagiz/nmf-standalone},
  version = {0.1.0},
  year = {2024}
}
