A Python library for topic modeling, clustering, and NLP analysis.

TACTIK

Text Analysis, Clustering, Tuning, Information and Keyword Extraction

TACTIK started as a side project to streamline the clustering of aviation-related reports. The initial pipeline suffered from long processing times and became a bottleneck for analysis. After resolving these issues, we added functionality for intuitive topic extraction and adapted the pipeline to work domain-agnostically while keeping the core use case in mind. We then decided to release the package publicly so that other researchers can contribute to it, build on it, or benefit from the included tools. Thank you for using TACTIK. We hope you find it as useful as we did in our research!

Features

  • End-to-End Clustering Pipeline: Automated workflow from preprocessing to cluster analysis
  • Modular Design: Use the different components as standalone modules or full pipelines
  • Layered Methods: UMAP dimensionality reduction followed by HDBSCAN clustering
  • Hyperparameter Tuning: Automated parameter optimization using random search
  • Keyword Extraction: Multiple methods including TF, TF-IDF, TF-DF, and YAKE
  • Topic Modeling: LDA-based topic discovery with BERT-powered semantic matching (still in development)
  • Rich Visualizations: t-SNE plots with customizable styling and annotations
  • Memory Efficient: Optimized for large datasets with lazy evaluation and caching

Installation

# Install tactik
pip install tactik

# Or install from source
git clone https://github.com/npsAub/tactik.git
cd tactik
pip install -e .

Dependencies

# Core dependencies
pip install pandas numpy matplotlib seaborn scikit-learn
pip install umap-learn hdbscan gensim nltk yake
pip install transformers torch

Core Components

1. ClusteringPipeline

Main orchestrator class that coordinates the entire analysis workflow.

Key Methods:

  • preprocess_data() - Text cleaning and stopword removal
  • cluster_data() - Clustering with fixed parameters
  • tune_and_cluster() - Clustering with hyperparameter tuning
  • cluster_and_extract_keywords() - Integrated clustering + keyword extraction
  • cluster_and_analyze_topics() - Integrated clustering + topic modeling
  • visualize_clusters() - Create cluster visualizations
  • get_cluster_summary() - Generate cluster statistics

2. Clustering & Tuning

Low-level clustering functions with hyperparameter optimization.

Key Functions:

  • tune_clustering_hyperparameters() - Random search optimization
  • apply_best_clustering() - Apply optimized parameters
  • full_clustering_pipeline() - Complete pipeline with tuning
  • full_clustering_pipeline_fixed_params() - Pipeline with fixed parameters
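
As an illustration of the random search idea behind the tuning functions above, here is a minimal sketch in plain Python. It is not TACTIK's implementation: `random_search`, `param_space`, and `toy_score` are hypothetical names, and the toy score merely stands in for "fit UMAP and HDBSCAN with these settings, then compute a clustering metric".

```python
import random


def random_search(param_space, score_fn, n_iter=20, seed=0, lower_is_better=True):
    """Sample n_iter parameter sets from param_space and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, None
    for _ in range(n_iter):
        # Draw one candidate value per parameter.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if best_score is None:
            improved = True
        elif lower_is_better:
            improved = score < best_score
        else:
            improved = score > best_score
        if improved:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space over UMAP and HDBSCAN settings.
param_space = {
    "n_neighbors": [5, 15, 30, 50],
    "n_components": [2, 5, 10],
    "min_cluster_size": [5, 10, 20, 40],
}


def toy_score(params):
    # Placeholder objective (assumption): a real run would cluster the data
    # with these parameters and return, for example, a Davies-Bouldin score.
    return abs(params["n_neighbors"] - 15) + abs(params["min_cluster_size"] - 10)


best_params, best_score = random_search(param_space, toy_score, n_iter=50, seed=42)
print(best_params, best_score)
```

Fixing the seed makes a search reproducible, which helps when comparing runs that use different metrics.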

Supported Metrics:

  • davies_bouldin: Lower is better (average similarity between each cluster and its most similar neighbor)
  • calinski_harabasz: Higher is better (ratio of between/within cluster dispersion)

3. Keyword Extraction

Extract representative keywords from each cluster using multiple methods.

KeywordExtractor Class:

  • extract_keywords_per_cluster() - Extract keywords using multiple methods
  • save_keywords() - Save results to CSV

Extraction Methods:

  • TF: Term Frequency
  • TF-IDF: Term Frequency–Inverse Document Frequency
  • TF-DF: Term Frequency–Document Frequency
  • YAKE: Yet Another Keyword Extractor (long and short narratives)
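
The frequency-based methods can be sketched in a few lines of plain Python. This is an illustrative approximation, not the `KeywordExtractor` code: the function name `keywords_per_cluster` is hypothetical, tokenization is a simple whitespace split, and the TF-IDF variant treats each cluster as one merged document, which is an assumption about how cluster-level scores might be computed.

```python
import math
from collections import Counter, defaultdict


def keywords_per_cluster(docs, labels, top_n=3):
    """Rank terms per cluster by term frequency and by a TF-IDF variant.

    Each cluster is treated as one merged document, so the inverse
    document frequency is computed across clusters (an assumption).
    """
    tf = defaultdict(Counter)
    for text, label in zip(docs, labels):
        tf[label].update(text.lower().split())

    n_clusters = len(tf)
    # Number of clusters in which each term occurs at least once.
    cluster_freq = Counter(term for counts in tf.values() for term in counts)

    result = {}
    for label, counts in tf.items():
        tfidf = {
            term: count * math.log(n_clusters / cluster_freq[term])
            for term, count in counts.items()
        }
        result[label] = {
            "TF": [term for term, _ in counts.most_common(top_n)],
            # Sort by descending score, breaking ties alphabetically.
            "TFIDF": sorted(tfidf, key=lambda t: (-tfidf[t], t))[:top_n],
        }
    return result


docs = [
    "engine failure during climb",
    "engine fire after takeoff",
    "runway incursion during taxi",
    "runway closed during landing",
]
labels = [0, 0, 1, 1]
keywords = keywords_per_cluster(docs, labels)
print(keywords)
```

In this toy input, "during" occurs in both clusters, so its TF-IDF weight is zero and it drops out of the TF-IDF ranking even though it is frequent.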

4. Topic Modeling

Discover latent topics using LDA and match them to predefined designators.

TopicModeler Class:

  • train_lda() - Train Latent Dirichlet Allocation model
  • get_cluster_topics() - Get top topics per cluster
  • match_designators_to_topics() - Match topics to designators using BERT embeddings
  • get_bert_embedding() - Compute BERT embeddings for semantic matching
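
The matching step can be pictured as a nearest-neighbor lookup in embedding space. The sketch below assumes cosine similarity as the comparison, which the source does not state, and uses hand-written 3-dimensional vectors in place of real BERT embeddings. `cosine_similarity` and `match_topics` are hypothetical helpers, not TopicModeler methods.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_topics(topic_vectors, designator_vectors):
    """For each topic, return the designator with the highest cosine similarity."""
    matches = {}
    for topic, t_vec in topic_vectors.items():
        matches[topic] = max(
            designator_vectors,
            key=lambda name: cosine_similarity(t_vec, designator_vectors[name]),
        )
    return matches


# Toy vectors standing in for BERT embeddings (assumption).
designators = {
    "Poor communication": [0.9, 0.1, 0.0],
    "Technical or system failures": [0.0, 0.2, 0.9],
    "Environmental factors": [0.1, 0.9, 0.1],
}
topics = {
    "topic_0": [0.8, 0.2, 0.1],
    "topic_1": [0.1, 0.1, 0.95],
}
matches = match_topics(topics, designators)
print(matches)
```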

Default Aviation Safety Designators:

  • Inadequate or inaccurate knowledge
  • Poor judgment and decision-making
  • Failure to follow procedures
  • Poor communication
  • Inadequate monitoring or vigilance
  • Task management and prioritization
  • Stress and psychological factors
  • Physical or physiological factors
  • Technical or system failures
  • Environmental factors

5. Visualization

Create publication-ready visualizations of clustering results.

Visualization Functions:

  • plot_clusters() - Basic cluster scatter plot
  • plot_clusters_with_annotations() - Plot with category annotations
  • plot_cluster_comparison() - Side-by-side comparison plots
  • set_visualization_style() - Configure plot styling
  • get_cluster_palette() - Generate color palettes
  • get_cluster_markers() - Generate marker styles

Pipeline Architecture

Input Data (DataFrame)
    ↓
Preprocessing
    ├── Text cleaning
    ├── Stopword removal
    └── Tokenization
    ↓
Vectorization (TF-IDF)
    ↓
Dimensionality Reduction (UMAP)
    ↓
Clustering (HDBSCAN)
    ↓
Visualization (t-SNE)
    ↓
Analysis
    ├── Keyword Extraction
    └── Topic Modeling (LDA + BERT)
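
The preprocessing stage at the top of the diagram can be sketched with the standard library alone. This is a simplified stand-in rather than what `preprocess_data()` does internally: the `preprocess` function, the tiny `STOPWORDS` set, and the letters-only cleaning rule are assumptions, and the sketch tokenizes before removing stopwords for brevity.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would likely use
# a fuller list such as the one shipped with NLTK.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "was", "in", "on", "during"}


def preprocess(text, stopwords=STOPWORDS):
    """Clean a text, split it into tokens, and drop stopwords."""
    # Text cleaning: lowercase, then replace anything that is not a
    # letter or whitespace with a space.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization: split on whitespace.
    tokens = cleaned.split()
    # Stopword removal.
    return [token for token in tokens if token not in stopwords]


tokens = preprocess("The pilot reported an ENGINE failure during climb-out!")
print(tokens)
```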

Performance Considerations

Memory Optimization

  • DataFrame lazy copying
  • Vectorization caching
  • BERT embedding cache with clear_cache() method
  • Incremental topic probability calculations
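
The embedding cache idea can be illustrated with a small memoizing wrapper. Only the `clear_cache()` method name comes from the text above. The class `EmbeddingCache`, its `get` method, and the `fake_embed` function are hypothetical, and the real cache may be organized differently.

```python
class EmbeddingCache:
    """Memoize an expensive embedding function, with explicit cache clearing."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}

    def get(self, text):
        # Compute the embedding only the first time a text is seen.
        if text not in self._cache:
            self._cache[text] = self._embed_fn(text)
        return self._cache[text]

    def clear_cache(self):
        # Release all stored embeddings to free memory.
        self._cache.clear()

    def __len__(self):
        return len(self._cache)


calls = []


def fake_embed(text):
    # Placeholder for a BERT forward pass (assumption); records each call.
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("engine failure")
cache.get("engine failure")  # served from the cache, no second computation
cache.clear_cache()
cache.get("engine failure")  # recomputed after clearing
print(len(calls), len(cache))
```

Clearing trades recomputation for memory, which is why periodic clearing mainly matters on large datasets.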

GPU Acceleration

GPU acceleration is available for BERT computations when initializing TopicModeler with use_gpu=True.

Large Datasets

For large datasets, consider:

  • Disabling t-SNE computation with compute_tsne=False
  • Using fixed parameters instead of hyperparameter tuning
  • Clearing BERT embedding cache periodically

Evaluation Metrics

  • Davies-Bouldin Score: Measures average similarity between clusters (lower is better)
  • Calinski-Harabasz Score: Ratio of between-cluster to within-cluster variance (higher is better)
  • Cluster Count: Number of discovered clusters
  • Noise Ratio: Proportion of outlier points
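
Cluster count and noise ratio follow directly from the label array. HDBSCAN marks outliers with the label -1, so a minimal sketch looks like the following. The function `summarize_labels` is hypothetical, and computing each percentage over all points, noise included, is an assumption rather than documented TACTIK behavior.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Compute cluster count, noise ratio, and per-cluster sizes from labels."""
    counts = Counter(labels)
    n_noise = counts.pop(noise_label, 0)  # noise is not counted as a cluster
    total = len(labels)
    summary = [
        {"Cluster ID": cid, "Size": size, "Percentage": 100.0 * size / total}
        for cid, size in sorted(counts.items())
    ]
    return {
        "cluster_count": len(counts),
        "noise_ratio": n_noise / total if total else 0.0,
        "summary": summary,
    }


labels = [0, 0, 1, -1, 1, 1, 2, -1, 0, 0]
stats = summarize_labels(labels)
print(stats["cluster_count"], stats["noise_ratio"])
```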

Output Formats

Cluster Summary

DataFrame with columns: Cluster ID, Size, Percentage

Keywords DataFrame

DataFrame with columns: cluster, Yake Long, Yake Short, TF, TFIDF, TFDF

Topic Analysis

Dictionary containing:

  • cluster_topics: Mapping of clusters to top topics
  • topic_designators: Matching of topics to designators
  • model: TopicModeler instance

Dependencies

  • Core: pandas, numpy, scikit-learn
  • Clustering: umap-learn, hdbscan
  • Visualization: matplotlib, seaborn
  • NLP: nltk, gensim, yake
  • Deep Learning: transformers, torch

Contributing

Contributions are welcome! We encourage you to:

  • Report Issues: Found a bug or have a feature request? Open an issue on GitHub
  • Submit Pull Requests: Improvements to code, documentation, or tests are appreciated
  • Share Use Cases: Let us know how you're using tactik
  • Improve Documentation: Help us make TACTIK more accessible

Development Setup

# Clone the repository
git clone https://github.com/npsAub/tactik.git
cd tactik

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Run tests
python -m unittest discover

# Run linting
flake8 tactik/
black tactik/

Contribution Guidelines

  1. Fork the repository and create a feature branch
  2. Write clear, documented code following the existing style
  3. Add tests for new functionality
  4. Update documentation as needed
  5. Submit a pull request with a clear description

For major changes, please open an issue first to discuss your proposal.

Citation

If you use TACTIK in your research, please cite:

@software{tactik,
  title={tactik: Text Analysis, Clustering, Tuning, Information and Keyword Extraction},
  author={Niklas P. Schulmeyer and Nicoletta Fala},
  year={2025},
  url={https://github.com/npsAub/tactik}
}

License

MIT License — See LICENSE for details

Contact

For questions or support, please open an issue on GitHub or contact nps0027@auburn.edu.
