A Python library for topic modeling, clustering, and NLP analysis.
TACTIK
Text Analysis, Clustering, Tuning, Information and Keyword Extraction
TACTIK started as a side project to streamline the clustering of aviation-related reports. The initial pipeline suffered from long processing times and became a bottleneck for analysis. After addressing these issues, we added functionality for intuitive topic extraction and adapted the pipeline to work domain-agnostically while keeping the core use case in mind. We decided to release the package publicly so that other researchers can contribute to it, build on it, or benefit from the included tools. Thank you for using TACTIK. We hope you find it as useful as we did in our research!
Features
- End-to-End Clustering Pipeline: Automated workflow from preprocessing to cluster analysis
- Modular Design: Use the different components as standalone modules or full pipelines
- Layered Effective Methods: UMAP dimensionality reduction + HDBSCAN clustering
- Hyperparameter Tuning: Automated parameter optimization using random search
- Keyword Extraction: Multiple methods including TF, TF-IDF, TF-DF, and YAKE
- Topic Modeling: LDA-based topic discovery with BERT-powered semantic matching (still in development)
- Rich Visualizations: t-SNE plots with customizable styling and annotations
- Memory Efficient: Optimized for large datasets with lazy evaluation and caching
Installation
# Install tactik
pip install tactik
# Or install from source
git clone https://github.com/npsAub/tactik.git
cd tactik
pip install -e .
Dependencies
# Core dependencies
pip install pandas numpy matplotlib seaborn scikit-learn
pip install umap-learn hdbscan gensim nltk yake
pip install transformers torch
Core Components
1. ClusteringPipeline
Main orchestrator class that coordinates the entire analysis workflow.
Key Methods:
- preprocess_data(): Text cleaning and stopword removal
- cluster_data(): Clustering with fixed parameters
- tune_and_cluster(): Clustering with hyperparameter tuning
- cluster_and_extract_keywords(): Integrated clustering + keyword extraction
- cluster_and_analyze_topics(): Integrated clustering + topic modeling
- visualize_clusters(): Create cluster visualizations
- get_cluster_summary(): Generate cluster statistics
2. Clustering & Tuning
Low-level clustering functions with hyperparameter optimization.
Key Functions:
- tune_clustering_hyperparameters(): Random search optimization
- apply_best_clustering(): Apply optimized parameters
- full_clustering_pipeline(): Complete pipeline with tuning
- full_clustering_pipeline_fixed_params(): Pipeline with fixed parameters
Supported Metrics:
- davies_bouldin: Lower is better (measures cluster separation)
- calinski_harabasz: Higher is better (ratio of between/within cluster dispersion)
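As a rough illustration of random search scored by one of these metrics, the sketch below samples clustering settings at random and keeps the configuration with the lowest Davies-Bouldin score. It is not TACTIK's implementation: scikit-learn's DBSCAN is swapped in for HDBSCAN, the parameter ranges are arbitrary, and the helper name random_search_dbscan is hypothetical.

```python
import random

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def random_search_dbscan(X, n_trials=20, seed=0):
    """Sample DBSCAN settings at random; keep the lowest Davies-Bouldin score."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Assumed search space, chosen only for this toy dataset
        params = {
            "eps": rng.uniform(0.3, 2.0),
            "min_samples": rng.randint(3, 10),
        }
        labels = DBSCAN(**params).fit_predict(X)
        mask = labels != -1  # score only points assigned to a cluster
        n_clusters = len(set(labels[mask]))
        if n_clusters < 2:
            continue  # the metric is undefined for fewer than two clusters
        score = davies_bouldin_score(X[mask], labels[mask])
        if best is None or score < best["score"]:
            best = {"params": params, "score": score, "n_clusters": n_clusters}
    return best


# Synthetic, well-separated blobs stand in for reduced document vectors
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
best = random_search_dbscan(X)
print(best)
```

To optimize calinski_harabasz instead, the comparison would flip, since higher values are better for that metric.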
3. Keyword Extraction
Extract representative keywords from each cluster using multiple methods.
KeywordExtractor Class:
- extract_keywords_per_cluster(): Extract keywords using multiple methods
- save_keywords(): Save results to CSV
Extraction Methods:
- TF: Term Frequency
- TF-IDF: Term Frequency–Inverse Document Frequency
- TF-DF: Term Frequency–Document Frequency
- YAKE: Yet Another Keyword Extractor (long and short narratives)
4. Topic Modeling
Discover latent topics using LDA and match them to predefined designators.
TopicModeler Class:
- train_lda(): Train Latent Dirichlet Allocation model
- get_cluster_topics(): Get top topics per cluster
- match_designators_to_topics(): Match topics to designators using BERT embeddings
- get_bert_embedding(): Compute BERT embeddings for semantic matching
Default Aviation Safety Designators:
- Inadequate or inaccurate knowledge
- Poor judgment and decision-making
- Failure to follow procedures
- Poor communication
- Inadequate monitoring or vigilance
- Task management and prioritization
- Stress and psychological factors
- Physical or physiological factors
- Technical or system failures
- Environmental factors
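The matching step can be pictured as a nearest-neighbor lookup between topic vectors and designator vectors. The sketch below is only an approximation of that idea: TF-IDF vectors are swapped in for BERT embeddings so the example runs without downloading a model, cosine similarity is an assumed choice of distance, and the topic word lists are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

designators = [
    "Poor communication",
    "Failure to follow procedures",
    "Technical or system failures",
]

# Hypothetical topics, each represented by its top words
topics = {
    0: "radio communication controller readback",
    1: "checklist procedures follow skipped",
}

# Stand-in for BERT: embed both sides in one shared TF-IDF space
vectorizer = TfidfVectorizer().fit(designators + list(topics.values()))
designator_vecs = vectorizer.transform(designators)
topic_vecs = vectorizer.transform(list(topics.values()))

# Rows are topics, columns are designators
similarity = cosine_similarity(topic_vecs, designator_vecs)

# Assign each topic the designator with the highest similarity
matches = {
    topic_id: designators[row.argmax()]
    for topic_id, row in zip(topics, similarity)
}
print(matches)
```

TF-IDF only matches on shared words, whereas contextual embeddings can also match paraphrases, which is the reason to prefer them in practice.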
5. Visualization
Create publication-ready visualizations of clustering results.
Visualization Functions:
- plot_clusters(): Basic cluster scatter plot
- plot_clusters_with_annotations(): Plot with category annotations
- plot_cluster_comparison(): Side-by-side comparison plots
- set_visualization_style(): Configure plot styling
- get_cluster_palette(): Generate color palettes
- get_cluster_markers(): Generate marker styles
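For readers who want to see what a basic cluster scatter plot involves, the following sketch projects synthetic data with scikit-learn's t-SNE and colors the points by cluster using matplotlib. It does not call TACTIK's plotting functions; the data, perplexity, styling, and output path are assumptions made for the example.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic 10-dimensional points with known cluster labels
X, labels = make_blobs(n_samples=60, centers=3, n_features=10, random_state=0)

# Project to 2D for plotting; perplexity must be smaller than the sample count
coords = TSNE(
    n_components=2, perplexity=10, init="pca", random_state=0
).fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=20)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title("Clusters in t-SNE space")
ax.legend(*scatter.legend_elements(), title="Cluster")

out_path = os.path.join(tempfile.gettempdir(), "clusters_tsne.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```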
Pipeline Architecture
Input Data (DataFrame)
↓
Preprocessing
├── Text cleaning
├── Stopword removal
└── Tokenization
↓
Vectorization (TF-IDF)
↓
Dimensionality Reduction (UMAP)
↓
Clustering (HDBSCAN)
↓
Visualization (t-SNE)
↓
Analysis
├── Keyword Extraction
└── Topic Modeling (LDA + BERT)
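The first four stages above can be sketched with scikit-learn alone. This is an illustration of the staged shape, not TACTIK's code: TruncatedSVD is swapped in for UMAP and DBSCAN for HDBSCAN so the example has no extra dependencies, and the stopword list, documents, and parameter values are toy assumptions.

```python
import re

from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

STOPWORDS = {"the", "a", "an", "during", "after", "on", "to", "of"}  # toy list


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


docs = [
    "Engine failure during the climb",
    "Engine oil leak after takeoff",
    "Engine vibration on climb",
    "Runway incursion during taxi",
    "Wrong runway on taxi",
    "Runway closed, taxi delayed",
]

# Preprocessing
cleaned = [preprocess(d) for d in docs]

# Vectorization (TF-IDF)
tfidf = TfidfVectorizer().fit_transform(cleaned)

# Dimensionality reduction (TruncatedSVD as a stand-in for UMAP)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Clustering (DBSCAN as a stand-in for HDBSCAN); label -1 marks noise
labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(reduced)

for doc, label in zip(docs, labels):
    print(label, doc)
```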
Performance Considerations
Memory Optimization
- DataFrame lazy copying
- Vectorization caching
- BERT embedding cache with clear_cache() method
- Incremental topic probability calculations
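The idea behind an embedding cache with a clear_cache() method can be sketched in a few lines. The class below is hypothetical and only mirrors the method name mentioned above; TACTIK's actual cache may be organized differently, and toy_embed is a placeholder for an expensive model call.

```python
class EmbeddingCache:
    """Memoize embeddings by input text so repeated lookups skip the model."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.misses = 0  # number of times the underlying function was called

    def get(self, text):
        if text not in self._cache:
            self.misses += 1
            self._cache[text] = self._embed_fn(text)
        return self._cache[text]

    def clear_cache(self):
        """Release all stored embeddings to free memory."""
        self._cache.clear()


def toy_embed(text):
    """Placeholder for a model call: returns two simple text features."""
    return [len(text), text.count(" ")]


cache = EmbeddingCache(toy_embed)
cache.get("poor communication")
cache.get("poor communication")  # served from the cache
print(cache.misses)

cache.clear_cache()
cache.get("poor communication")  # recomputed after clearing
print(cache.misses)
```

The trade-off is memory for speed: the cache grows with the number of distinct texts, which is why clearing it periodically matters on large datasets.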
GPU Acceleration
GPU acceleration is available for BERT computations when initializing TopicModeler with use_gpu=True.
Large Datasets
For large datasets, consider:
- Disabling t-SNE computation with compute_tsne=False
- Using fixed parameters instead of hyperparameter tuning
- Clearing BERT embedding cache periodically
Evaluation Metrics
- Davies-Bouldin Score: Measures average similarity between clusters (lower is better)
- Calinski-Harabasz Score: Ratio of between-cluster to within-cluster variance (higher is better)
- Cluster Count: Number of discovered clusters
- Noise Ratio: Proportion of outlier points
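These four quantities can be computed from a feature matrix and a label vector with scikit-learn and NumPy. The sketch below assumes the HDBSCAN convention that label -1 marks noise and excludes noise points before scoring; whether TACTIK scores noise points the same way is not stated here, and the evaluate_clustering helper is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score


def evaluate_clustering(X, labels):
    """Summarize a clustering; label -1 is treated as noise."""
    labels = np.asarray(labels)
    clustered = labels != -1
    cluster_ids = set(labels[clustered].tolist())
    result = {
        "n_clusters": len(cluster_ids),
        "noise_ratio": float(np.mean(~clustered)),
    }
    # Both scores require at least two clusters
    if len(cluster_ids) >= 2:
        result["davies_bouldin"] = davies_bouldin_score(
            X[clustered], labels[clustered]
        )
        result["calinski_harabasz"] = calinski_harabasz_score(
            X[clustered], labels[clustered]
        )
    return result


X, labels = make_blobs(n_samples=100, centers=3, cluster_std=0.5, random_state=0)
labels = labels.copy()
labels[:10] = -1  # pretend the first ten points were flagged as noise

metrics = evaluate_clustering(X, labels)
print(metrics)
```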
Output Formats
Cluster Summary
DataFrame with columns: Cluster ID, Size, Percentage
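A table with these columns can be derived from the label vector alone. The pandas sketch below is a guess at the shape of the output rather than TACTIK's own code; in particular, listing noise (label -1) as its own row and rounding to one decimal are assumptions.

```python
import pandas as pd


def cluster_summary(labels):
    """Count points per cluster and express each count as a percentage."""
    counts = pd.Series(labels).value_counts().sort_index()
    return pd.DataFrame({
        "Cluster ID": counts.index.to_numpy(),
        "Size": counts.to_numpy(),
        "Percentage": (100 * counts / counts.sum()).round(1).to_numpy(),
    })


summary = cluster_summary([0, 0, 1, -1, 1, 0, 2, 0])
print(summary)
```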
Keywords DataFrame
DataFrame with columns: cluster, Yake Long, Yake Short, TF, TFIDF, TFDF
Topic Analysis
Dictionary containing:
- cluster_topics: Mapping of clusters to top topics
- topic_designators: Matching of topics to designators
- model: TopicModeler instance
Dependencies
- Core: pandas, numpy, scikit-learn
- Clustering: umap-learn, hdbscan
- Visualization: matplotlib, seaborn
- NLP: nltk, gensim, yake
- Deep Learning: transformers, torch
Contributing
Contributions are welcome! We encourage you to:
- Report Issues: Found a bug or have a feature request? Open an issue on GitHub
- Submit Pull Requests: Improvements to code, documentation, or tests are appreciated
- Share Use Cases: Let us know how you're using tactik
- Improve Documentation: Help us make TACTIK more accessible
Development Setup
# Clone the repository
git clone https://github.com/npsAub/tactik.git
cd tactik
# Install in development mode with dev dependencies
pip install -e ".[dev]"
# Run tests
python -m unittest discover
# Run linting and formatting
flake8 tactik/
black tactik/
Contribution Guidelines
- Fork the repository and create a feature branch
- Write clear, documented code following the existing style
- Add tests for new functionality
- Update documentation as needed
- Submit a pull request with a clear description
For major changes, please open an issue first to discuss your proposal.
Citation
If you use TACTIK in your research, please cite:
@software{tactik,
title={tactik: Text Analysis, Clustering, Tuning, Information and Keyword Extraction},
author={Niklas P. Schulmeyer and Nicoletta Fala},
year={2025},
url={https://github.com/npsAub/tactik}
}
License
MIT License — See LICENSE for details
Contact
For questions or support, please open an issue on GitHub or contact nps0027@auburn.edu.