FastText-based response similarity analyzer for human raters
Simars - FastText-based similarity analysis of answers & responses for human raters
Simars is a comprehensive toolkit for analyzing response similarity using FastText embeddings, specifically designed to support human raters in educational assessment and text analysis tasks.
🚀 Features
- Text Preprocessing: Comprehensive Korean and English text cleaning and tokenization
- FastText Integration: Training and fine-tuning of FastText models with Korean support
- Dimensionality Reduction: Support for UMAP, PCA, and t-SNE algorithms
- Clustering: HDBSCAN clustering for response grouping
- Interactive Visualization: Plotly-based interactive scatter plots with multiple visualization modes
- Jamo Processing: Advanced Korean text processing with Jamo decomposition
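As background for the FastText and Jamo features above: FastText represents a word by its character n-grams, so decomposing Hangul syllables into Jamo gives the model more subword units to share between related words. The sketch below is a simplified illustration of n-gram extraction, not simars or gensim code. `char_ngrams` is a hypothetical helper, and the 3 to 6 range is only assumed as a typical default.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return FastText-style character n-grams of `word`.

    The word is wrapped in '<' and '>' boundary markers first, so that
    prefixes and suffixes get their own n-grams.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A two-syllable word yields only two trigrams...
print(char_ngrams("허무", min_n=3, max_n=3))      # ['<허무', '허무>']
# ...while its Jamo-decomposed form yields four.
print(char_ngrams("ㅎㅓㅁㅜ", min_n=3, max_n=3))
```

The decomposed form produces more, shorter units, which is one plausible reason to train on Jamo sequences for short Korean answers.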
📦 Installation
pip install simars
# Install spaCy English model (required for text processing)
python -m spacy download en_core_web_sm
Development Installation
git clone https://github.com/h000000nkim/simars.git
cd simars
pip install -e ".[dev]"
🔧 Dependencies
- Core: gensim, numpy, pandas, scikit-learn, umap-learn
- NLP: jamo, pecab, spacy
- Visualization: plotly
- Clustering: hdbscan
- Development: pytest, ruff, mkdocs-material
📖 Quick Start
Basic Usage
import simars
import numpy as np
# Sample data
answers = np.array([["허무"], ["흡수율"], ["부사어"]])
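# The rows differ in length, so NumPy needs dtype=object
# (NumPy >= 1.24 raises ValueError on ragged input otherwise).
responses = np.array([
    ["허무", "공허", "무상", "허무감", "초월"],
    ["흡수율", "흡수", "반사율", "알베도"],
    ["부사어", "부사", "수식어", "부가어"]
], dtype=object)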
informations = np.array([
    "문학 문제에 대한 정서적 태도",
    "과학 제재의 핵심 개념",
    "문법 성분 분석"
])
# Initialize Simars
analyzer = simars.Fastrs(
    answers=answers,
    responses=responses,
    informations=informations
)
# Preprocess text data
analyzer.preprocess()
# Train FastText model
model = analyzer.train(
    vector_size=100,
    window=5,
    min_count=1,
    epochs=10
)
# Reduce dimensionality
coordinates = analyzer.reduce(method="umap", n_neighbors=5)
# Perform clustering
analyzer.hdbscanize()
# Visualize results
figures = analyzer.visualize()
for fig in figures:
    fig.show()
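To make "response similarity" concrete, the sketch below ranks responses by cosine similarity to the answer vector. This is an illustration only: the README does not state which similarity measure simars uses internally, `cosine_similarity` is a hypothetical helper, and the 3-dimensional vectors are made-up stand-ins for real FastText embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration; real embeddings would come from the trained model.
answer_vec = [1.0, 0.0, 0.0]
response_vecs = {
    "공허": [0.9, 0.1, 0.0],
    "초월": [0.0, 1.0, 0.0],
}

# Most similar response first.
ranked = sorted(
    response_vecs,
    key=lambda resp: cosine_similarity(answer_vec, response_vecs[resp]),
    reverse=True,
)
print(ranked)  # ['공허', '초월']
```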
Advanced Usage with Custom Data Structure
# Using dictionary format
data = {
    "item1": {
        "answer": ["정답1"],
        "response": ["정답1", "오답1", "오답2"],
        "information": "문항 설명"
    },
    "item2": {
        "answer": ["정답2"],
        "response": ["정답2", "유사답", "오답"],
        "information": "다른 문항 설명"
    }
}
analyzer = simars.Fastrs(data=data)
analyzer.preprocess()
# Fine-tune existing model
pretrained_model = simars.util.get_pretrained_model()
analyzer.finetune(model=pretrained_model, epochs=5)
# Advanced reduction with custom parameters
coordinates = analyzer.reduce(
    method="umap",
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine"
)
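The dictionary format and the three-array format from Basic Usage appear to carry the same information. The sketch below shows that correspondence, assuming each item's "answer", "response", and "information" keys map one-to-one onto the answers, responses, and informations arguments. `split_items` is a hypothetical helper, not part of simars.

```python
def split_items(data):
    """Split {item_name: {"answer", "response", "information"}} into three parallel lists."""
    answers, responses, informations = [], [], []
    for item in data.values():
        answers.append(item["answer"])
        responses.append(item["response"])
        informations.append(item["information"])
    return answers, responses, informations


data = {
    "item1": {"answer": ["정답1"], "response": ["정답1", "오답1"], "information": "문항 설명"},
    "item2": {"answer": ["정답2"], "response": ["정답2"], "information": "다른 문항 설명"},
}
answers, responses, informations = split_items(data)
print(answers)       # [['정답1'], ['정답2']]
print(informations)  # ['문항 설명', '다른 문항 설명']
```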
🛠️ Core Components
Fastrs Class
Main class for response similarity analysis:
- preprocess(): Clean, tokenize, and prepare text data
- train(): Train new FastText model from scratch
- finetune(): Fine-tune existing FastText model
- reduce(): Reduce embeddings dimensionality
- hdbscanize(): Perform clustering analysis
- visualize(): Create interactive visualizations
Item Class
Individual item processor for detailed analysis:
- clean(): Text cleaning with customizable options
- tokenize(): Korean/English tokenization
- jamoize(): Korean Jamo decomposition
- formatize(): Prepare data for FastText training
Preprocessing Module
Advanced text preprocessing functions:
- clean(): Multi-option text cleaning
- tokenize(): Morphological analysis with PeCab
- jamoize(): Korean character decomposition
- formatize(): Data formatting for training
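For readers unfamiliar with Jamo decomposition, the sketch below splits precomposed Hangul syllables into their consonant and vowel letters using the standard Unicode arithmetic (syllable = 0xAC00 + (lead × 21 + vowel) × 28 + tail). It is a standalone illustration, not the jamoize() implementation. `decompose` is a hypothetical name, and the real function may differ in output format and options.

```python
# Compatibility Jamo tables, in Unicode composition order.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
TAILS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
         "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals

HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3


def decompose(text: str) -> list[str]:
    """Split each precomposed Hangul syllable into Jamo; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            offset = code - HANGUL_BASE
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead])
            out.append(VOWELS[vowel])
            if tail:  # index 0 means the syllable has no final consonant
                out.append(TAILS[tail])
        else:
            out.append(ch)
    return out


print(decompose("허무"))  # ['ㅎ', 'ㅓ', 'ㅁ', 'ㅜ']
print(decompose("한"))    # ['ㅎ', 'ㅏ', 'ㄴ']
```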
Visualization Module
Interactive plotting with Plotly:
- scatter(): Unified scatter plot function
- Multiple plot types: simple, value count, labeled, combined
- Customizable themes and color schemes
📊 Visualization Types
Simple Scatter Plot
Basic 2D visualization highlighting answers vs responses.
Value Count Scatter Plot
3D visualization showing response frequency in the z-axis.
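The frequency shown on the z-axis can be thought of as a count of identical responses. The sketch below tallies raw responses with the standard library. It only illustrates the idea: the sample responses are invented, and how simars computes and attaches these counts is not specified here.

```python
from collections import Counter

# Invented raw responses to one item, including repeats.
raw_responses = ["허무", "공허", "허무", "무상", "허무", "공허"]

counts = Counter(raw_responses)

# One (response, frequency) pair per distinct response, most frequent first;
# the frequency is what would be plotted on the z-axis.
points = counts.most_common()
print(points)  # [('허무', 3), ('공허', 2), ('무상', 1)]
```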
Labeled Scatter Plot
Color-coded visualization based on clustering results.
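HDBSCAN assigns each point an integer cluster label and uses -1 for points it treats as noise. The sketch below groups responses by such labels and separates the noise points, which are natural candidates for manual review. The labels here are invented for illustration and `group_by_label` is a hypothetical helper, not a simars function.

```python
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN's label for points that belong to no cluster


def group_by_label(responses, labels):
    """Map each cluster label to the list of responses carrying that label."""
    clusters = defaultdict(list)
    for response, label in zip(responses, labels):
        clusters[label].append(response)
    return dict(clusters)


# Invented labels standing in for a real clustering result.
responses = ["허무", "공허", "허무감", "초월", "무상"]
labels = [0, 0, 0, -1, 1]

clusters = group_by_label(responses, labels)
outliers = clusters.pop(NOISE_LABEL, [])

print(clusters)  # {0: ['허무', '공허', '허무감'], 1: ['무상']}
print(outliers)  # ['초월']
```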
Combined Scatter Plot
3D visualization combining clustering and frequency information.
⚙️ Configuration
simars uses JSON configuration files for customization:
- color_schemes.json: Color themes for visualizations
- plot_config.json: Plot layout and styling options
- reduction_defaults.json: Default parameters for dimensionality reduction
- fasttext_defaults.json: Default FastText training parameters
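A common way to use default files like these is to load the JSON and let call-time arguments override individual keys. The sketch below shows that pattern under stated assumptions: the schema of reduction_defaults.json is not documented here, so the JSON text and `load_config` are hypothetical, with key names borrowed from the reduce() call in the Quick Start.

```python
import json


def load_config(defaults_text, overrides=None):
    """Parse JSON defaults and apply optional per-call overrides on top."""
    config = json.loads(defaults_text)
    config.update(overrides or {})
    return config


# Hypothetical file contents; the real reduction_defaults.json may look different.
defaults_text = '{"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}'

config = load_config(defaults_text, {"n_neighbors": 5})
print(config)  # {'n_neighbors': 5, 'min_dist': 0.1, 'metric': 'cosine'}
```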
🧪 Testing
Run the test suite:
# All tests
pytest
# Unit tests only
pytest tests/unit/
# Integration tests only
pytest tests/integration/
# With coverage
pytest --cov=simars tests/
📚 Use Cases
- Educational Assessment: Analyze student response patterns
- Content Analysis: Group similar text responses
- Quality Assurance: Identify outlier responses for review
- Research: Study response similarity patterns in surveys
🤝 Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
👤 Author
Hoon Kim - h000000nkim@gmail.com
📈 Roadmap
- Support for additional languages
- Web interface for easy usage
- Additional clustering algorithms
- Export functionality for results
- Integration with popular ML frameworks