Interactive Clustering
Python package used to apply NLP interactive clustering methods.
Quick description
Interactive clustering is a method intended to assist in the design of a training data set.
This iterative process begins with an unlabeled dataset and alternates between two substeps:
- the user defines constraints on the data;
- the machine partitions the data using a constrained clustering algorithm.
Thus, at each step of the process:
- the user corrects the clustering from the previous step using constraints, and
- the machine offers a corrected, more relevant data partitioning for the next step.
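The alternation described above can be sketched as a small loop. This is only an illustration and not the package's API: `interactive_clustering_loop`, `toy_cluster`, and `scripted_annotate` are hypothetical names, the initial unconstrained clustering is an assumption, and the toy clustering function merely merges groups joined by a `MUST_LINK` constraint.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for this sketch (not taken from the package).
Constraint = Tuple[str, str, str]  # (data_id_1, data_id_2, "MUST_LINK" or "CANNOT_LINK")
Clustering = Dict[str, int]        # data_id -> cluster id


def interactive_clustering_loop(
    data_ids: List[str],
    annotate: Callable[[Clustering], List[Constraint]],
    cluster: Callable[[List[str], List[Constraint]], Clustering],
    max_iterations: int = 5,
) -> Clustering:
    """Alternate between user annotation and constrained clustering."""
    constraints: List[Constraint] = []
    # Assumption: start from a clustering computed without any constraint.
    clustering = cluster(data_ids, constraints)
    for _ in range(max_iterations):
        # Substep 1: the user reviews the clustering and adds constraints.
        new_constraints = annotate(clustering)
        if not new_constraints:
            break  # nothing left to correct
        constraints.extend(new_constraints)
        # Substep 2: the machine re-partitions the data under all constraints.
        clustering = cluster(data_ids, constraints)
    return clustering


def toy_cluster(data_ids: List[str], constraints: List[Constraint]) -> Clustering:
    """Toy stand-in for a constrained clustering algorithm.

    Every data item starts in its own group and each MUST_LINK merges two
    groups; CANNOT_LINK constraints are ignored in this simplified version.
    """
    group = {data_id: index for index, data_id in enumerate(data_ids)}
    for first, second, kind in constraints:
        if kind == "MUST_LINK":
            old, new = group[second], group[first]
            for data_id in group:
                if group[data_id] == old:
                    group[data_id] = new
    return group


# Scripted "user": one batch of constraints per iteration, then nothing.
rounds: List[List[Constraint]] = [
    [("a", "b", "MUST_LINK")],
    [("c", "d", "MUST_LINK")],
    [],
]


def scripted_annotate(clustering: Clustering) -> List[Constraint]:
    return rounds.pop(0) if rounds else []


result = interactive_clustering_loop(["a", "b", "c", "d"], scripted_annotate, toy_cluster)
print(result)
```

In a real session the `annotate` callable would be replaced by a human annotator and `cluster` by one of the constrained clustering algorithms listed in the References section.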
The process uses several objects:
- a constraints manager: its role is to manage the constraints annotated by the user and to feed back the deduced information (such as transitivity between constraints or inconsistencies);
- a constraints sampler: its role is to select the most relevant data for the user to annotate with constraints;
- a constrained clustering algorithm: its role is to partition the data while respecting the constraints provided by the user.
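To make the role of the constraints manager more concrete, here is a minimal sketch of how transitivity and inconsistency could be handled. It is not the package's implementation: `TinyConstraintsManager` and its methods are hypothetical, and it assumes only two constraint types, `MUST_LINK` and `CANNOT_LINK`.

```python
from typing import Dict, List, Optional, Set, Tuple


class TinyConstraintsManager:
    """Hypothetical, simplified constraints manager (not the package's API)."""

    def __init__(self, data_ids: List[str]) -> None:
        # Each data item starts alone in its own MUST_LINK component.
        self._parent: Dict[str, str] = {data_id: data_id for data_id in data_ids}
        # CANNOT_LINK constraints exactly as annotated by the user.
        self._cannot: Set[Tuple[str, str]] = set()

    def _root(self, data_id: str) -> str:
        # Follow parent links up to the representative of the component.
        while self._parent[data_id] != data_id:
            data_id = self._parent[data_id]
        return data_id

    def infer(self, first: str, second: str) -> Optional[str]:
        """Return the constraint deduced by transitivity, or None if unknown."""
        if self._root(first) == self._root(second):
            return "MUST_LINK"
        roots = {self._root(first), self._root(second)}
        for left, right in self._cannot:
            if {self._root(left), self._root(right)} == roots:
                return "CANNOT_LINK"
        return None

    def add(self, first: str, second: str, kind: str) -> bool:
        """Add a constraint; return False if it contradicts known constraints."""
        if kind not in ("MUST_LINK", "CANNOT_LINK"):
            raise ValueError("kind must be 'MUST_LINK' or 'CANNOT_LINK'")
        known = self.infer(first, second)
        if known is not None and known != kind:
            return False  # inconsistency: the annotation is rejected
        if kind == "MUST_LINK":
            # Merge the two MUST_LINK components.
            self._parent[self._root(first)] = self._root(second)
        else:
            self._cannot.add((first, second))
        return True


manager = TinyConstraintsManager(["0", "1", "2", "3"])
manager.add("0", "1", "MUST_LINK")
manager.add("1", "2", "CANNOT_LINK")
print(manager.infer("0", "2"))             # CANNOT_LINK, deduced by transitivity
print(manager.add("0", "2", "MUST_LINK"))  # False, the annotation is inconsistent
```

A constraints sampler could then query `infer` to skip pairs whose constraint is already known and only propose pairs that still need an annotation.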
NB:
- This Python library does not include integration into a graphical interface.
- For more details, read the Documentation and the articles in the References section.
Documentation
Requirements
Interactive Clustering requires Python 3.6 or above.
To install Python 3.6, I recommend using pyenv.
# install pyenv
git clone https://github.com/pyenv/pyenv ~/.pyenv
# setup pyenv (you should also put these three lines in .bashrc or similar)
export PATH="${HOME}/.pyenv/bin:${PATH}"
export PYENV_ROOT="${HOME}/.pyenv"
eval "$(pyenv init -)"
# install Python 3.6
pyenv install 3.6.12
# make it available globally
pyenv global system 3.6.12
Installation
With pip:
# install package
python3 -m pip install cognitivefactory-interactive-clustering
# install spacy language model dependencies (the one you want, with version "^2.3")
python3 -m spacy download fr_core_news_sm-2.3.0 --direct
With pipx:
# install pipx
python3 -m pip install --user pipx
# install package
pipx install --python python3 cognitivefactory-interactive-clustering
# install spacy language model dependencies (the one you want, with version "^2.3")
python3 -m spacy download fr_core_news_sm-2.3.0 --direct
NB: Other spaCy language models are listed on the spaCy "Models & Languages" page. Use spaCy version "^2.3".
Development
To work on this project or contribute to it, please read the Copier PDM documentation.
Quick setup and help
Get the code and prepare the environment:
git clone https://github.com/cognitivefactory/interactive-clustering/
cd interactive-clustering
make setup
Show the help:
make help # or just make
For more details, read the Contributing documentation.
References
- Interactive Clustering:
  - Theory and Implementation: Schild, E., Durantin, G., Lamirel, J.C., & Miconi, F. (2021). Conception itérative et semi-supervisée d'assistants conversationnels par regroupement interactif des questions. In EGC 2021 - 21èmes Journées Francophones Extraction et Gestion des Connaissances. Edition RNTI. ⟨hal-03133007⟩
  - Methodological instructions: Schild, E., Durantin, G., & Lamirel, J.C. (2021). Concevoir un assistant conversationnel de manière itérative et semi-supervisée avec le clustering interactif. In Atelier - Fouille de Textes - Text Mine 2021 - En conjonction avec EGC 2021. ⟨hal-03133060⟩
- Constraints and Constrained Clustering:
  - Constraints in clustering: Wagstaff, K. and C. Cardie (2000). Clustering with Instance-level Constraints. Proceedings of the Seventeenth International Conference on Machine Learning, 1103–1110.
  - Survey on Constrained Clustering: Lampert, T., T.-B.-H. Dao, B. Lafabregue, N. Serrette, G. Forestier, B. Cremilleux, C. Vrain, and P. Gancarski (2018). Constrained distance based clustering for time-series: a comparative and experimental study. Data Mining and Knowledge Discovery 32(6), 1663–1707.
  - KMeans Clustering:
    - KMeans Clustering: MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability 1(14), 281–297.
    - Constrained 'COP' KMeans Clustering: Wagstaff, K., C. Cardie, S. Rogers, and S. Schroedl (2001). Constrained K-means Clustering with Background Knowledge. International Conference on Machine Learning.
  - Hierarchical Clustering:
    - Hierarchical Clustering: Murtagh, F. and P. Contreras (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery 2, 86–97.
    - Constrained Hierarchical Clustering: Davidson, I. and S. S. Ravi (2005). Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. Springer, Berlin, Heidelberg 3721, 12.
  - Spectral Clustering:
    - Spectral Clustering: Ng, A. Y., M. I. Jordan, and Y. Weiss (2002). On Spectral Clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. MIT Press.
    - Constrained 'SPEC' Spectral Clustering: Kamvar, S. D., D. Klein, and C. D. Manning (2003). Spectral Learning. Proceedings of the international joint conference on artificial intelligence, 561–566.
- Preprocessing and Vectorization:
  - spaCy: Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
    - spaCy language models: https://spacy.io/usage/models
  - NLTK: Bird, S., E. Loper, and E. Klein (2009). Natural Language Processing with Python. O’Reilly Media Inc.
    - NLTK 'SnowballStemmer': https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.snowball
  - Scikit-learn: Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
    - Scikit-learn 'TfidfVectorizer': https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html