Interactive Clustering
Python package used to apply NLP interactive clustering methods.
Quick description
Interactive clustering is a method intended to assist in the design of a training dataset.
This iterative process begins with an unlabeled dataset and repeats a sequence of two substeps:
- the user defines constraints on data sampled by the computer;
- the computer partitions the data using a constrained clustering algorithm.
Thus, at each step of the process:
- the user corrects the clustering of the previous steps using constraints, and
- the computer offers a corrected and more relevant data partitioning for the next step.
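The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the sample / annotate / cluster cycle, not this package's API: the names `sample_pairs`, `cluster_with_constraints`, `interactive_clustering` and the `annotate` callback are hypothetical, the sampler is purely random, and the "clustering" step is a toy that only groups data connected by MUST_LINK constraints.

```python
import itertools
import random
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical representation: (data_id_1, data_id_2, "MUST_LINK" or "CANNOT_LINK").
Constraint = Tuple[str, str, str]


def sample_pairs(data_ids: List[str], annotated: Set[Tuple[str, str]],
                 nb_to_select: int, rng: random.Random) -> List[Tuple[str, str]]:
    """Toy sampler: pick random pairs of data that were not annotated yet."""
    candidates = [pair for pair in itertools.combinations(sorted(data_ids), 2)
                  if pair not in annotated]
    rng.shuffle(candidates)
    return candidates[:nb_to_select]


def cluster_with_constraints(data_ids: List[str],
                             constraints: List[Constraint]) -> Dict[str, str]:
    """Toy constrained clustering: one cluster per group of MUST_LINK-connected data.

    Nothing else is ever merged, so consistent CANNOT_LINK constraints are
    trivially respected. A real algorithm would also use the data vectors.
    """
    parent = {data_id: data_id for data_id in data_ids}

    def find(data_id: str) -> str:
        while parent[data_id] != data_id:
            data_id = parent[data_id]
        return data_id

    for id1, id2, constraint_type in constraints:
        if constraint_type == "MUST_LINK":
            parent[find(id2)] = find(id1)
    return {data_id: find(data_id) for data_id in data_ids}


def interactive_clustering(data_ids: List[str],
                           annotate: Callable[[str, str], str],
                           nb_iterations: int,
                           nb_per_iteration: int,
                           seed: int = 42) -> Dict[str, str]:
    """Alternate between user annotation and constrained clustering."""
    rng = random.Random(seed)
    constraints: List[Constraint] = []
    annotated: Set[Tuple[str, str]] = set()
    clustering = cluster_with_constraints(data_ids, constraints)
    for _ in range(nb_iterations):
        # Substep 1: the user annotates constraints on sampled data.
        pairs = sample_pairs(data_ids, annotated, nb_per_iteration, rng)
        if not pairs:
            break  # every pair has been annotated
        for id1, id2 in pairs:
            constraints.append((id1, id2, annotate(id1, id2)))
            annotated.add((id1, id2))
        # Substep 2: the computer partitions the data under these constraints.
        clustering = cluster_with_constraints(data_ids, constraints)
    return clustering
```

In practice `annotate` would be a human decision; for experiments it can be simulated with known labels, for example `lambda i, j: "MUST_LINK" if labels[i] == labels[j] else "CANNOT_LINK"`.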
The process uses several objects:
- a constraints manager: it manages the constraints annotated by the user and feeds back the information deduced from them (such as the transitivity between constraints or the detection of an inconsistency);
- a constraints sampler: it selects the most relevant data for the user to annotate with constraints;
- a constrained clustering algorithm: it partitions the data while respecting the constraints provided by the user.
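To make the role of the constraints manager more concrete, here is a minimal sketch of how transitivity and inconsistency detection could work. It is an assumption-laden toy, not the manager shipped with this package: the class name `ToyConstraintsManager` and its methods are hypothetical, and only the two constraint types "MUST_LINK" and "CANNOT_LINK" are considered.

```python
from typing import Dict, FrozenSet, List, Optional, Set


class ToyConstraintsManager:
    """Hypothetical manager storing MUST_LINK / CANNOT_LINK constraints."""

    def __init__(self, data_ids: List[str]) -> None:
        # Each data ID maps to the (shared) set of IDs it must be linked with.
        self._group: Dict[str, Set[str]] = {d: {d} for d in data_ids}
        # Pairs explicitly annotated as CANNOT_LINK.
        self._cannot: Set[FrozenSet[str]] = set()

    def get_inferred_constraint(self, id1: str, id2: str) -> Optional[str]:
        """Return the constraint deduced by transitivity, or None if unknown."""
        if id2 in self._group[id1]:
            return "MUST_LINK"
        # A CANNOT_LINK between any members of the two groups applies to both groups.
        for a in self._group[id1]:
            for b in self._group[id2]:
                if frozenset((a, b)) in self._cannot:
                    return "CANNOT_LINK"
        return None

    def add_constraint(self, id1: str, id2: str, constraint_type: str) -> None:
        """Add a constraint, refusing it if it contradicts what is already known."""
        if constraint_type not in ("MUST_LINK", "CANNOT_LINK"):
            raise ValueError(f"Unknown constraint type: {constraint_type}")
        inferred = self.get_inferred_constraint(id1, id2)
        if inferred is not None and inferred != constraint_type:
            raise ValueError(
                f"Inconsistent constraint: ({id1}, {id2}) is already {inferred}"
            )
        if constraint_type == "MUST_LINK":
            # Merge both groups and share the merged set between all members.
            merged = self._group[id1] | self._group[id2]
            for data_id in merged:
                self._group[data_id] = merged
        else:
            self._cannot.add(frozenset((id1, id2)))
```

With this sketch, annotating MUST_LINK on ("0", "1") and ("1", "2"), then CANNOT_LINK on ("2", "3"), lets the manager deduce MUST_LINK for ("0", "2") and CANNOT_LINK for ("0", "3"); a later MUST_LINK on ("0", "3") is rejected as inconsistent.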
NB:
- This Python library does not include integration into a graphical interface.
- For more details, read the Documentation and the articles in the References section.
Documentation
Installation
Interactive Clustering requires Python 3.8 or above.
To install with pip:
# install package
python3 -m pip install cognitivefactory-interactive-clustering
# install spacy language model dependencies (the one you want, with version "3.4.x")
python3 -m spacy download fr_core_news_md-3.4.0 --direct
To install with pipx:
# install pipx
python3 -m pip install --user pipx
# install package
pipx install --python python3 cognitivefactory-interactive-clustering
# install spacy language model dependencies (the one you want, with version "3.4.x")
python3 -m spacy download fr_core_news_md-3.4.0 --direct
NB: Other spaCy language models are listed on the spaCy "Models & Languages" page (https://spacy.io/usage/models). Use spaCy version "3.4.x".
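After installation, the version requirements stated above (Python 3.8 or above, spaCy "3.4.x") can be checked with a small script. This is an optional sketch that relies only on the standard library; the helper names are hypothetical and are not provided by the package.

```python
import sys
from importlib import metadata
from typing import Optional, Sequence


def is_supported_python(version_info: Sequence[int] = sys.version_info) -> bool:
    """Check the interpreter against the stated minimum (Python 3.8)."""
    return tuple(version_info[:2]) >= (3, 8)


def is_expected_spacy(version: str) -> bool:
    """Check that a version string belongs to the spaCy 3.4.x series."""
    return version.split(".")[:2] == ["3", "4"]


def installed_spacy_version() -> Optional[str]:
    """Return the installed spaCy version, or None if spaCy is not installed."""
    try:
        return metadata.version("spacy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    spacy_version = installed_spacy_version()
    print("Python OK:", is_supported_python())
    print("spaCy version:", spacy_version)
    print("spaCy OK:", spacy_version is not None and is_expected_spacy(spacy_version))
```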
Development
To work on this project or contribute to it, please read:
- the Copier PDM template documentation;
- the Contributing page for environment setup and development help;
- the Code of Conduct page for contribution rules.
References
- Interactive Clustering:
  - PhD report: Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.
  - First presentation: Schild, E., Durantin, G., Lamirel, J.C., & Miconi, F. (2021). Conception itérative et semi-supervisée d'assistants conversationnels par regroupement interactif des questions. In EGC 2021 - 21èmes Journées Francophones Extraction et Gestion des Connaissances. Edition RNTI. <hal-03133007>.
  - Theoretical study: Schild, E., Durantin, G., Lamirel, J., & Miconi, F. (2022). Iterative and Semi-Supervised Design of Chatbots Using Interactive Clustering. International Journal of Data Warehousing and Mining (IJDWM), 18(2), 1-19. http://doi.org/10.4018/IJDWM.298007. <hal-03648041>.
  - Methodological discussion: Schild, E., Durantin, G., & Lamirel, J.C. (2021). Concevoir un assistant conversationnel de manière itérative et semi-supervisée avec le clustering interactif. In Atelier - Fouille de Textes - Text Mine 2021 - En conjonction avec EGC 2021. <hal-03133060>.
- Constraints and Constrained Clustering:
  - Constraints in clustering: Wagstaff, K. and C. Cardie (2000). Clustering with Instance-level Constraints. Proceedings of the Seventeenth International Conference on Machine Learning, 1103–1110.
  - Survey on Constrained Clustering: Lampert, T., T.-B.-H. Dao, B. Lafabregue, N. Serrette, G. Forestier, B. Cremilleux, C. Vrain, and P. Gancarski (2018). Constrained distance based clustering for time-series: a comparative and experimental study. Data Mining and Knowledge Discovery 32(6), 1663–1707.
  - Affinity Propagation:
    - Affinity Propagation Clustering: Frey, B. J., & Dueck, D. (2007). Clustering by Passing Messages Between Data Points. In Science (Vol. 315, Issue 5814, pp. 972–976). American Association for the Advancement of Science (AAAS). https://doi.org/10.1126/science.1136800
    - Constrained Affinity Propagation Clustering: Givoni, I., & Frey, B. J. (2009). Semi-Supervised Affinity Propagation with Instance-Level Constraints. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5:161-168.
  - DBScan:
    - DBScan Clustering: Ester, Martin & Kröger, Peer & Sander, Joerg & Xu, Xiaowei. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD. 96. 226-231.
    - Constrained DBScan Clustering: Ruiz, Carlos & Spiliopoulou, Myra & Menasalvas, Ernestina. (2007). C-DBSCAN: Density-Based Clustering with Constraints. 216-223. 10.1007/978-3-540-72530-5_25.
  - KMeans Clustering:
    - KMeans Clustering: MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability 1(14), 281–297.
    - Constrained 'COP' KMeans Clustering: Wagstaff, K., C. Cardie, S. Rogers, and S. Schroedl (2001). Constrained K-means Clustering with Background Knowledge. International Conference on Machine Learning.
    - Constrained 'MPC' KMeans Clustering: Khan, Md. A., Tamim, I., Ahmed, E., & Awal, M. A. (2012). Multiple Parameter Based Clustering (MPC): Prospective Analysis for Effective Clustering in Wireless Sensor Network (WSN) Using K-Means Algorithm. In Wireless Sensor Network (Vol. 04, Issue 01, pp. 18–24). Scientific Research Publishing, Inc. https://doi.org/10.4236/wsn.2012.41003
  - Hierarchical Clustering:
    - Hierarchical Clustering: Murtagh, F. and P. Contreras (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 86–97.
    - Constrained Hierarchical Clustering: Davidson, I. and S. S. Ravi (2005). Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. Springer, Berlin, Heidelberg 3721, 12.
  - Spectral Clustering:
    - Spectral Clustering: Ng, A. Y., M. I. Jordan, and Y. Weiss (2002). On Spectral Clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. MIT Press.
    - Constrained 'SPEC' Spectral Clustering: Kamvar, S. D., D. Klein, and C. D. Manning (2003). Spectral Learning. Proceedings of the international joint conference on artificial intelligence, 561–566.
- Preprocessing and Vectorization:
  - spaCy: Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
    - spaCy language models: https://spacy.io/usage/models
  - NLTK: Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O’Reilly Media Inc.
    - NLTK 'SnowballStemmer': https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.snowball
  - Scikit-learn: Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
    - Scikit-learn 'TfidfVectorizer': https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Other links
- Several comparative studies of Interactive Clustering methodology on NLP datasets:
Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255
- A web application designed for NLP data annotation using Interactive Clustering methodology:
Schild, E. (2021). cognitivefactory/interactive-clustering-gui. Zenodo. https://doi.org/10.5281/zenodo.4775270
How to cite
Schild, E. (2021). cognitivefactory/interactive-clustering. Zenodo. https://doi.org/10.5281/zenodo.4775251.