Python package used to apply NLP interactive clustering methods.

Interactive Clustering

Quick description

Interactive clustering is a method intended to assist in the design of a training data set.

This iterative process begins with an unlabeled dataset and alternates between two substeps:

  1. the user defines constraints on the data;

  2. the machine performs data partitioning using a constrained clustering algorithm.

Thus, at each step of the process:

  • the user corrects the clustering of the previous step using constraints, and

  • the machine offers a corrected and more relevant data partitioning for the next step.
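
The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the package's API: the function names `constrained_clustering` and `interactive_clustering` are hypothetical, the data are 1-D numbers instead of texts, the clustering is a toy single-link agglomerative algorithm, and the user is simulated by an oracle that knows the true labels.

```python
def constrained_clustering(points, must_link, cannot_link, n_clusters):
    """Toy single-link agglomerative clustering on 1-D points (hypothetical helper)."""
    # Start with one cluster per point.
    clusters = [{i} for i in range(len(points))]

    def cluster_of(i):
        return next(c for c in clusters if i in c)

    def merge(c1, c2):
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)

    def forbidden(c1, c2):
        # A merge is forbidden if a CANNOT_LINK pair would end up together.
        return any((i in c1 and j in c2) or (i in c2 and j in c1) for i, j in cannot_link)

    def distance(c1, c2):
        return min(abs(points[i] - points[j]) for i in c1 for j in c2)

    # MUST_LINK pairs are merged first (this toy version does not check
    # that they are compatible with the CANNOT_LINK pairs).
    for i, j in must_link:
        if cluster_of(i) is not cluster_of(j):
            merge(cluster_of(i), cluster_of(j))

    # Then merge the closest allowed pair of clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        allowed = [
            (distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
            if not forbidden(clusters[a], clusters[b])
        ]
        if not allowed:
            break  # the constraints prevent any further merge
        _, a, b = min(allowed)
        merge(clusters[a], clusters[b])

    # Number the clusters by their smallest member to get stable labels.
    ordered = sorted(clusters, key=min)
    return [next(k for k, c in enumerate(ordered) if i in c) for i in range(len(points))]


def interactive_clustering(points, true_labels, n_clusters, n_iterations, pairs_per_iteration=2):
    """Alternate simulated user annotation and constrained clustering."""
    must_link, cannot_link = [], []
    labels = constrained_clustering(points, must_link, cannot_link, n_clusters)
    for _ in range(n_iterations):
        # Substep 1: the simulated user looks for pairs the clustering got wrong.
        annotated = set(must_link) | set(cannot_link)
        wrong_pairs = [
            (i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if (i, j) not in annotated
            and (labels[i] == labels[j]) != (true_labels[i] == true_labels[j])
        ]
        if not wrong_pairs:
            break  # nothing left to correct
        for i, j in wrong_pairs[:pairs_per_iteration]:
            if true_labels[i] == true_labels[j]:
                must_link.append((i, j))
            else:
                cannot_link.append((i, j))
        # Substep 2: the machine partitions the data again under the constraints.
        labels = constrained_clustering(points, must_link, cannot_link, n_clusters)
    return labels


points = [1.0, 2.0, 3.0, 4.0, 9.0, 10.0]
truth = [0, 0, 1, 1, 1, 1]
print(constrained_clustering(points, [], [], 2))         # [0, 0, 0, 0, 1, 1]
print(interactive_clustering(points, truth, 2, 3))       # [0, 0, 1, 1, 1, 1]
```

In this toy run, two CANNOT_LINK annotations are enough to move the boundary between the two clusters. In a real project, the annotations come from a person rather than from known labels.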

The process uses several objects:

  • a constraints manager: it manages the constraints annotated by the user and feeds back the information deduced from them (such as constraints inferred by transitivity or a detected inconsistency);

  • a constraints sampler: it selects the most relevant data to present to the user for constraint annotation;

  • a constrained clustering algorithm: it partitions the data while respecting the constraints provided by the user.
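
To make the role of the constraints manager more concrete, here is a minimal sketch of how transitivity and inconsistency detection could work. It assumes two constraint types named "MUST_LINK" and "CANNOT_LINK"; the class `ToyConstraintsManager` and its methods are hypothetical and are not the package's API.

```python
class ToyConstraintsManager:
    """Hypothetical, simplified constraints manager (not the library's API)."""

    def __init__(self, data_ids):
        # Each data point starts alone in its own MUST_LINK component.
        self._parent = {data_id: data_id for data_id in data_ids}
        # CANNOT_LINK constraints are stored between component roots.
        self._cannot = set()

    def _find(self, data_id):
        # Follow parents up to the component root, compressing the path.
        while self._parent[data_id] != data_id:
            self._parent[data_id] = self._parent[self._parent[data_id]]
            data_id = self._parent[data_id]
        return data_id

    def infer(self, id1, id2):
        """Return 'MUST_LINK', 'CANNOT_LINK', or None if nothing can be deduced."""
        root1, root2 = self._find(id1), self._find(id2)
        if root1 == root2:
            return "MUST_LINK"
        if frozenset((root1, root2)) in self._cannot:
            return "CANNOT_LINK"
        return None

    def add(self, id1, id2, constraint_type):
        """Add a constraint; raise ValueError if it contradicts known ones."""
        if constraint_type not in ("MUST_LINK", "CANNOT_LINK"):
            raise ValueError(f"unknown constraint type: {constraint_type}")
        known = self.infer(id1, id2)
        if known is not None and known != constraint_type:
            raise ValueError(
                f"{constraint_type} between {id1} and {id2} conflicts with inferred {known}"
            )
        root1, root2 = self._find(id1), self._find(id2)
        if constraint_type == "CANNOT_LINK":
            self._cannot.add(frozenset((root1, root2)))
        elif root1 != root2:
            # Merge the two components, then re-point CANNOT_LINK pairs to the new roots.
            self._parent[root2] = root1
            self._cannot = {frozenset(self._find(i) for i in pair) for pair in self._cannot}


manager = ToyConstraintsManager(["a", "b", "c", "d"])
manager.add("a", "b", "MUST_LINK")
manager.add("b", "c", "CANNOT_LINK")
print(manager.infer("a", "c"))  # CANNOT_LINK (deduced by transitivity)
print(manager.infer("a", "d"))  # None (nothing is known about this pair)
```

With this design, a contradictory annotation such as `manager.add("a", "c", "MUST_LINK")` raises a `ValueError` instead of silently corrupting the set of constraints.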

NB:

  • This Python library does not include a graphical user interface.

  • For more details, read the Documentation and the articles in the References section.

Documentation

Requirements

Interactive Clustering requires Python 3.6 or above.

To install Python 3.6, I recommend using pyenv:

# install pyenv
git clone https://github.com/pyenv/pyenv ~/.pyenv

# setup pyenv (you should also put these three lines in .bashrc or similar)
export PATH="${HOME}/.pyenv/bin:${PATH}"
export PYENV_ROOT="${HOME}/.pyenv"
eval "$(pyenv init -)"

# install Python 3.6
pyenv install 3.6.12

# make it available globally
pyenv global system 3.6.12

Installation

With pip:

# install package
python3 -m pip install cognitivefactory-interactive-clustering

# install spacy language model dependencies (the one you want, with version "^2.3")
python3 -m spacy download fr_core_news_sm-2.3.0 --direct

With pipx:

# install pipx
python3 -m pip install --user pipx

# install package
pipx install --python python3 cognitivefactory-interactive-clustering

# install spacy language model dependencies (the one you want, with version "^2.3")
python3 -m spacy download fr_core_news_sm-2.3.0 --direct

NB: Other spaCy language models are listed on the spaCy "Models & Languages" page. Use spaCy version "^2.3".
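
After installation, it can be useful to check that the language model is visible to the Python interpreter you intend to use. The sketch below relies only on the standard library and on the fact that spaCy models are installed as regular Python packages; the helper name `is_model_installed` is hypothetical.

```python
import importlib.util


def is_model_installed(model_name):
    """Return True if a top-level package named `model_name` can be found."""
    return importlib.util.find_spec(model_name) is not None


if not is_model_installed("fr_core_news_sm"):
    print("fr_core_news_sm is not installed for this interpreter.")
```

Run this with the same interpreter that runs the package, since a model installed in one environment is not visible from another.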

Development

To work on this project or contribute to it, please read the Copier PDM documentation.

Quick setup and help

Get the code and prepare the environment:

git clone https://github.com/cognitivefactory/interactive-clustering/
cd interactive-clustering
make setup

Show the help:

make help  # or just make

For more details, read the Contributing documentation.

References

  • Interactive Clustering:

    • Theory and Implementation: Schild, E., Durantin, G., Lamirel, J.C., & Miconi, F. (2021). Conception itérative et semi-supervisée d'assistants conversationnels par regroupement interactif des questions. In EGC 2021 - 21èmes Journées Francophones Extraction et Gestion des Connaissances. Edition RNTI. ⟨hal-03133007⟩
    • Methodological instructions: Schild, E., Durantin, G., & Lamirel, J.C. (2021). Concevoir un assistant conversationnel de manière itérative et semi-supervisée avec le clustering interactif. In Atelier - Fouille de Textes - Text Mine 2021 - En conjonction avec EGC 2021. ⟨hal-03133060⟩
  • Constraints and Constrained Clustering:

    • Constraints in clustering: Wagstaff, K. and C. Cardie (2000). Clustering with Instance-level Constraints. Proceedings of the Seventeenth International Conference on Machine Learning, 1103–1110.
    • Survey on Constrained Clustering: Lampert, T., T.-B.-H. Dao, B. Lafabregue, N. Serrette, G. Forestier, B. Cremilleux, C. Vrain, and P. Gancarski (2018). Constrained distance based clustering for time-series: a comparative and experimental study. Data Mining and Knowledge Discovery 32(6), 1663–1707.
    • KMeans Clustering:
      • KMeans Clustering: MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability 1(14), 281–297.
      • Constrained 'COP' KMeans Clustering: Wagstaff, K., C. Cardie, S. Rogers, and S. Schroedl (2001). Constrained K-means Clustering with Background Knowledge. International Conference on Machine Learning.
    • Hierarchical Clustering:
      • Hierarchical Clustering: Murtagh, F. and P. Contreras (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery 2, 86–97.
      • Constrained Hierarchical Clustering: Davidson, I. and S. S. Ravi (2005). Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. Springer, Berlin, Heidelberg 3721, 12.
    • Spectral Clustering:
      • Spectral Clustering: Ng, A. Y., M. I. Jordan, and Y. Weiss (2002). On Spectral Clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. MIT Press.
      • Constrained 'SPEC' Spectral Clustering: Kamvar, S. D., D. Klein, and C. D. Manning (2003). Spectral Learning. Proceedings of the international joint conference on artificial intelligence, 561–566.
  • Preprocessing and Vectorization:

    • spaCy: Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
      • spaCy language models: https://spacy.io/usage/models
    • NLTK: Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O’Reilly Media Inc.
      • NLTK 'SnowballStemmer': https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.snowball
    • Scikit-learn: Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
      • Scikit-learn 'TfidfVectorizer': https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

