WordReduce
This package implements two classes, WordReduce and WordReduceLabeler, that can be used in Natural Language Processing (NLP) for (i) self-supervised explicit dimensionality reduction, (ii) parameter-free clustering, and (iii) self-supervised multilabel classification.
Key concept
WordReduce encodes a collection of raw texts (unstructured data) into a matrix (structured data) with a pre-defined number of dimensions.
The user is expected to provide the desired number of output dimensions. WordReduce then determines which words in the document collection best summarize the data, and maps all documents into the set of coordinates defined by those words.
Usage
Low-dimensional Vectorization
from wordreduce import WordReduce
wr = WordReduce(schema_size=100, max_df=0.01, min_df=10)
low_dim_matrix = wr.fit_transform(retokenized)
Multilabel Classification
from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
bags_of_words = wrl.fit_transform(retokenized)
Clustering
from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
cluster_ids = wrl.fit_clusterize(retokenized)
Technical Description
Motivation: feature selection versus dimensionality reduction
Linguistic data can be transformed into structured data trivially using the Bag-of-Words (BOW) model. However, the resulting representations are high-dimensional, and cannot be easily used for other types of analysis in the context of data science problems.
High-dimensional spaces can be transformed into low-dimensional ones using dimensionality reduction techniques (e.g. LDA, PCA, NMF, SVD). However, these methods work by projecting an observable space onto a latent space and, as a result, end up as black boxes: the original structure is lost, along with its meaning, which again hampers further analysis.
WordReduce addresses this problem by returning an observable space of the desired dimensionality. Hence, it performs dimensionality reduction while also retaining explainability and interpretability. The exact methodology is described in detail below.
How does it work?
WordReduce
WordReduce bridges the gap between feature selection and dimensionality reduction by applying the following steps:
- Vectorization of the input dataset into a BoW-TFIDF representation (by default).
- Dimensionality reduction on the vectorized dataset (Non-Negative Matrix Factorization by default).
- k-bins discretization of the latent topography resulting from the previous step. This lowers its resolution through implicit clustering and serves as a simpler version of product quantization.
- Supervised learning of a feature selection model (a decision tree in the current implementation) using the quantized embeddings as the dependent variable. Each unique discretization is encoded as a nominal category.
- Feature selection on the input representation obtained in the first step: the decision tree trained in the preceding step selects units from that representation, down to the target dimensionality requested by the user.
WordReduceLabeler
WordReduceLabeler builds on top of WordReduce: it invokes WordReduce implicitly to perform the five steps above (steps 1-5), but then returns a different output. Two options are available:
- When this class's `transform` or `fit_transform` methods are invoked, the class takes the original dataset as input and, for every document, returns the list of words in that document that were selected as features for describing the data.
- When the class's `clusterize` or `fit_clusterize` methods are invoked, an integer is returned for every input document, corresponding to that document's discretization as computed in step 3 above.
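The first output option can be illustrated as follows. This is a sketch only: the helper `labels_per_document` is hypothetical, the matrix and word list are toy values, and treating any weight above zero as "the word occurs in the document" is an assumption rather than the package's documented behaviour.

```python
import numpy as np


def labels_per_document(reduced, selected_words):
    """Hypothetical helper: list the selected words with non-zero weight per row."""
    reduced = np.asarray(reduced)
    return [
        [word for word, weight in zip(selected_words, row) if weight > 0]
        for row in reduced
    ]


# Toy reduced matrix: 3 documents described by 3 selected words.
reduced = np.array([[0.7, 0.0, 0.2], [0.0, 0.0, 0.9], [0.0, 0.0, 0.0]])
selected_words = ["cat", "dog", "stock"]
print(labels_per_document(reduced, selected_words))
# [['cat', 'stock'], ['stock'], []]
```

A document containing none of the selected words receives an empty label list, as the third row shows.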
Questions
- Why parameter-free clustering? Unlike e.g. k-means, where the number of clusters k must be provided by the user, `WordReduce` relies on the discretization step and infers the number of clusters empirically as a byproduct of that step.
- Why not feature selection? Because the input is raw text and no dimensions are expected as input, so the output dimensions are not a subset of any input dimensions.
- Why not dimensionality reduction? Because the output dimensions are transparent and interpretable. The latent topography is only used for self-supervision, and it is not used as the output schema.
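To make the parameter-free clustering answer concrete, the sketch below shows how cluster ids and the number of clusters can both fall out of a discretized matrix without the user supplying k. The helper `cluster_ids_from_bins` and the toy bin matrix are hypothetical; the package may derive its ids differently.

```python
import numpy as np


def cluster_ids_from_bins(binned):
    """Hypothetical helper: map each distinct row of bin indices to one integer id."""
    unique_rows, ids = np.unique(np.asarray(binned), axis=0, return_inverse=True)
    # np.ravel keeps the ids one-dimensional across NumPy versions.
    return np.ravel(ids), len(unique_rows)


# Toy discretized latent space: 5 documents, 2 latent dimensions, 2 bins each.
binned = np.array([[0, 1], [0, 1], [1, 0], [1, 1], [1, 0]])
ids, n_clusters = cluster_ids_from_bins(binned)
print(ids.tolist(), n_clusters)
# [0, 0, 1, 2, 1] 3
```

With b bins over d latent dimensions there are at most b^d possible clusters, but only the combinations that actually occur in the data receive an id.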
Testing
$ cd wordreduce
$ python -m tests.wordreduce
$ pytest tests/wordreduce.py
$ pytest tests/*
Build
Instructions for building the package
- Build the package before uploading: `python -m build` (from "wordreduce").
- Upload the package to PyPI: `python -m twine upload --repository {pypi|testpypi} dist/*`
- Install the package from PyPI: `python -m pip install --index-url {https://test.pypi.org/simple|https://pypi.org/simple} --no-deps wordreduce`
- If any dependencies are required, edit the `pyproject.toml` file, "[project]" field, and add a `dependencies` key with a `List[str]` value, where each string is a `pip`-readable dependency.
File details
Details for the file wordreduce-0.0.1.tar.gz.
File metadata
- Download URL: wordreduce-0.0.1.tar.gz
- Upload date:
- Size: 5.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6bfb965b6f8becc12b567fced41f906ef182ef2a41e3e27e0dc0f1009ec50522 |
| MD5 | 5d5e3f2f2a95993d2862b5799638f660 |
| BLAKE2b-256 | 8e81cb7debf434953803e9b1b5d9726d13b190ef336ecf42b581450ccfbd395d |
File details
Details for the file wordreduce-0.0.1-py3-none-any.whl.
File metadata
- Download URL: wordreduce-0.0.1-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd83fe198964686978e47048d6de13faec1bc0342472cfd7beaad93d9ba739de |
| MD5 | 8c037b2621a2c9e28227f19270f351c9 |
| BLAKE2b-256 | f821874869233db5122ea919e99d6cbf194e431b25ab9d50740bee0555e05686 |