WordReduce

This package implements two classes, WordReduce and WordReduceLabeler, that can be used in Natural Language Processing (NLP) for (i) self-supervised explicit dimensionality reduction, (ii) parameter-free clustering, and (iii) self-supervised multilabel classification.

Key concept

WordReduce encodes a collection of raw texts (unstructured data) into a matrix (structured data) with a pre-defined number of dimensions.

The user provides the desired number of output dimensions. WordReduce then determines which words in the document collection best summarize the data, and maps every document onto the coordinates defined by those words.
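
To make the output format concrete, here is a minimal sketch of what "mapping documents onto the coordinates defined by words" means. It is not the package's implementation: the corpus, the `schema` list and the `encode` helper are hypothetical, the words are hard-coded instead of being chosen automatically, and raw counts are used although the real matrix may well hold TF-IDF weights.

```python
from collections import Counter

# Hypothetical toy corpus; a real collection would be much larger.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# Hypothetical schema. WordReduce selects these words automatically;
# they are hard-coded here only to illustrate the shape of the output.
schema = ["cat", "dog", "markets"]


def encode(text, schema):
    """Return one count per schema word, in schema order."""
    counts = Counter(text.split())
    return [counts[word] for word in schema]


# One row per document, one named column per schema word.
matrix = [encode(doc, schema) for doc in documents]
for row in matrix:
    print(row)
```

Every column of `matrix` corresponds to a word that actually occurs in the corpus, which is what keeps the reduced representation readable.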

Usage

Low-dimensional Vectorization

from wordreduce import WordReduce
# retokenized holds the input documents; it is assumed to be defined elsewhere
wr = WordReduce(schema_size=100, max_df=0.01, min_df=10)
low_dim_matrix = wr.fit_transform(retokenized)

Multilabel Classification

from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
bags_of_words = wrl.fit_transform(retokenized)

Clustering

from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
cluster_ids = wrl.fit_clusterize(retokenized)

Technical Description

Motivation: feature selection versus dimensionality reduction

Linguistic data can be trivially transformed into structured data using the Bag-of-Words (BoW) model. However, the resulting representations are high-dimensional (one dimension per vocabulary word), which makes them hard to use directly in downstream data science analyses.

High-dimensional spaces can be transformed into low-dimensional ones using dimensionality reduction techniques (e.g. LDA, PCA, NMF, SVD). However, these methods work by projecting an observable space onto a latent space and, as a result, end up as black boxes: the original structure is lost, along with its meaning, which again hampers further analysis.

WordReduce addresses this problem by returning an observable space of the desired dimensionality. Hence, it performs dimensionality reduction while also retaining explainability and interpretability. The exact methodology is described in detail below.

How does it work?

WordReduce

WordReduce bridges the gap between feature selection and dimensionality reduction by applying the following steps:

  1. Vectorization of the input dataset into a BoW-TFIDF representation (by default).
  2. Dimensionality reduction on the vectorized dataset (Non-Negative Matrix Factorization by default).
  3. k-bins discretization of the latent topography resulting from the previous step. This lowers its resolution through implicit clustering and serves as a simpler version of product quantization.
  4. Supervised learning of a feature selection model (a decision tree in the current implementation) using the quantized embeddings as the dependent variable. Each unique discretization is encoded as one nominal category.
  5. Feature selection on the input representation obtained in step 1, using the decision tree trained in step 4 to keep only as many units (words) as the target dimensionality requested by the user.

WordReduceLabeler

WordReduceLabeler builds on top of WordReduce: it invokes WordReduce implicitly to perform steps 1-5, but then returns a different output. Two options are available:

  1. When this class' transform or fit_transform methods are invoked, the class takes the original dataset as input and, for every document, returns the list of words in that document that were selected as features for describing the data.
  2. When the class' clusterize or fit_clusterize methods are invoked, for every input document an integer is returned, corresponding to that document's discretization as computed by step 3 above.
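
The first option can be pictured with the short sketch below. It assumes a hypothetical set of already selected words and a hypothetical helper `labels_for`; the actual return type, ordering and tokenization used by WordReduceLabeler are not documented here and may differ. The second option needs no extra logic in this sketch, since it simply exposes the per-document discretization from step 3.

```python
# Hypothetical fitted state: the words chosen as output dimensions.
selected_words = {"cat", "dog", "markets"}

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]


def labels_for(text, selected_words):
    """Return the selected words found in one document, without repeats,
    in order of first appearance."""
    found = []
    for token in text.split():
        if token in selected_words and token not in found:
            found.append(token)
    return found


# One list of labels per document: a multilabel assignment.
labels = [labels_for(doc, selected_words) for doc in documents]
for doc, doc_labels in zip(documents, labels):
    print(doc_labels, "<-", doc)
```

A document can receive zero, one or several labels, which is why the task is described as multilabel classification.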

Questions

  • Why parameter-free clustering? Unlike e.g. k-means, where the number of clusters k must be provided by the user, WordReduce relies on the discretization step and infers the target number of clusters empirically as a byproduct of that step.
  • Why not feature selection? Because the output dimensions are not a subset of any input dimensions supplied by the user: the input is raw text, and no dimensions are expected as input.
  • Why not dimensionality reduction? Because the output dimensions are transparent and interpretable. The latent topography is only used for self-supervision, and it is not used as the output schema.
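
The first answer can be illustrated with a small sketch showing how a cluster count can fall out of discretization without being requested. The latent values, the `n_bins` setting and the uniform binning over [0, 1] are assumptions made for the example, not details confirmed by the package.

```python
# Hypothetical latent coordinates for five documents (two latent dimensions).
latent = [
    [0.05, 0.90],
    [0.10, 0.80],
    [0.95, 0.10],
    [0.85, 0.20],
    [0.15, 0.85],
]
n_bins = 2  # assumed bins per latent dimension


def discretize(row, n_bins):
    """Assign each value in [0, 1] to one of n_bins equal-width bins."""
    return tuple(min(int(value * n_bins), n_bins - 1) for value in row)


cells = [discretize(row, n_bins) for row in latent]

# Only the cells that actually contain documents become clusters.
occupied = sorted(set(cells))
cluster_of = {cell: index for index, cell in enumerate(occupied)}
cluster_ids = [cluster_of[cell] for cell in cells]

max_cells = n_bins ** len(latent[0])
print(cluster_ids, f"{len(occupied)} of {max_cells} possible cells occupied")
```

The grid allows up to `n_bins ** n_dimensions` cells, but the number of clusters is the number of occupied cells, which depends on the data and not on a user-supplied k.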

Testing

$ cd wordreduce
$ python -m tests.wordreduce
$ pytest tests/wordreduce.py
$ pytest tests/*

Build

Instructions for building the package

  1. Build the package before uploading: python -m build (from "wordreduce").
  2. Upload the package to PyPI: python -m twine upload --repository {pypi|testpypi} dist/*
  3. Install the package from PyPI: python -m pip install --index-url {https://test.pypi.org/simple|https://pypi.org/simple} --no-deps wordreduce
  4. If any dependencies are required, edit the "[project]" table of the pyproject.toml file and add a dependencies key with a List[str] value, where each string is a pip-readable dependency.
