WordReduce

This package implements two classes, WordReduce and WordReduceLabeler, that can be used in Natural Language Processing (NLP) for (i) self-supervised explicit dimensionality reduction, (ii) parameter-free clustering, and (iii) self-supervised multilabel classification.

Key concept

WordReduce encodes a collection of raw texts (unstructured data) into a matrix (structured data) with a pre-defined number of dimensions.

The user provides the desired number of output dimensions. WordReduce then determines which words in the document collection best summarize the data, and maps all documents onto the coordinates defined by those words.
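To make "coordinates defined by words" concrete, here is a minimal sketch in plain Python. It does not call the package: the `schema` list is hand-picked for illustration (WordReduce would select it automatically), the helper name `project_onto_schema` is hypothetical, and raw counts are used as cell values purely for simplicity.

```python
from collections import Counter


def project_onto_schema(docs, schema):
    """Map each tokenized document to a row of counts, one column per schema word."""
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        rows.append([counts[word] for word in schema])
    return rows


# Toy tokenized documents (an assumed input format for this sketch).
docs = [
    ["cheap", "flight", "to", "rome"],
    ["hotel", "in", "rome", "near", "rome", "station"],
    ["cheap", "hotel", "deal"],
]

# Hand-picked stand-in for the words WordReduce would select.
schema = ["cheap", "hotel", "rome"]

matrix = project_onto_schema(docs, schema)
for row in matrix:
    print(row)
```

Every column of the resulting matrix corresponds to an actual word, which is what keeps the low-dimensional representation readable.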

Usage

Low-dimensional Vectorization

from wordreduce import WordReduce
# retokenized: the document collection to encode (defined elsewhere)
wr = WordReduce(schema_size=100, max_df=0.01, min_df=10)
low_dim_matrix = wr.fit_transform(retokenized)

Multilabel Classification

from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
bags_of_words = wrl.fit_transform(retokenized)

Clustering

from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
cluster_ids = wrl.fit_clusterize(retokenized)

Technical Description

Motivation: feature selection versus dimensionality reduction

Linguistic data can be transformed into structured data trivially using the Bag-of-Words (BOW) model. However, the resulting representations are high-dimensional, and cannot be easily used for other types of analysis in the context of data science problems.

High-dimensional spaces can be transformed into low-dimensional ones using dimensionality reduction techniques (e.g. LDA, PCA, NMF, SVD). However, these methods work by projecting an observable space onto a latent space and, as a result, end up as black boxes: the original structure is lost, along with its meaning, which again hampers further analysis.

WordReduce addresses this problem by returning an observable space of the desired dimensionality. Hence, it performs dimensionality reduction while also retaining explainability and interpretability. The exact methodology is described in detail below.

How does it work?

WordReduce

WordReduce bridges the gap between feature selection and dimensionality reduction by applying the following steps:

1. Vectorization of the input dataset into a BoW-TFIDF representation (by default).
2. Dimensionality reduction on the vectorized dataset (Non-Negative Matrix Factorization by default).
3. k-bins discretization of the latent topography resulting from the previous step. This lowers its resolution through implicit clustering and serves as a simpler version of product quantization.
4. Supervised learning of a feature selection model (a decision tree in the current implementation) using the quantized embeddings as the dependent variable. Each unique discretization is encoded as a nominal categorical label.
5. Feature selection on the original input matrix, using the decision tree trained in the preceding step to select units from the representation obtained in the first step, down to the target dimensionality requested by the user.
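The discretization step can be illustrated with a small plain-Python sketch. It is not the package's code: equal-width binning is an assumption (the actual binning strategy may differ), the `embeddings` values are made up to stand in for the output of the dimensionality reduction step, and the function names are hypothetical.

```python
def kbins_uniform(column, n_bins):
    """Assign each value in a column to one of n_bins equal-width bins."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(column)
    # min(...) keeps the column maximum inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


def discretize(embeddings, n_bins):
    """Bin every latent dimension, then give each unique bin tuple an integer ID."""
    columns = list(zip(*embeddings))
    binned_columns = [kbins_uniform(col, n_bins) for col in columns]
    bin_tuples = list(zip(*binned_columns))
    ids = {}
    # A new tuple receives the next unused integer; a known tuple reuses its ID.
    return [ids.setdefault(t, len(ids)) for t in bin_tuples]


# Made-up 2-dimensional latent vectors, one per document.
embeddings = [
    [0.9, 0.1],
    [0.8, 0.0],
    [0.1, 0.7],
    [0.0, 0.9],
    [0.5, 0.5],
]

labels = discretize(embeddings, n_bins=2)
print(labels)            # one nominal label per document
print(len(set(labels)))  # number of distinct discretizations
```

Documents whose latent vectors fall into the same combination of bins share a label, which is the sense in which discretization acts as implicit clustering.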

WordReduceLabeler

WordReduceLabeler builds on top of WordReduce: it invokes WordReduce implicitly to perform steps 1-5, but then returns a different output. Two options are available:

When the class's transform or fit_transform methods are invoked, the class takes the original dataset as input and, for every document, returns the list of words in that document that were selected as features for describing the data.
When the class's clusterize or fit_clusterize methods are invoked, an integer is returned for every input document, corresponding to that document's discretization as computed in step 3 above.
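As a rough picture of the first option, the sketch below returns, for each document, the selected words it contains. It is plain Python and does not call the package; the `schema` is hand-picked, the helper name `label_documents` is hypothetical, and deduplicating labels in order of first appearance is an assumption about the output format.

```python
def label_documents(docs, schema):
    """Return, for each document, the schema words it contains (first-appearance order)."""
    selected = set(schema)
    labeled = []
    for tokens in docs:
        seen = []
        for tok in tokens:
            if tok in selected and tok not in seen:
                seen.append(tok)
        labeled.append(seen)
    return labeled


# Toy tokenized documents (an assumed input format for this sketch).
docs = [
    ["cheap", "flight", "to", "rome"],
    ["hotel", "in", "rome", "near", "rome", "station"],
    ["train", "to", "paris"],
]

# Hand-picked stand-in for the words selected as features.
schema = ["cheap", "hotel", "rome"]

labels_per_doc = label_documents(docs, schema)
for doc_labels in labels_per_doc:
    print(doc_labels)
```

Because each document can receive zero, one, or several words, the output behaves like a multilabel classification over the selected vocabulary.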

Questions

Why parameter-free clustering? Unlike e.g. k-means, where the number of clusters k must be provided by the user, WordReduce relies on the discretization step and infers the target number of clusters empirically as a byproduct of that step.
Why not feature selection? Because no dimensions are expected as input: the input is raw text, so the output dimensions are not a subset of any input dimensions supplied by the user.
Why not dimensionality reduction? Because the output dimensions are transparent and interpretable. The latent topography is only used for self-supervision, and it is not used as the output schema.
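The self-supervision idea, in which cluster labels derived from the latent space serve only as a target for choosing words, can be sketched as follows. This is not the package's method: it swaps in a simple information-gain ranking on word presence as a stand-in for the decision tree, and the documents, pseudo-labels, and function names are all made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Entropy reduction in `labels` when documents are split on presence of `word`."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without_word = [lab for doc, lab in zip(docs, labels) if word not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_word, without_word)
        if part
    )
    return entropy(labels) - remainder


def select_words(docs, labels, schema_size):
    """Keep the schema_size words that best predict the pseudo-labels."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    ranked = sorted(
        vocabulary,
        key=lambda w: information_gain(docs, labels, w),
        reverse=True,
    )
    return ranked[:schema_size]


# Toy tokenized documents and made-up pseudo-labels standing in for
# the discretized latent space (one integer per document).
docs = [
    ["flight", "rome", "cheap"],
    ["flight", "paris"],
    ["flight", "rome"],
    ["hotel", "paris", "cheap"],
    ["hotel", "rome"],
    ["hostel", "paris"],
]
pseudo_labels = [0, 0, 0, 1, 1, 1]

schema = select_words(docs, pseudo_labels, schema_size=2)
print(schema)
```

The pseudo-labels are discarded after selection: the returned schema consists only of observable words, so the latent space never appears in the output.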

Testing

$ cd wordreduce
$ python -m tests.wordreduce
$ pytest tests/wordreduce.py
$ pytest tests/*

Build

Instructions for building the package

Build the package before uploading: python -m build (run from the "wordreduce" directory).
Upload the package to PyPI: python -m twine upload --repository {pypi|testpypi} dist/*
Install the package from PyPI: python -m pip install --index-url {https://test.pypi.org/simple|https://pypi.org/simple} --no-deps wordreduce
If any dependencies are required, edit the pyproject.toml file: under the "[project]" table, add a dependencies key with a List[str] value, where each string is a pip-readable dependency.

Project details

Version 0.0.1, released Dec 25, 2024
