Python command line application to add text features to a CSV or TSV dataset.

texturizer

Status - Functional

This application adds features to a dataset by processing the content of existing text columns. It is designed for bespoke and unusual features that n-gram or word-embedding approaches do not identify well.

It will accept a CSV, TSV or XLS file and output an extended version of the dataset with additional columns appended. When run with default settings it will produce a small number of very simple numerical summaries.
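As a rough illustration of what "appending simple numerical summaries" means, the sketch below reads a small CSV and adds two columns per text column. The function name `add_simple_features` and the column suffixes `_length` and `_word_count` are assumptions made for this example; they are not texturizer's actual API or output names.

```python
import csv
import io


def add_simple_features(rows, columns):
    """Return copies of rows with length and word-count columns appended.

    Hypothetical helper: the feature names are illustrative only.
    """
    extended = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            text = row.get(col) or ""
            new_row[f"{col}_length"] = len(text)
            new_row[f"{col}_word_count"] = len(text.split())
        extended.append(new_row)
    return extended


# A tiny in-memory dataset standing in for a real input file.
raw = "question,answer\nWhat is this?,A small example.\n"
rows = list(csv.DictReader(io.StringIO(raw)))
result = add_simple_features(rows, ["question", "answer"])

# Write the extended dataset out as CSV, original columns first.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(result[0].keys()))
writer.writeheader()
writer.writerows(result)
print(out.getvalue())
```

The original columns are kept untouched and the new ones are added at the end, which mirrors the "extended version of the dataset" described above.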

Additional feature flags unlock features that are more computationally intensive and generally domain specific.

Additional detail is available in the documentation.

TODO

Current features are all derived from single records. Future development will add
features that are computed relative to a corpus.

* Add capacity to generate features relative to corpus averages
* Add capacity for comparison features to be generated relative to reference text(s)
* Investigate functionality for working with unix shell pipes and streams

Distribution

Released and distributed via setuptools/PyPI/pip for Python 3.

Resources & Dependencies

For part-of-speech tagging we use spaCy.

Note: After installation you will need to have spaCy download the English model:

sudo python3 -m spacy download en

For string-based text comparisons we use jellyfish and textdistance.
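To make the idea of a string-based comparison concrete, here is a pure-Python sketch of Levenshtein edit distance, one of the metrics that libraries such as jellyfish provide. It is a stand-in for illustration only, not texturizer's implementation; the helper `normalized_similarity` is a hypothetical addition for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # At the start of each outer iteration, previous[j] is the distance
    # between the first i-1 characters of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_similarity("question", "questions"))
```

Applied to two text columns row by row, a metric like this yields one numeric feature per record.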

Features

Each type of feature can be unlocked through the use of a specific command line switch:

-topics. Indicators for presence of words from common topics.
-topics=count. Counts of all word matches from common topics.
-pos. Part of speech proportions in the text.
-literacy. Checks for common literacy markers.
-traits. Checks for common stylistic elements or traits that suggest personality type.
-rhetoric. Checks for rhetorical devices used for persuasion.
-profanity. Profanity check flags.
-sentiment. Sentiment word counts and score.
-emoticons. Emoticon check flags.
-comparison. Cross-column comparisons using edit distance metrics.
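The difference between the indicator form (-topics) and the count form (-topics=count) can be sketched as follows. The word lists, topic names and the function `topic_features` are all hypothetical; texturizer uses its own lexicons and its matching rules may differ.

```python
import re

# Hypothetical topic lexicon, invented for this example.
TOPIC_WORDS = {
    "money": {"cash", "loan", "bank", "pay"},
    "health": {"doctor", "sick", "medicine"},
}


def topic_features(text, count=False):
    """Return one feature per topic: a 0/1 indicator, or a match count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    features = {}
    for topic, words in TOPIC_WORDS.items():
        matches = sum(1 for token in tokens if token in words)
        features[topic] = matches if count else int(matches > 0)
    return features


text = "I need to pay the bank before the loan is due."
print(topic_features(text))              # indicator mode
print(topic_features(text, count=True))  # count mode
```

Indicators only record whether a topic appears at all, while counts also capture how heavily the text leans on it.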

Usage

You can use this application in multiple ways.

Use the runner without installing the application. The following example will generate all features on the test data.

./texturizer-runner.py -columns=question,answer -pos -literacy -traits -rhetoric -profanity -emoticons -sentiment -comparison -topics=count data/test.csv > data/output.csv

This will send the time performance profile to STDERR as shown below:

Computation Time Profile for each Feature Set
---------------------------------------------
simple               0:00:00.580910
comparison           0:00:00.490972
profanity            0:00:00.507172
sentiment            0:00:03.611817
emoticons            0:00:00.387556
topics               0:00:02.778537
traits               0:00:00.262633
rhetoric             0:00:02.107620
pos                  0:00:22.130724
literacy             0:00:00.488886

As you can see, the part-of-speech (POS) features are by far the most time-consuming to generate (about 22 seconds in the profile above, against under 4 seconds for any other feature set). It is worth avoiding them on very large datasets.
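A profile of this shape can be produced by timing each feature set separately and writing the results to STDERR, so the CSV on STDOUT stays clean. The sketch below shows one way to do that; the function names and the stand-in feature functions are assumptions for illustration, not texturizer's internals.

```python
import sys
import time
from datetime import timedelta


def profile_feature_sets(feature_sets, rows):
    """Run each feature function over rows and record the elapsed time."""
    timings = {}
    for name, func in feature_sets.items():
        start = time.perf_counter()
        for row in rows:
            func(row)
        timings[name] = timedelta(seconds=time.perf_counter() - start)
    return timings


def print_profile(timings, stream=sys.stderr):
    """Write the profile to the given stream (STDERR by default)."""
    title = "Computation Time Profile for each Feature Set"
    print(title, file=stream)
    print("-" * len(title), file=stream)
    for name, elapsed in timings.items():
        # Left-align the name in a 20-character column, then the duration.
        print(f"{name:<20} {elapsed}", file=stream)


# Stand-in feature functions (hypothetical, not texturizer's own).
feature_sets = {
    "simple": lambda text: len(text),
    "word_count": lambda text: len(text.split()),
}
timings = profile_feature_sets(feature_sets, ["some text", "more text here"])
print_profile(timings)
```

Timing each feature set on its own makes it easy to decide which switches to drop when a dataset is large.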

Alternatively, you can invoke the directory as a package:

python -m texturizer -columns=question,answer data/test.csv > data/output.csv

Or simply install the package and use the command line application directly.

Installation

Installation from the source tree:

python setup.py install

(or via pip from PyPI):

pip install texturizer

You will then need to run the post-install script to install the required spaCy model (otherwise the POS features cannot be calculated).

Now the texturizer command is available:

texturizer -columns=question,answer -topics data/test.csv > data/output.csv

This will take the input CSV, calculate the simple summary features together with the topic indicators, and produce an output CSV with the features appended as new columns.
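To see which columns a run has appended, one option is to compare the header rows of the input and output files. The sketch below does this with in-memory stand-ins; the function name `appended_columns` and the feature column name `question_length` are made up for this example and do not reflect texturizer's actual output names.

```python
import csv
import io


def appended_columns(input_csv, output_csv):
    """Return the output header names that are not in the input header."""
    input_header = next(csv.reader(input_csv))
    output_header = next(csv.reader(output_csv))
    original = set(input_header)
    return [name for name in output_header if name not in original]


# In-memory stand-ins for the input and output files.
input_file = io.StringIO("question,answer\nWhy?,Because.\n")
output_file = io.StringIO("question,answer,question_length\nWhy?,Because.,4\n")
new_columns = appended_columns(input_file, output_file)
print(new_columns)
```

With real files, pass handles opened with `open(path, newline="")` in place of the `io.StringIO` objects.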

For more complicated features see the additional options (outlined above).

Acknowledgements

Python package built using the bootstrap cmdline template by jgehrcke.
