Skip to main content

A text benchmarking tool for writers and data scientists alike.

Project description

Mad Hatter

A text analysis package for authors and data scientists alike.

Mad Hatter is a Python package that provides a variety of text analysis tools. It is designed to be used by both authors and data scientists alike. It is currently in development and is not yet ready for use. The package provides the following features for text analysis:

  • Simple features (word count, sentence count, average tokens per sentence, average word length, average sentence length, etc.)
  • Advanced psycholinguistically-motivated measures (concreteness, imageability, rare word usage, etc.)
  • Context-dependent LLM-based features (surprisal, predictability, etc.)

... All optimized for data analysis, graphing, and easy integration with existing tools!

Installation

Run the following command to install the package and its dependencies:

pip install madhatter

Using NLTK features

We highly recommend also running NLTK's downloader module in order to have access to all of the features that Mad Hatter provides. To do so, simply run the following command:

python -m nltk.downloader all

Usage

The package provides high-level abstractions for text analysis that can be used with any text. The following example shows how to use the package to analyze a simple text file:

from madhatter.benchmark import CreativityBenchmark

text = "The quick brown fox jumped over the lazy dog."
bench = CreativityBenchmark(text)

bench.report()
>>> BookReport(title='unknown', nwords=10, mean_wl=3.7, mean_sl=45.0, mean_tokenspersent=10.0, prop_contentwords=0.1, mean_conc=4.0633333333333335, mean_img=5.359999999999999, mean_freq=-1.6792249660842167, prop_pos={'ADJ': 0.2, 'NOUN': 0.3, 'VERB': 0.1}, surprisal=None, predictability=None)

Command Line Interface

Mad Hatter is also available as a CLI tool. Simply provide a filename to the CLI and it will generate a report for you. The following example shows how to use the CLI to generate a report for a text file:

> python -m madhatter -h
usage: madhatter [-h] [-p] [-u] [-m MAXTOKENS] [-c CONTEXT] [-t TITLE] [-d TAGSET] filename

A command-line utility for generating book project reports.

positional arguments:
  filename              text file to parse

options:
  -h, --help            show this help message and exit
  -p, --postag          whether to return a POS tag distribution over the whole text
  -u, --usellm          whether to run GPU-intensive LLMs for additional characteristics
  -m MAXTOKENS, --maxtokens MAXTOKENS
                        maximum number of predicted tokens for the heavyweight metrics. Tokens start from the beginning of text, -1 to read until the
                        end
  -c CONTEXT, --context CONTEXT
                        context length for sliding window predictions as part of heavyweight metrics
  -t TITLE, --title TITLE
                        optional title to use for the report project.
  -d TAGSET, --tagset TAGSET
                        tagset to use

Advanced Usage

You may also choose to use the package's lower-level functions to create your own custom analysis pipeline or integrate with NLP packages such as SpaCy.

from madhatter import metrics
from madhatter import benchmark

text = "The quick brown fox jumped over the lazy dog."
bench = benchmark.CreativityBenchmark(text)

bench.words
>>> ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']

metrics.imageability(bench.words)
>>> [1.41, 2.45, 3.14, 4.2, 3.4, 3.65, 1.41, 2.42, 4.1, 0.0]

Of course, feel free to also contribute to the package's development by opening an issue or submitting a pull request!

License

The project is released under the MIT license. See LICENSE for more information.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

madhatter-0.0.1.tar.gz (19.1 kB view details)

Uploaded Source

Built Distribution

madhatter-0.0.1-py3-none-any.whl (19.6 kB view details)

Uploaded Python 3

File details

Details for the file madhatter-0.0.1.tar.gz.

File metadata

  • Download URL: madhatter-0.0.1.tar.gz
  • Upload date:
  • Size: 19.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for madhatter-0.0.1.tar.gz
Algorithm Hash digest
SHA256 cddc75a14b5038391a66e5e02a4bd5b28994983d606ae63ece3d01cc866576d6
MD5 81f24bbd1e788f2b32448b47b393e6e1
BLAKE2b-256 0341ca3006a3d6ae5853f75bdf7f4cddbc0f82252b908e7bb001b0c95438c68a

See more details on using hashes here.

File details

Details for the file madhatter-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: madhatter-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 19.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for madhatter-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d4fa2c7494ee3e2b7223335ff9012d0d2a2d0a293a45e0dff8ec11a535d58b9b
MD5 4e986a24e2fc46d6c455e4ee680913e2
BLAKE2b-256 01cc6e7075e19205748f575ba46248bdd980dfda341b228382b106a1b3ab307c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page