TextUnitLib: A Python library for extracting diverse text units from textual data

TextUnitLib (TUL)

A Python library for easily extracting a variety of text units from texts

Description

TextUnitLib (TUL) extracts a wide variety of text units from texts, which can be used, for example, to extend existing Natural Language Processing (NLP) applications. Beyond common text units such as words, parts of speech (POS) and named entities, TUL also extracts more specific units such as function words, stop words, contractions, numerals (e.g., spelled-out numbers), quotations, emojis and many more. These units support in-depth analyses of texts, for example stylometric analyses. TUL can also simplify the pre-processing and cleaning of texts, the construction of feature vectors, and many (corpus) linguistic tasks such as creating word/vocabulary lists, generating cloze texts and computing readability formulas. It can be used either standalone or as a building block for larger NLP applications.
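To make one of the corpus-linguistic tasks mentioned above concrete, here is a minimal sketch of cloze text generation in plain Python. It does not use TUL's API, which is not shown in this description; the function name `make_cloze`, its parameters and the simple word pattern are all assumptions made for illustration.

```python
import re


def make_cloze(text: str, every: int = 5, blank: str = "_____") -> tuple[str, list[str]]:
    """Replace every `every`-th word with a blank.

    Returns the gapped text and the list of removed words (the answer key).
    Hypothetical helper for illustration; not part of TUL.
    """
    removed: list[str] = []
    count = 0

    def _substitute(match: re.Match) -> str:
        nonlocal count
        count += 1
        if count % every == 0:
            removed.append(match.group(0))
            return blank
        return match.group(0)

    # A "word" here is a run of ASCII letters and apostrophes (a simplification).
    gapped = re.sub(r"[A-Za-z']+", _substitute, text)
    return gapped, removed


gapped, answers = make_cloze("The quick brown fox jumps over the lazy dog today", every=5)
print(gapped)
print(answers)
```

A real cloze generator would typically choose gaps by word class (for example only nouns or only function words) rather than by position, which is where POS-based text units become useful.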

Applications

TUL's feature extraction abilities allow a wide range of applications, including:

  • Text analytics and corpus linguistics (e.g., calculating text statistics, readability measures, authorship analysis)
  • Feature vector construction for many NLP tasks (in particular text classification)
  • Accessing linguistic features for visualization purposes (e.g., word clouds, plots)
  • Pre-processing and cleaning of text files in text datasets (e.g., anonymizing named entities, removing stop words, dates, URLs)
  • General framework for PDF document annotation (e.g., highlighting words by their POS tags)
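Two of the applications listed above, text cleaning and feature vector construction, can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not TUL's implementation: the word lists are tiny placeholders, and the names `clean` and `function_word_vector` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative lists (assumptions); real lists are far larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "at", "to", "in"}
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z']+")


def clean(text):
    """Strip URLs, lowercase the text, and drop stop words."""
    without_urls = URL_PATTERN.sub(" ", text)
    words = WORD_PATTERN.findall(without_urls.lower())
    return [w for w in words if w not in STOP_WORDS]


def function_word_vector(text):
    """Relative frequency of each function word, in FUNCTION_WORDS order."""
    words = WORD_PATTERN.findall(text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


print(clean("Visit https://example.org to read the manual of the tool"))
print(function_word_vector("The cat and the dog"))
```

Function word frequencies are a common choice for stylometric features because they are largely independent of topic; the resulting fixed-length vector can be passed directly to a text classifier.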

Features

  • Besides common text units (e.g., tokens, letters, numbers or POS tags), TUL also covers many less common text units
  • Provides functions to extract generic linguistic features (e.g., n-gram-based features, text units that occur x times, maximum substrings occurring within a list of text units, etc.)
  • Multilingual (TUL currently supports two languages; more will follow)
  • Automatic NLP pipeline creation (installation and loading of the spaCy models on demand)
  • No API dependency: once the spaCy models and the required Python libraries are installed, TUL can be used completely offline
  • Extensively documented source code with many examples integrated into the docstrings
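The generic features named in the list above (n-grams and text units that occur x times) can be illustrated as follows. This is a plain-Python sketch with hypothetical function names, not TUL's own API.

```python
from collections import Counter


def char_ngrams(text, n):
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_ngrams(tokens, n):
    """All contiguous n-grams over a list of tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def units_occurring(units, times):
    """Distinct units that occur exactly `times` times, in first-seen order."""
    counts = Counter(units)
    return [unit for unit in counts if counts[unit] == times]


print(char_ngrams("text", 2))
print(token_ngrams(["to", "be", "or", "not"], 2))
print(units_occurring(["to", "be", "or", "not", "to", "be"], 2))
```

The same `token_ngrams` helper works for POS-tag n-grams if it is given a list of tags instead of words.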

Categories of text units

  • Numerals: Integers, floats, decimals (0 to 9), digits, numerals, spelled-out numbers
  • Function word sub-categories: Conjunctions, auxiliary verbs, determiners, prepositions, pronouns, quantifiers
  • N-Grams: Character n-grams, word n-grams, token n-grams, POS-tag n-grams, etc.
  • Emojis: As visual pictograms or as shortcodes
  • Hapax, dis and tris legomena (text units that occur exactly once, twice or three times in a text)
  • Quotations
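Two of the categories above, legomena and numerals, lend themselves to a short illustration. The sketch below uses only the standard library; the function name `legomena` and both regular expressions are assumptions chosen for this example and are not taken from TUL.

```python
import re
from collections import Counter


def legomena(tokens):
    """Group tokens by whether they occur exactly once, twice or three times."""
    counts = Counter(tokens)
    return {
        "hapax": [t for t, c in counts.items() if c == 1],
        "dis": [t for t, c in counts.items() if c == 2],
        "tris": [t for t, c in counts.items() if c == 3],
    }


# Simplified patterns (assumptions): no signs, exponents or thousands separators.
# The lookarounds keep "12" and "50" in "12.50" from being reported as integers.
FLOAT_PATTERN = re.compile(r"(?<![\w.])\d+\.\d+(?!\w|\.\d)")
INTEGER_PATTERN = re.compile(r"(?<![\w.])\d+(?!\w|\.\d)")

tokens = "a b a c a d d e".split()
print(legomena(tokens))

text = "Chapter 3 costs 12.50 euros, chapter 10 costs 7."
print(INTEGER_PATTERN.findall(text))
print(FLOAT_PATTERN.findall(text))
```

Hapax legomena counts are often used as a vocabulary richness signal in stylometric analyses.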

Download files

Source Distribution

textunitlib-0.0.1.tar.gz (37.2 kB)

Uploaded Source

Built Distribution

textunitlib-0.0.1-py3-none-any.whl (44.6 kB)

Uploaded Python 3

File details

Details for the file textunitlib-0.0.1.tar.gz.

File metadata

  • Download URL: textunitlib-0.0.1.tar.gz
  • Size: 37.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for textunitlib-0.0.1.tar.gz:

  • SHA256: fd6238500464ac3d912ebc8c5f55f284b52da52d6a22de6bfff71e7e6b7af135
  • MD5: 7a99c5ead8adfd4aff549eb6f55e4ca7
  • BLAKE2b-256: c50318f8a2c9a72af05b98eb4d1af3c56c7b4df7f3b0ecce8fa91c6cf657d9d9

Provenance

The following attestation bundles were made for textunitlib-0.0.1.tar.gz:

Publisher: python-publish.yml on Halvani/TextUnitLib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file textunitlib-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: textunitlib-0.0.1-py3-none-any.whl
  • Size: 44.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for textunitlib-0.0.1-py3-none-any.whl:

  • SHA256: 9b4f2898698220d792dedab6a8ea1575fbbb22904556fe61c509aa2724450d16
  • MD5: afcb62c245b107e951dd3c621d78d784
  • BLAKE2b-256: c67b63cbcfd209f463b2325b918da55b4ce8a3865ab045cc1030048aef853792

Provenance

The following attestation bundles were made for textunitlib-0.0.1-py3-none-any.whl:

Publisher: python-publish.yml on Halvani/TextUnitLib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
