TextUnitLib: A Python library for extracting diverse text units from textual data

These details have been verified by PyPI

Maintainers

Halvani

Project description

TextUnitLib (TUL)

A Python library for easy extraction of a variety of text units from texts

Description

TextUnitLib (TUL) makes it easy to extract a wide variety of text units from texts, for example to extend existing Natural Language Processing (NLP) applications. Besides common text units such as words, parts of speech (POS) and named entities, it can extract more specific units such as function words, stop words, contractions, numerals (e.g., spelled-out numbers), quotations, emojis and many more. These units support in-depth analyses of texts (e.g., stylometric analyses). TUL can also simplify the pre-processing and cleaning of texts, the construction of feature vectors, and many (corpus) linguistic tasks such as creating word/vocabulary lists, generating cloze texts and computing readability formulas. TUL can be used either standalone or as a building block for larger NLP applications.
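
To make one of the tasks mentioned above concrete, here is a minimal sketch of cloze text generation that blanks every n-th word. It uses only the Python standard library and does not call TUL; the function name `make_cloze` and its parameters are hypothetical and chosen purely for illustration.

```python
import re

def make_cloze(text, every=4, blank="____"):
    """Replace every `every`-th word with a blank.

    Returns the cloze text and the list of removed words (the answer key).
    """
    removed = []
    counter = 0

    def _blank(match):
        nonlocal counter
        counter += 1
        if counter % every == 0:
            removed.append(match.group(0))
            return blank
        return match.group(0)

    # \w+ is a crude word pattern; a real tokenizer would treat
    # contractions, hyphenated words, etc. more carefully.
    cloze = re.sub(r"\w+", _blank, text)
    return cloze, removed

cloze, answers = make_cloze("The quick brown fox jumps over the lazy dog.", every=3)
print(cloze)    # The quick ____ fox jumps ____ the lazy ____.
print(answers)  # ['brown', 'over', 'dog']
```

A library-backed version could instead pick the blanked words by text unit type (for example, only nouns or only function words) rather than by position.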

Applications

TUL's feature extraction abilities allow a wide range of applications, including:

Text analytics and corpus linguistics (e.g., calculating text statistics, readability measures, authorship analysis)
Feature vector construction for many NLP tasks (in particular text classification)
Accessing linguistic features for visualization purposes (e.g., word clouds, plots)
Pre-processing and cleaning of text files within text datasets (e.g., anonymizing named entities, removing stopwords, dates, URLs)
General framework for PDF document annotations (e.g., highlighting words by their POS-tags)
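
As an illustration of the pre-processing use case in the list above, the following sketch removes URLs and stop words from a sentence. It is plain standard-library Python, not TUL's API; `clean_text`, `URL_RE` and the tiny `STOPWORDS` set are assumptions made for this example.

```python
import re

# Simplistic URL pattern: anything starting with http:// or https://
# up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")

# A deliberately tiny stop word list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "at"}

def clean_text(text):
    """Strip URLs, tokenize on letters/apostrophes, and drop stop words."""
    no_urls = URL_RE.sub("", text)
    tokens = re.findall(r"[A-Za-z']+", no_urls)
    return [t for t in tokens if t.lower() not in STOPWORDS]

cleaned = clean_text("The docs are at https://example.org and the code is public.")
print(cleaned)  # ['docs', 'are', 'code', 'public']
```

In practice the stop word list and the tokenizer would come from a language-specific resource rather than being hard-coded.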

Features

Besides common text units (e.g., tokens, letters, numbers or POS-tags), TUL also covers many less common text units
Provides functions to extract generic linguistic features (e.g., n-gram-based features, text units that occur x times, maximum substrings occurring within a list of text units)
Multilingual (TUL currently supports two languages; more will follow)
Automatic NLP pipeline creation (spaCy models are installed and loaded on demand)
No API dependency: apart from the spaCy models and required Python libraries, TUL can be used completely offline
Extensively documented source code with many examples integrated into the docstrings
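
Two of the generic features named in the list above, n-grams and text units that occur x times, can be sketched in a few lines of standard-library Python. The helper names `ngrams` and `units_occurring` are hypothetical and do not reflect TUL's actual function names.

```python
from collections import Counter

def ngrams(units, n):
    """Return all contiguous n-grams over a sequence (a string or a list)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

def units_occurring(units, times):
    """Return the distinct units that occur exactly `times` times, sorted."""
    counts = Counter(units)
    return sorted(u for u, c in counts.items() if c == times)

tokens = "to be or not to be".split()

word_bigrams = ngrams(tokens, 2)
char_trigrams = ["".join(g) for g in ngrams("text", 3)]
twice = units_occurring(tokens, 2)

print(word_bigrams[:2])  # [('to', 'be'), ('be', 'or')]
print(char_trigrams)     # ['tex', 'ext']
print(twice)             # ['be', 'to']
```

Because `ngrams` only relies on slicing, the same function covers character, word, token and POS-tag n-grams, depending on what sequence is passed in.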

Categories of text units

Numerals: Integers, floats, decimals (0 to 9), digits, numerals, spelled-out numbers
Function word sub-categories: Conjunctions, auxiliary verbs, determiners, prepositions, pronouns, quantifiers
N-Grams: Character n-grams, word n-grams, token n-grams, POS-tag n-grams, etc.
Emojis: As visual pictograms or as shortcodes
Hapax, dis and tris legomena (text units that occur exactly once, twice or three times)
Quotations
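
To give a feel for a few of these categories, the sketch below pulls function words, integers, floats and quotations out of one sentence. It is a rough standard-library approximation, not TUL's implementation: the `FUNCTION_WORDS` lexicon is a tiny hand-made sample, and `categorize` and the regular expressions are assumptions for this example.

```python
import re

# Tiny, hand-made lexicon for illustration; real lists are far larger.
FUNCTION_WORDS = {
    "conjunctions": {"and", "but", "or"},
    "determiners": {"the", "a", "an"},
    "prepositions": {"in", "on", "of", "for"},
    "pronouns": {"he", "she", "it", "they"},
}

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")  # integers and simple floats
QUOTE_RE = re.compile(r'"([^"]+)"')       # text between straight double quotes

def categorize(text):
    """Group a few kinds of text units found in `text` by category."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    result = {
        category: [w for w in words if w in lexicon]
        for category, lexicon in FUNCTION_WORDS.items()
    }
    numbers = NUMBER_RE.findall(text)
    result["integers"] = [n for n in numbers if "." not in n]
    result["floats"] = [n for n in numbers if "." in n]
    result["quotations"] = QUOTE_RE.findall(text)
    return result

units = categorize('She paid 3 euros and 2.50 for the tea, but said "it was fine".')
for category, found in units.items():
    print(category, found)
```

Lexicon lookups like this are context-blind; telling apart, say, "for" as a preposition from "for" as a conjunction requires POS information from an NLP pipeline.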

Release history

This version

0.0.1

Nov 16, 2025

Download files

Download the file for your platform.

Source Distribution

textunitlib-0.0.1.tar.gz (37.2 kB)

Uploaded Nov 16, 2025 Source

Built Distribution

textunitlib-0.0.1-py3-none-any.whl (44.6 kB)

Uploaded Nov 16, 2025 Python 3
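
The wheel file name follows the standard pattern `{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl`. The sketch below splits a name into those fields; the function here is a simplified illustration written for this page, and for production use the `packaging` library's `packaging.utils.parse_wheel_filename` is likely the more robust choice.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # 5 parts without a build tag, 6 parts with one.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of components in {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "distribution": parts[0],
        "version": parts[1],
        "build": parts[2] if len(parts) == 6 else None,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

info = parse_wheel_filename("textunitlib-0.0.1-py3-none-any.whl")
print(info)
```

For this release the tags are `py3`, `none` and `any`: a pure-Python wheel that is not tied to a specific interpreter ABI or platform.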

File details

Details for the file textunitlib-0.0.1.tar.gz.

File metadata

Download URL: textunitlib-0.0.1.tar.gz
Upload date: Nov 16, 2025
Size: 37.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for textunitlib-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`fd6238500464ac3d912ebc8c5f55f284b52da52d6a22de6bfff71e7e6b7af135`
MD5	`7a99c5ead8adfd4aff549eb6f55e4ca7`
BLAKE2b-256	`c50318f8a2c9a72af05b98eb4d1af3c56c7b4df7f3b0ecce8fa91c6cf657d9d9`
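
A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a minimal sketch; the helper names `sha256_of_file` and `verify_sha256` are made up for this example, and the local path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest matches `expected_hex`."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Usage (assumes the archive is in the current directory):
# ok = verify_sha256(
#     "textunitlib-0.0.1.tar.gz",
#     "fd6238500464ac3d912ebc8c5f55f284b52da52d6a22de6bfff71e7e6b7af135",
# )
```

Installers such as pip can enforce the same check automatically when hashes are pinned in a requirements file and `--require-hashes` is used.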

Provenance

The following attestation bundles were made for textunitlib-0.0.1.tar.gz:

Publisher: python-publish.yml on Halvani/TextUnitLib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: textunitlib-0.0.1.tar.gz
- Subject digest: fd6238500464ac3d912ebc8c5f55f284b52da52d6a22de6bfff71e7e6b7af135
- Sigstore transparency entry: 702200943
- Sigstore integration time: Nov 16, 2025
Source repository:
- Permalink: Halvani/TextUnitLib@3e1f6b6db74fa8e995b451a8e57757886077e260
- Branch / Tag: refs/tags/v.0.0.1
- Owner: https://github.com/Halvani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@3e1f6b6db74fa8e995b451a8e57757886077e260
- Trigger Event: release

File details

Details for the file textunitlib-0.0.1-py3-none-any.whl.

File metadata

Download URL: textunitlib-0.0.1-py3-none-any.whl
Upload date: Nov 16, 2025
Size: 44.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for textunitlib-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9b4f2898698220d792dedab6a8ea1575fbbb22904556fe61c509aa2724450d16`
MD5	`afcb62c245b107e951dd3c621d78d784`
BLAKE2b-256	`c67b63cbcfd209f463b2325b918da55b4ce8a3865ab045cc1030048aef853792`

Provenance

The following attestation bundles were made for textunitlib-0.0.1-py3-none-any.whl:

Publisher: python-publish.yml on Halvani/TextUnitLib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: textunitlib-0.0.1-py3-none-any.whl
- Subject digest: 9b4f2898698220d792dedab6a8ea1575fbbb22904556fe61c509aa2724450d16
- Sigstore transparency entry: 702200944
- Sigstore integration time: Nov 16, 2025
Source repository:
- Permalink: Halvani/TextUnitLib@3e1f6b6db74fa8e995b451a8e57757886077e260
- Branch / Tag: refs/tags/v.0.0.1
- Owner: https://github.com/Halvani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@3e1f6b6db74fa8e995b451a8e57757886077e260
- Trigger Event: release
