A writing-script-aware library for cleaning text for NLP, training, and analysis.

Project description

Unscript: Multilingual Text Cleaning

Unscript is a Python package designed for robust and flexible text cleaning, particularly for multilingual data. It provides functions to sanitize text by removing unwanted elements like mentions, hashtags, URLs, and emojis, and to filter text based on specific Unicode script ranges.

Installation

To install Unscript, you can use pip:

pip install unscript

Quick Start

from unscript import unscript, clean_text, clean_script

# Most common use case: complete text cleaning for a specific script
text = "Hello @user! Check https://example.com 😊 مرحبا $123.45"
result = unscript("Latn", text, {"numbers": True, "symbols": True})
print(result)  # Output: "hello check $123.45"

# For general cleaning without script filtering
clean_result = clean_text(text)
print(clean_result)  # Output: "hello ! check مرحبا $123.45"

# For script filtering only (keeps original case, URLs, mentions)
script_result = clean_script("Latn", text, {"numbers": True, "symbols": True})
print(script_result)  # Output: "Hello @user Check https //example com 😊 $123.45"

Functions

`unscript(script: str, text: str, config: dict = None, lowercase: bool = True) -> str`

This is the primary function. It combines general text cleaning and script filtering in a single pipeline: it first applies clean_text to remove mentions, URLs, and emojis, then applies clean_script to keep only characters from the specified Unicode script.

Arguments:

- script (str): The Unicode script code (e.g., 'Latn', 'Arab', 'Hans').
- text (str): The text string to be cleaned.
- config (dict, optional): Configuration for script filtering. Defaults to {'spaces': True, 'numbers': False, 'punctuation': False, 'symbols': False}.
- lowercase (bool, optional): Whether to convert text to lowercase. Defaults to True.

Returns:

str: Cleaned text containing only characters from the specified script, with mentions, URLs, and other noise removed.

Example Usage:

from unscript.unscript import unscript

# Basic usage with Latin script
text1 = "Hello @user! Check https://example.com 😊 مرحبا"
result1 = unscript("Latn", text1)
print(result1)
# Expected output: "hello check"

# Arabic script with punctuation
text2 = "مرحبا @user بالعالم! https://example.com"
result2 = unscript("Arab", text2, {"punctuation": True})
print(result2)
# Expected output: "مرحبا بالعالم!"

# Latin script with numbers and symbols
text3 = "Price: $123.45 @user!"
result3 = unscript("Latn", text3, {"numbers": True, "symbols": True})
print(result3)
# Expected output: "price $123.45"

# Preserve case
text4 = "HELLO @user WORLD!"
result4 = unscript("Latn", text4, lowercase=False)
print(result4)
# Expected output: "HELLO WORLD"
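
The order of the two stages matters. The following is a minimal sketch, not Unscript's implementation: `strip_noise` and `keep_ascii_letters` are hypothetical stand-ins for clean_text and clean_script("Latn", ...), written with deliberately simplified regular expressions. It illustrates why noise removal runs before script filtering: filtering first breaks a URL into word-like fragments that the noise patterns can no longer recognize.

```python
import re


def strip_noise(text: str, lowercase: bool = True) -> str:
    """Hypothetical stand-in for clean_text: drop URLs, mentions, hashtags."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    return text.lower() if lowercase else text


def keep_ascii_letters(text: str) -> str:
    """Hypothetical stand-in for clean_script("Latn", ...): ASCII letters only."""
    return re.sub(r"[^A-Za-z\s]", " ", text)


def collapse_spaces(text: str) -> str:
    """Collapse any run of whitespace into a single space and trim the ends."""
    return " ".join(text.split())


def noise_then_script(text: str) -> str:
    # Same stage order as described for unscript: noise removal, then filtering.
    return collapse_spaces(keep_ascii_letters(strip_noise(text)))


def script_then_noise(text: str) -> str:
    # Reversed order, shown only for comparison.
    return collapse_spaces(strip_noise(keep_ascii_letters(text)))


sample = "Hello @user! Check https://example.com"
print(noise_then_script(sample))  # hello check
print(script_then_noise(sample))  # hello user check https example com
```

In the reversed order, the mention and the URL survive as ordinary words because the characters that identified them (`@`, `://`, `.`) were already replaced with spaces.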

`clean_text(text: str, lowercase: bool = True) -> str`

This function provides a general-purpose text cleaning utility. It's designed to prepare raw text for analysis by removing common noisy elements like mentions, URLs, and emojis. Note: For script-specific filtering (removing punctuation, symbols, etc.), use clean_script or the unscript function.

Features:

- Removes @mentions, @@mentions, and +mentions.
- Removes #hashtags.
- Removes URLs (e.g., http://, https://, ftp://, www.) and email addresses.
- Removes domain names (e.g., example.com) but preserves decimal numbers (e.g., 123.45).
- Removes emojis.
- Normalizes Unicode characters.
- Converts text to lowercase (optional, controlled by the lowercase parameter).
- Collapses repeating characters to a maximum of two characters (e.g., "coooooolllll" becomes "cooll"), except for numbers.
- Replaces newlines and tabs with spaces.
- Collapses multiple spaces into single spaces.
- Returns an empty string if the cleaned text consists only of numbers.
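
The last four normalization steps in the list can be sketched with the standard `re` module. This is an illustrative approximation, not the library's code: the function name `normalize_repeats_and_spaces` is hypothetical, and the sketch assumes that "consists only of numbers" means digits and spaces, which the library may define differently.

```python
import re


def normalize_repeats_and_spaces(text: str) -> str:
    # Replace newlines and tabs with spaces.
    text = re.sub(r"[\r\n\t]", " ", text)
    # Collapse runs of three or more identical non-digit characters to two.
    # \D never matches a digit, so numbers such as 1000000 are left alone.
    text = re.sub(r"(\D)\1{2,}", r"\1\1", text)
    # Collapse multiple spaces into one and trim the ends.
    text = re.sub(r" {2,}", " ", text).strip()
    # Assumption: "only numbers" means nothing but digits and spaces remain.
    if text and all(ch.isdigit() or ch == " " for ch in text):
        return ""
    return text


print(normalize_repeats_and_spaces("coooooolllll"))           # cooll
print(normalize_repeats_and_spaces("1000000 items\n\tleft"))  # 1000000 items left
print(repr(normalize_repeats_and_spaces("12345")))            # ''
```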

Example Usage:

from unscript.unscript import clean_text

text1 = "Hello world! This is a test @user #python https://example.com 😊 coooooolllll"
cleaned_text1 = clean_text(text1)
print(cleaned_text1)
# Expected output: "hello world! this is a test cooll"

text2 = "Price is $123.45 @user"
cleaned_text2 = clean_text(text2)
print(cleaned_text2)
# Expected output: "price is $123.45"

# Preserve case
text3 = "Hello WORLD @user"
cleaned_text3 = clean_text(text3, lowercase=False)
print(cleaned_text3)
# Expected output: "Hello WORLD"

`clean_script(script: str, text: str, config: dict = None) -> str`

This function filters text to include only characters belonging to a specified Unicode script, with configurable options for numbers, punctuation, and symbols. It's ideal for tasks requiring strict script adherence.

Arguments:

- script (str): The Unicode script code (e.g., 'Latn', 'Arab', 'Hans').
- text (str): The text string to be cleaned.
- config (dict, optional): A dictionary to customize character inclusion. Defaults to {'spaces': True, 'numbers': False, 'punctuation': False, 'symbols': False}.
  - 'spaces' (bool): Include common whitespace characters (default: True).
  - 'numbers' (bool): Include digits (e.g., '0-9', Arabic, Devanagari digits) (default: False).
  - 'punctuation' (bool): Include common and script-specific punctuation marks (default: False).
  - 'symbols' (bool): Include various symbols (e.g., currency, mathematical) (default: False).

Behavior:

- Characters that do not belong to the specified script, or that are excluded by the config, are replaced with spaces.
- Multiple spaces are collapsed into a single space.
- Priority for overlapping ranges: if a character falls into multiple categories, the more specific one takes precedence (punctuation > numbers > symbols). For example, a character that is listed as both punctuation and a symbol is treated as punctuation, so it is kept only when 'punctuation' is enabled.
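
The behavior described above can be approximated in a few lines. This is a sketch under stated assumptions, not the library's implementation: `SCRIPT_RANGES` is a hypothetical, heavily truncated range table, character classes come from `unicodedata` general categories rather than the library's own tables, and the 'spaces' option is omitted (whitespace is always kept).

```python
import re
import unicodedata

# Hypothetical, heavily truncated range table (code point intervals).
SCRIPT_RANGES = {
    "Latn": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Arab": [(0x0600, 0x06FF)],
}

DEFAULTS = {"numbers": False, "punctuation": False, "symbols": False}


def clean_script_sketch(script: str, text: str, config: dict = None) -> str:
    options = {**DEFAULTS, **(config or {})}
    ranges = SCRIPT_RANGES[script]

    def keep(ch: str) -> bool:
        if ch.isspace():
            return True  # whitespace is always kept in this sketch
        category = unicodedata.category(ch)
        # Class checks come before the script range check, in the documented
        # priority order, so a digit or symbol that sits inside a script's
        # block is still governed by its own flag.
        if category.startswith("P"):
            return options["punctuation"]
        if category.startswith("N"):
            return options["numbers"]
        if category.startswith("S"):
            return options["symbols"]
        return any(lo <= ord(ch) <= hi for lo, hi in ranges)

    # Replace rejected characters with spaces, then collapse the spaces.
    filtered = "".join(ch if keep(ch) else " " for ch in text)
    return re.sub(r"\s+", " ", filtered).strip()


print(clean_script_sketch("Latn", "Hello World! 123 مرحبا"))             # Hello World
print(clean_script_sketch("Latn", "Price: $123.45", {"numbers": True}))  # Price 123 45
```

Checking the character class before the script range is what keeps, for example, the multiplication sign (U+00D7) out of Latin output by default even though it lies inside the Latin-1 letter range used here.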

Example Usage:

from unscript.unscript import clean_script

# Example 1: Latin script, no numbers or punctuation
text_latin = "Hello World! 123 مرحبا"
cleaned_latin = clean_script("Latn", text_latin)
print(cleaned_latin)
# Expected output: "Hello World"

# Example 2: Arabic script, with numbers
text_arabic = "مرحبا بالعالم 123! Hello"
cleaned_arabic = clean_script("Arab", text_arabic, {"numbers": True})
print(cleaned_arabic)
# Expected output: "مرحبا بالعالم 123"

# Example 3: Chinese script, with punctuation
text_chinese = "你好。世界！This is a test."
cleaned_chinese = clean_script("Hans", text_chinese, {"punctuation": True})
print(cleaned_chinese)
# Expected output: "你好。世界！"

# Example 4: Devanagari script, with punctuation
text_devanagari = "नमस्ते। यह है॥ 987"
cleaned_devanagari = clean_script("Deva", text_devanagari, {"punctuation": True})
print(cleaned_devanagari)
# Expected output: "नमस्ते। यह है॥"

Supported Scripts

The unscript and clean_script functions support a wide range of Unicode scripts. The table below lists the supported script codes and their common names:

Script Code	Common Name
`Latn`	Latin
`Arab`	Arabic
`Hebr`	Hebrew
`Thai`	Thai
`Khmr`	Khmer
`Hang`	Hangul (Korean)
`Hans`	Han (Simplified Chinese)
`Jpan`	Japanese (Hiragana & Katakana, Han)
`Cyrl`	Cyrillic
`Geor`	Georgian
`Deva`	Devanagari
`Beng`	Bengali
`Gujr`	Gujarati
`Guru`	Gurmukhi
`Ethi`	Ethiopic
`Grek`	Greek
`Taml`	Tamil
`Mlym`	Malayalam
`Telu`	Telugu
`Knda`	Kannada
`Orya`	Oriya
`Sinh`	Sinhala
`Mymr`	Myanmar
`Laoo`	Lao
`Tibt`	Tibetan
`Armn`	Armenian
`Thaa`	Thaana
`Mong`	Mongolian
`Viet`	Vietnamese (Latin Extended)
`Brai`	Braille
`Tfng`	Tifinagh
`Hant`	Han (Traditional Chinese)
`Cans`	Canadian Aboriginal Syllabics
`Cher`	Cherokee
`Goth`	Gothic
`Olck`	Ol Chiki
`Mtei`	Meetei Mayek
`Syrc`	Syriac
`Tale`	Tai Le
`Yiii`	Yi

Contributing

We welcome contributions to Unscript! If you'd like to contribute, please follow these steps:

1. Fork the repository on GitHub.
2. Clone your forked repository to your local machine.
3. Create a new branch for your feature or bug fix: `git checkout -b feature/your-feature-name` or `git checkout -b bugfix/your-bug-fix`.
4. Make your changes and write clear, concise commit messages.
5. Write and run tests to ensure your changes work as expected and don't introduce regressions. We do not use mocks in our tests.
6. Ensure all tests pass by running `python -m unittest` from the project root.
7. Push your changes to your forked repository.
8. Open a Pull Request to the master branch of this repository, describing your changes in detail.

License

Unscript is released under the MIT License. See the LICENSE file for more details.

Contributors

Omar Kamali

Project details

Release history

- 0.1.3: Nov 15, 2025
- 0.1.2: Nov 13, 2025
- 0.1.1: Nov 1, 2025
- 0.1.0: Oct 29, 2025
- 0.0.4: Jul 20, 2025
- 0.0.3: Jul 20, 2025
- 0.0.2: Jul 19, 2025 (this version)
- 0.0.1: Jul 19, 2025

Download files

Source Distribution

unscript-0.0.2.tar.gz (20.7 kB)

Uploaded Jul 19, 2025 Source

Built Distribution

unscript-0.0.2-py3-none-any.whl (13.4 kB)

Uploaded Jul 19, 2025 Python 3

File details

Details for the file unscript-0.0.2.tar.gz.

File metadata

Download URL: unscript-0.0.2.tar.gz
Upload date: Jul 19, 2025
Size: 20.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for unscript-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`a315c62de7402b116b562a59c51710cf10ccaf9e8fdaee51b991fb4e775edd9c`
MD5	`8367fb66ca93a28a6cc46da21c2c3037`
BLAKE2b-256	`499266be3012dcbfb2a9cf73c6d5545dff3247ecd8ff4e7a20b6d70061ebdbaf`

File details

Details for the file unscript-0.0.2-py3-none-any.whl.

File metadata

Download URL: unscript-0.0.2-py3-none-any.whl
Upload date: Jul 19, 2025
Size: 13.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for unscript-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6452c3f2bb2cae38b05e5655c99f785fd59463c81bd10974fa0c8fcd49c6bda`
MD5	`ff31c48f86a00b54e79875077fd55cab`
BLAKE2b-256	`f53ae7a17ed85b76455252f4ac2cd8c69e3dc69c68e911710a69c6fabe706dde`
