rapidtextprep
Fast, reusable, pandas-friendly text preprocessing utilities for NLP and machine learning workflows.
rapidtextprep provides a small public API for common text preprocessing tasks:
cleaning, normalization, stopword removal, URL/email extraction, frequency-based
word removal, feature generation, and lookup-based lemmatization. It works with
both plain Python strings and pandas.Series where vectorized processing makes
sense.
Features
- Lowercasing and whitespace normalization.
- English contraction expansion.
- Social-media abbreviation expansion.
- Accent normalization.
- HTML tag, email, URL, retweet marker, and special character removal.
- Stopword counting and removal with sentiment-aware default keep words.
- URL and email extraction.
- Basic text feature generation for pandas dataframes.
- Common and rare word removal from corpus-level word counts.
- spaCy lookup-based lemmatization without requiring en_core_web_sm or en_core_web_md.
- Chunked processing for large pandas Series.
- Optional thread- or process-based parallel chunk cleaning.
- Async wrapper functions for async applications.
Installation
Install from PyPI:
pip install rapidtextprep
Or with uv:
uv pip install rapidtextprep
The package declares its runtime dependencies in pyproject.toml, so numpy,
pandas, scikit-learn, spacy, and spacy-lookups-data are installed
automatically.
Quick Start
from rapidtextprep import clean_text, remove_stopwords
text = "RT @User: I CAN'T believe this cafe is 50% OFF!!! Visit https://shop.com"
cleaned = clean_text(text)
print(cleaned)
without_stopwords = remove_stopwords("this movie is not good but very emotional")
print(without_stopwords)
Pandas Usage
Most cleaning and normalization functions accept a pandas.Series and preserve
the original index.
import pandas as pd
from rapidtextprep import clean_text
texts = pd.Series(
[
"I CAN'T wait!!!",
"Visit https://example.com now",
"RT @user: hello #NLP",
],
name="text",
)
cleaned = clean_text(texts)
print(cleaned)
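Because the original index is preserved, the cleaned Series can be assigned straight back onto a source DataFrame. A minimal illustration, reusing the texts Series defined above:

df = texts.to_frame()
# Rows align by index, so no reordering is needed.
df["clean_text"] = clean_text(df["text"])
print(df)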
Complete Cleaning Pipeline
clean_text is the beginner-friendly alias for
get_complete_text_clean_up_batch.
from rapidtextprep import clean_text
cleaned = clean_text(
texts,
keep_stopwords=None,
extra_stopwords={"example"},
use_lemmatization=False,
chunk_size=100_000,
)
The pipeline order is (a manual equivalent is sketched after the list):
- Lowercase text.
- Expand contractions.
- Expand social-media abbreviations.
- Normalize accented characters.
- Remove HTML tags.
- Remove email addresses.
- Remove URLs.
- Remove standalone retweet markers.
- Remove special characters.
- Remove stopwords.
- Optionally lemmatize text.
- Normalize whitespace.
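For illustration, the same order can be approximated by chaining the individual utilities by hand. This is a rough sketch only; it skips the optional lemmatization step and the chunked, batched execution that clean_text handles internally:

from rapidtextprep import (
    expand_abbreviations,
    expand_contractions,
    lowercase_text,
    normalize_whitespace,
    remove_accented_chars,
    remove_email,
    remove_html_tags,
    remove_rt,
    remove_special_characters,
    remove_stopwords,
    remove_urls,
)

def manual_clean(text: str) -> str:
    # Apply each documented stage in pipeline order to a single string.
    for step in (
        lowercase_text,
        expand_contractions,
        expand_abbreviations,
        remove_accented_chars,
        remove_html_tags,
        remove_email,
        remove_urls,
        remove_rt,
        remove_special_characters,
        remove_stopwords,
        normalize_whitespace,
    ):
        text = step(text)
    return text

print(manual_clean("RT @User: I CAN'T believe this cafe is 50% OFF!!!"))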
Parallel Processing
For large pandas.Series inputs, enable parallel chunk cleaning with n_jobs.
from rapidtextprep import clean_text
cleaned = clean_text(
texts,
chunk_size=20_000,
n_jobs=5,
)
By default, parallel cleaning uses threads:
cleaned = clean_text(
texts,
chunk_size=20_000,
n_jobs=5,
parallel_backend="thread",
)
For CPU-heavy workloads, you can opt into process-based chunk cleaning:
cleaned = clean_text(
texts,
chunk_size=20_000,
n_jobs=5,
parallel_backend="process",
)
Guidance:
- Use n_jobs=1 for sequential execution.
- Use n_jobs=-1 to use all available CPU cores.
- Use parallel_backend="thread" for lower overhead.
- Use parallel_backend="process" only after benchmarking on real data.
- On Windows, process startup and pandas chunk serialization can be expensive.
When use_lemmatization=True, rapidtextprep parallelizes the pre-lemmatization
cleaning stages and then runs spaCy lemmatization once over the combined Series.
This avoids sharing the cached spaCy pipeline across worker threads or processes.
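Putting the two together, a call that parallelizes the cleaning stages and then lemmatizes once might look like this (all parameters are ones shown elsewhere in this README):

cleaned = clean_text(
    texts,
    use_lemmatization=True,
    chunk_size=20_000,
    n_jobs=4,
    parallel_backend="thread",
)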
Lemmatization
Lemmatization uses spaCy's lookup lemmatizer:
from rapidtextprep import lemmatize_text
lemmatized = lemmatize_text("cars were running faster")
print(lemmatized)
No downloadable spaCy model is required. The package builds on spacy.blank("en") with lookup lemmatization powered by spacy-lookups-data.
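That setup is roughly equivalent to the following standard spaCy v3 snippet (a sketch of the general technique, not the package's exact internals):

import spacy

# A blank English pipeline; no trained model download required.
nlp = spacy.blank("en")
# A lookup-mode lemmatizer reads its tables from spacy-lookups-data.
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()

doc = nlp("cars were running faster")
print(" ".join(token.lemma_ for token in doc))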
You can enable lemmatization in the complete cleaning pipeline:
cleaned = clean_text(
texts,
use_lemmatization=True,
lemmatize_batch_size=5_000,
n_process=1,
)
For spaCy's own multiprocessing during lemmatization, increase n_process:
cleaned = clean_text(
texts,
use_lemmatization=True,
n_process=2,
)
Async Usage
The async functions run the synchronous implementation in the event loop's default executor. This is useful when calling rapidtextprep from an async application, but it does not make CPU-bound work asynchronous internally.
from rapidtextprep import async_clean_text
cleaned = await async_clean_text(
texts,
chunk_size=20_000,
n_jobs=5,
parallel_backend="process",
)
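Conceptually, each async wrapper behaves like offloading the synchronous call to the loop's default executor. A minimal sketch of that pattern (clean_offloaded is a hypothetical helper, not the package's exact code):

import asyncio
import functools

from rapidtextprep import clean_text

async def clean_offloaded(texts, **kwargs):
    # Run the CPU-bound pipeline in a worker thread so the event loop
    # stays responsive; the work itself remains synchronous.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, functools.partial(clean_text, texts, **kwargs)
    )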
Available async wrappers:
- async_clean_text
- async_get_complete_text_clean_up_batch
- async_clean_text_column_in_chunks
Common Utilities
Normalization
from rapidtextprep import (
expand_abbreviations,
expand_contractions,
lowercase_text,
normalize_whitespace,
remove_accented_chars,
)
lowercase_text("Hello WORLD")
expand_contractions("i'm sure he won't go")
expand_abbreviations("btw idk irl")
remove_accented_chars("cafe")
normalize_whitespace(" hello world ")
Cleaning
from rapidtextprep import (
remove_email,
remove_html_tags,
remove_rt,
remove_special_characters,
remove_urls,
)
remove_email("contact test@example.com")
remove_urls("visit https://example.com now")
remove_rt("RT @user: hello")
remove_html_tags("<p>Hello</p>")
remove_special_characters("hello!!! #nlp")
Extraction
from rapidtextprep import get_email, get_urls
email_count, emails = get_email("mail test@example.com")
url_count, urls = get_urls("visit https://example.com")
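Both helpers return a (count, matches) pair, so on a pandas Series the two parts can be split into their own columns. A sketch assuming the same call signature:

import pandas as pd
from rapidtextprep import get_urls

s = pd.Series(["visit https://example.com now", "no links here"])
pairs = s.apply(get_urls)
url_count = pairs.str[0]  # number of URLs found in each row
urls = pairs.str[1]       # list of URLs extracted from each row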
Feature Generation
import pandas as pd
from rapidtextprep import get_basic_features
df = pd.DataFrame({"text": ["python is great #nlp"]})
features = get_basic_features(df, "text")
Generated columns (a quick inspection example follows the list):
- char_count
- word_count
- avg_word_length
- stopwords_count
- hashtag_count
- mentions_count
- digit_count
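Assuming get_basic_features returns the augmented DataFrame, as the example above suggests, the new columns can be inspected directly:

print(features[["char_count", "word_count", "hashtag_count"]])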
Frequency-Based Cleanup
import pandas as pd
from rapidtextprep import get_value_counts, remove_common_word, remove_rarewords
texts = pd.Series(["python is fast", "python is popular"])
word_counts = get_value_counts(texts)
remove_common_word("python is fast", word_counts, n_words=1)
remove_rarewords("python is popular", word_counts, n_words=1)
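Because the counts are computed once at corpus level, the removers can be mapped across the whole Series, assuming the signatures shown above:

cleaned = texts.apply(lambda t: remove_common_word(t, word_counts, n_words=1))
print(cleaned)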
Public API Overview
Recommended beginner-friendly names:
- clean_text
- async_clean_text
- lemmatize_text
- lowercase_text
- expand_contractions
- expand_abbreviations
- normalize_whitespace
Compatibility names are also preserved, including:
- get_complete_text_clean_up_batch
- clean_text_column_in_chunks
- get_lemmatize_text_fast
- get_lower_case
- get_contraction_to_expansion
- get_expand_abbreviations
- remove_multiple_whitespaces
Benchmarking
A simple benchmark script is included for local testing:
uv run python benchmarks/benchmark_pipeline.py --rows 100000 --chunk-size 20000 --n-jobs 5 --backend thread
Compare thread and process backends:
uv run python benchmarks/benchmark_pipeline.py --rows 100000 --chunk-size 20000 --n-jobs 5 --backend thread --lemmatize
uv run python benchmarks/benchmark_pipeline.py --rows 100000 --chunk-size 20000 --n-jobs 5 --backend process --lemmatize
Benchmark results depend heavily on text length, CPU count, operating system, chunk size, and whether lemmatization is enabled.
Development
Clone the repository and install dependencies:
git clone https://github.com/suraj-yadav-aiml/rapidtextprep.git
cd rapidtextprep
uv sync
Run formatting, linting, and tests:
uv run ruff format .
uv run ruff check .
uv run pytest
Build the package:
uv build
Project Structure
rapidtextprep/
    src/
        rapidtextprep/
            cleaning.py
            normalization.py
            extraction.py
            features.py
            frequency.py
            lemmatization.py
            pipeline.py
            stopwords.py
            data/
    tests/
    benchmarks/
    pyproject.toml
    README.md
    LICENSE
Requirements
- Python 3.11 or newer.
- numpy
- pandas
- scikit-learn
- spacy
- spacy-lookups-data
These dependencies are installed automatically when installing the package.
License
This project is licensed under the MIT License. See LICENSE for details.
Download files
File details

Details for the file rapidtextprep-0.1.1.tar.gz.

File metadata

- Download URL: rapidtextprep-0.1.1.tar.gz
- Upload date:
- Size: 25.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | c66d362ad89eaa73a38cc3f25daf1d34194549e9cb208ab086021d96831a5cf3 |
| MD5 | 4ed7a5a219ac1472124f09c466c456ba |
| BLAKE2b-256 | da8e76ed345daff0e8c69bf33ff66d1794417f1db40cae7f158e5c92e2d654f4 |
File details

Details for the file rapidtextprep-0.1.1-py3-none-any.whl.

File metadata

- Download URL: rapidtextprep-0.1.1-py3-none-any.whl
- Upload date:
- Size: 28.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 96c9d6b4e65387e62aa870a1763739b590e412611694ea1c5434c738617c8680 |
| MD5 | 42a4d8546eed18ed8f1a4d4df7b192ee |
| BLAKE2b-256 | 84e129e5dd40d947c37e21799bc046bd13c4a2a50eb5349077ee22e770b983f9 |