A package to preprocess text data
📄 bm-preprocessing
bm-preprocessing is a Python package providing easy-to-use NLP preprocessing utilities built on top of NLTK and pandas. It helps you clean, normalize, tokenize, and vectorize text data efficiently using a modular pipeline.
✨ Features
- Text cleaning and normalization
- Tokenization and stopword removal
- Lemmatization
- TF-IDF and Bag-of-Words vectorization
- Pipeline-based preprocessing
- Built on NLTK and pandas
- Scikit-learn–style API
📦 Installation
Install from PyPI:
```bash
pip install bm-preprocessing
```
🚀 Quick Start
Basic Usage with Pipeline
```python
from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    Normalizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer,
    Pipeline,
)

# Sample documents
documents = [
    "This is an example document! It has punctuation & numbers: 123.",
    "Natural Language Processing is AMAZING!!!",
    "Preprocessing text is very important for NLP tasks.",
]

# Create preprocessing components
cleaner = TextCleaner(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
    strip_whitespace=True,
)

tokenizer = Tokenizer(method="word")

normalizer = Normalizer(
    expand_contractions=True,
    fix_unicode=True,
)

stopword_filter = StopwordFilter(language="english")

lemmatizer = Lemmatizer(method="wordnet")

vectorizer = Vectorizer(
    method="tfidf",
    max_features=5000,
    ngram_range=(1, 2),
)

# Build pipeline
preprocessing_pipeline = Pipeline([
    cleaner,
    normalizer,
    tokenizer,
    stopword_filter,
    lemmatizer,
    vectorizer,
])

# Run preprocessing
processed_data = preprocessing_pipeline.fit_transform(documents)

# Inspect output
print("Processed Features Shape:", processed_data.shape)
print("Sample Vector:", processed_data[0])
```
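For intuition about what the `method="tfidf"` vectorizer computes, TF-IDF weighting can be sketched in plain Python. This is an illustrative toy only, not bm-preprocessing's actual implementation:

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: term frequency times inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([
            (tf[t] / len(doc)) * math.log(n / df[t]) for t in vocab
        ])
    return vocab, vectors

vocab, vecs = tfidf(["nlp is fun", "nlp is hard"])
```

Note how terms appearing in every document (here `nlp` and `is`) get weight 0, since log(n/n) = 0; only distinguishing terms carry weight.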
🧩 Step-by-Step Processing (Without Pipeline)
You can also run each step manually:
```python
from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer,
)

docs = [
    "Machine learning is fun!",
    "Text preprocessing improves results.",
]

# Initialize tools
cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
stopwords = StopwordFilter("english")
lemmatizer = Lemmatizer()
vectorizer = Vectorizer(method="bow")

# Run each stage manually
cleaned = [cleaner.clean(d) for d in docs]
tokens = [tokenizer.tokenize(d) for d in cleaned]
filtered = [stopwords.remove(t) for t in tokens]
lemmatized = [lemmatizer.lemmatize(t) for t in filtered]
vectors = vectorizer.fit_transform(lemmatized)

print(vectors)
```
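The `method="bow"` vectorizer produces bag-of-words counts. As a conceptual sketch (again a toy, not the package's code), counting tokens against a sorted vocabulary looks like this:

```python
from collections import Counter

def bow(token_docs):
    """Toy bag-of-words: count each vocabulary term per document."""
    vocab = sorted({t for doc in token_docs for t in doc})
    return vocab, [[Counter(doc)[t] for t in vocab] for doc in token_docs]

vocab, vectors = bow([["machine", "learning", "fun"],
                      ["text", "preprocessing", "result"]])
```

Each document becomes a fixed-length vector of term counts, with one column per vocabulary entry.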
🛠️ Components Overview
| Component | Description |
|---|---|
| `TextCleaner` | Removes noise and formats text |
| `Tokenizer` | Splits text into tokens |
| `Normalizer` | Standardizes text |
| `StopwordFilter` | Removes common filler words |
| `Lemmatizer` | Converts words to their base form |
| `Vectorizer` | Converts text to numeric features |
| `Pipeline` | Chains components into a workflow |
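The core idea behind `Pipeline` — each component's output feeding the next — can be sketched generically. This toy uses plain callables for brevity and is a conceptual illustration, not the package's internals:

```python
class ToyPipeline:
    """Toy pipeline: apply each step in order, threading data through."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, data):
        for step in self.steps:
            data = step(data)
        return data

pipe = ToyPipeline([str.lower, str.split])
result = pipe.fit_transform("Hello NLP World")
# → ['hello', 'nlp', 'world']
```

Chaining this way keeps each stage small and testable, while the pipeline object owns the ordering.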
🧠 Deep Learning Preparation Example
For sequence models:
```python
from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    SequencePadder,
    VocabularyBuilder,
)

texts = [
    "Deep learning for NLP",
    "Transformers are powerful",
]

cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
vocab = VocabularyBuilder(max_size=10000)
padder = SequencePadder(max_length=50)

# Clean
cleaned = [cleaner.clean(t) for t in texts]

# Tokenize
tokens = [tokenizer.tokenize(t) for t in cleaned]

# Build vocabulary
vocab.fit(tokens)

# Encode tokens as integer IDs
encoded = [vocab.encode(t) for t in tokens]

# Pad sequences to a fixed length
padded = padder.pad(encoded)

print(padded)
```
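The padding step exists because sequence models expect fixed-length inputs. The underlying logic — truncate, then right-pad with a fill value — can be sketched in a few lines (illustrative only, not `SequencePadder` itself):

```python
def pad_sequences(seqs, max_length, pad_value=0):
    """Truncate each sequence to max_length, then right-pad with pad_value."""
    return [
        seq[:max_length] + [pad_value] * (max_length - len(seq[:max_length]))
        for seq in seqs
    ]

padded = pad_sequences([[4, 7, 2], [9]], max_length=4)
# → [[4, 7, 2, 0], [9, 0, 0, 0]]
```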
📚 Requirements
- Python 3.8+
- nltk
- pandas
- scikit-learn (for vectorization)

Note: components backed by NLTK (such as `StopwordFilter` and `Lemmatizer`) may also need NLTK data, e.g. the `stopwords` and `wordnet` corpora; if they are missing, fetch them with `nltk.download("stopwords")` and `nltk.download("wordnet")`.

Dependencies are installed automatically with:

```bash
pip install bm-preprocessing
```
📂 Project Structure
```
bm_preprocessing/
│
├── cleaning.py
├── tokenization.py
├── normalization.py
├── filtering.py
├── lemmatization.py
├── vectorization.py
├── pipeline.py
└── __init__.py
```
🤝 Contributing
Contributions are welcome!
- Fork the repository
- Create a new branch
- Commit your changes
- Open a pull request
📄 License
This project is licensed under the MIT License.
📬 Support
If you encounter any issues or have feature requests, please open an issue on GitHub.
Download files
File details

Details for the file `bm_preprocessing-1.4.9.tar.gz`.

File metadata
- Download URL: bm_preprocessing-1.4.9.tar.gz
- Upload date:
- Size: 54.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a67671ebaa956a52a7bee916d6e3453f92c117084d1c3d1f6e9a1a36e84a9711` |
| MD5 | `512a875b67bd26b98eddc58522c7e7b7` |
| BLAKE2b-256 | `ebe4e3cc8ef1063be93d203ecd55f2ff54ab278e300b787ba7bd722b36946afa` |
File details

Details for the file `bm_preprocessing-1.4.9-py3-none-any.whl`.

File metadata
- Download URL: bm_preprocessing-1.4.9-py3-none-any.whl
- Upload date:
- Size: 66.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e480c7d83369c1d3bde761e46fd9e1cc46d0f66752d13d7088d56c7e22e9dcbf` |
| MD5 | `5b9ac6b72ef2af215535bea3fef63bf1` |
| BLAKE2b-256 | `3ae3593c8638dd4c4f7d1a8384f72fed1f4f491eca8b8b8d7f2fc217f827e45f` |
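To verify a downloaded archive against the published SHA256 digest, Python's standard-library `hashlib` is enough (the filename below is an example; point it at wherever pip or your browser saved the file):

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Stream the file in chunks so large archives never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. sha256_of("bm_preprocessing-1.4.9.tar.gz") should match the digest listed above
```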