A package to preprocess text data

📄 `bm-preprocessing`

`bm-preprocessing` is a Python package of NLP preprocessing utilities built on NLTK and pandas, with scikit-learn used for vectorization. Its modular pipeline lets you clean, normalize, tokenize, and vectorize text data.

✨ Features

Text cleaning and normalization
Tokenization and stopword removal
Lemmatization
TF-IDF and Bag-of-Words vectorization
Pipeline-based preprocessing
Built on NLTK and pandas
Scikit-learn–style API

📦 Installation

Install from PyPI:

pip install bm-preprocessing

🚀 Quick Start

Basic Usage with Pipeline

from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    Normalizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer,
    Pipeline
)

# Sample documents
documents = [
    "This is an example document! It has punctuation & numbers: 123.",
    "Natural Language Processing is AMAZING!!!",
    "Preprocessing text is very important for NLP tasks."
]

# Create preprocessing components
cleaner = TextCleaner(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
    strip_whitespace=True
)

tokenizer = Tokenizer(method="word")

normalizer = Normalizer(
    expand_contractions=True,
    fix_unicode=True
)

stopword_filter = StopwordFilter(language="english")

lemmatizer = Lemmatizer(method="wordnet")

vectorizer = Vectorizer(
    method="tfidf",
    max_features=5000,
    ngram_range=(1, 2)
)

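# Build pipeline
# The normalizer runs first so contractions can be expanded while their
# apostrophes are still present; the cleaner strips punctuation afterwards.
preprocessing_pipeline = Pipeline([
    normalizer,
    cleaner,
    tokenizer,
    stopword_filter,
    lemmatizer,
    vectorizer
])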

# Run preprocessing
processed_data = preprocessing_pipeline.fit_transform(documents)

# Inspect output
print("Processed Features Shape:", processed_data.shape)
print("Sample Vector:", processed_data[0])
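
To make the chaining behaviour concrete, here is a minimal standard-library sketch of how a scikit-learn-style pipeline can pass data from one step to the next. It is not the package's implementation: `MiniPipeline`, `Lowercase`, and `WhitespaceTokenizer` are hypothetical names used only for illustration, and the real components may behave differently.

```python
class Lowercase:
    """Hypothetical step: lowercases every document."""

    def fit(self, docs):
        return self  # nothing to learn for this step

    def transform(self, docs):
        return [d.lower() for d in docs]


class WhitespaceTokenizer:
    """Hypothetical step: splits each document on whitespace."""

    def fit(self, docs):
        return self  # nothing to learn for this step

    def transform(self, docs):
        return [d.split() for d in docs]


class MiniPipeline:
    """Feeds the output of each step into the next one, in order."""

    def __init__(self, steps):
        self.steps = list(steps)

    def fit_transform(self, docs):
        data = docs
        for step in self.steps:
            # Fit the step on the current data, then transform it
            data = step.fit(data).transform(data)
        return data


pipeline = MiniPipeline([Lowercase(), WhitespaceTokenizer()])
result = pipeline.fit_transform(["Natural Language Processing", "NLP tasks"])
print(result)  # [['natural', 'language', 'processing'], ['nlp', 'tasks']]
```

Because each step only sees the output of the previous one, the order of the list matters: a step that expects strings must come before the tokenizer, and a step that expects token lists must come after it.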

🧩 Step-by-Step Processing (Without Pipeline)

You can also run each step manually:

from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer
)

docs = [
    "Machine learning is fun!",
    "Text preprocessing improves results."
]

# Initialize tools
cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
stopwords = StopwordFilter("english")
lemmatizer = Lemmatizer()
vectorizer = Vectorizer(method="bow")

# Process
cleaned = [cleaner.clean(d) for d in docs]
tokens = [tokenizer.tokenize(d) for d in cleaned]
filtered = [stopwords.remove(t) for t in tokens]
lemmatized = [lemmatizer.lemmatize(t) for t in filtered]

vectors = vectorizer.fit_transform(lemmatized)

print(vectors)
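
For readers unfamiliar with Bag-of-Words, the sketch below shows the idea behind `method="bow"` using only the standard library. It is an illustration under assumptions, not the package's code: `bag_of_words` is a hypothetical helper, the token lists are made up, and the real `Vectorizer` may return a different data structure (for example a sparse matrix).

```python
from collections import Counter


def bag_of_words(token_lists):
    """Return (vocabulary, count matrix) for a list of token lists."""
    # One column per distinct token, in alphabetical order
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)  # missing tokens count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical, already preprocessed token lists
token_lists = [
    ["machine", "learning", "fun"],
    ["learning", "improves", "result"],
]

vocabulary, matrix = bag_of_words(token_lists)
print(vocabulary)  # ['fun', 'improves', 'learning', 'machine', 'result']
print(matrix)      # [[1, 0, 1, 1, 0], [0, 1, 1, 0, 1]]
```

Each row is one document and each column is one vocabulary term, so the shared token `learning` is the only column with a non-zero count in both rows.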

🛠️ Components Overview

| Component | Description |
| --- | --- |
| `TextCleaner` | Removes noise and formats text |
| `Tokenizer` | Splits text into tokens |
| `Normalizer` | Standardizes text |
| `StopwordFilter` | Removes common filler words |
| `Lemmatizer` | Converts words to base form |
| `Vectorizer` | Converts text to numeric features |
| `Pipeline` | Chains components into a workflow |
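
As a rough illustration of what "removes noise and formats text" can mean, the sketch below mimics the four `TextCleaner` options shown in the Quick Start using only the standard library. `clean_text` is a hypothetical function, and the order in which the real `TextCleaner` applies its options is an assumption.

```python
import re
import string


def clean_text(text, lowercase=True, remove_punctuation=True,
               remove_numbers=True, strip_whitespace=True):
    """Apply simple cleaning rules to a single string."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Delete every ASCII punctuation character
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    if strip_whitespace:
        # Collapse runs of whitespace, then trim both ends
        text = re.sub(r"\s+", " ", text).strip()
    return text


cleaned = clean_text("This is an example document! It has punctuation & numbers: 123.")
print(cleaned)  # this is an example document it has punctuation numbers
```

Note that deleting punctuation also deletes apostrophes, so `"don't"` becomes `"dont"` in this sketch.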

🧠 Deep Learning Preparation Example

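For sequence models, clean and tokenize the text, build a vocabulary, encode the tokens as integer ids, and pad the sequences to a fixed length: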

from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    SequencePadder,
    VocabularyBuilder
)

texts = [
    "Deep learning for NLP",
    "Transformers are powerful"
]

cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
vocab = VocabularyBuilder(max_size=10000)
padder = SequencePadder(max_length=50)

# Clean
cleaned = [cleaner.clean(t) for t in texts]

# Tokenize
tokens = [tokenizer.tokenize(t) for t in cleaned]

# Build vocabulary
vocab.fit(tokens)

# Encode
encoded = [vocab.encode(t) for t in tokens]

# Pad
padded = padder.pad(encoded)

print(padded)
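
The encode-and-pad steps above can be sketched with the standard library as follows. This is a simplified illustration, not the package's implementation: `build_vocab`, `encode`, and `pad` are hypothetical functions, and reserving id 0 for padding and id 1 for unknown tokens is an assumption of this sketch.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids (an assumption of this sketch)


def build_vocab(token_lists, max_size=10000):
    """Map the most frequent tokens to integer ids starting at 2."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically for determinism
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, _ in ordered[:max_size]]
    return {tok: idx for idx, tok in enumerate(kept, start=2)}


def encode(tokens, vocab):
    """Replace each token with its id, or UNK if it is not in the vocabulary."""
    return [vocab.get(tok, UNK) for tok in tokens]


def pad(sequences, max_length):
    """Truncate or right-pad every sequence to exactly max_length ids."""
    return [seq[:max_length] + [PAD] * (max_length - len(seq)) for seq in sequences]


token_lists = [
    ["deep", "learning", "for", "nlp"],
    ["transformers", "are", "powerful"],
]

vocab = build_vocab(token_lists)
encoded = [encode(tokens, vocab) for tokens in token_lists]
padded = pad(encoded, max_length=5)
print(padded)  # [[3, 5, 4, 6, 0], [8, 2, 7, 0, 0]]
```

Padding to a fixed length lets the sequences be stacked into a single rectangular batch, which is what most sequence models expect as input.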

📚 Requirements

Python 3.8+
nltk
pandas
scikit-learn (for vectorization)

The dependencies are installed automatically when you install the package:

pip install bm-preprocessing
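
If you want to confirm that the dependencies are importable in your environment, a small standard-library check like the one below can help. `missing_modules` is a hypothetical helper written for this example; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names for the listed dependencies; scikit-learn is imported as "sklearn"
REQUIRED_MODULES = ["nltk", "pandas", "sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```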

📂 Project Structure

bm_preprocessing/
│
├── cleaning.py
├── tokenization.py
├── normalization.py
├── filtering.py
├── lemmatization.py
├── vectorization.py
├── pipeline.py
└── __init__.py

🤝 Contributing

Contributions are welcome!

1. Fork the repository
2. Create a new branch
3. Commit your changes
4. Open a pull request

📄 License

This project is licensed under the MIT License.

📬 Support

If you encounter any issues or have feature requests, please open an issue on GitHub.
