A package to evaluate bm metrics

📄 bm-eval-metrics

bm-eval-metrics is a Python package that provides easy-to-use evaluation metrics and utilities for machine learning. It helps you access and view source code for various ML algorithms efficiently.

✨ Features

  • Text cleaning and normalization
  • Tokenization and stopword removal
  • Lemmatization
  • TF-IDF and Bag-of-Words vectorization
  • Pipeline-based preprocessing
  • Built on NLTK and pandas
  • Scikit-learn–style API

📦 Installation

Install from PyPI:

pip install bm-preprocessing

🚀 Quick Start

Basic Usage with Pipeline

from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    Normalizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer,
    Pipeline
)

# Sample documents
documents = [
    "This is an example document! It has punctuation & numbers: 123.",
    "Natural Language Processing is AMAZING!!!",
    "Preprocessing text is very important for NLP tasks."
]

# Create preprocessing components
cleaner = TextCleaner(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
    strip_whitespace=True
)

tokenizer = Tokenizer(method="word")

normalizer = Normalizer(
    expand_contractions=True,
    fix_unicode=True
)

stopword_filter = StopwordFilter(language="english")

lemmatizer = Lemmatizer(method="wordnet")

vectorizer = Vectorizer(
    method="tfidf",
    max_features=5000,
    ngram_range=(1, 2)
)

# Build pipeline
preprocessing_pipeline = Pipeline([
    cleaner,
    normalizer,
    tokenizer,
    stopword_filter,
    lemmatizer,
    vectorizer
])

# Run preprocessing
processed_data = preprocessing_pipeline.fit_transform(documents)

# Inspect output
print("Processed Features Shape:", processed_data.shape)
print("Sample Vector:", processed_data[0])
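To make the idea of chaining steps concrete, here is a minimal standalone sketch of a pipeline that passes each step's output to the next. It uses only the standard library. The class names (`LowercaseCleaner`, `WhitespaceTokenizer`, `SimplePipeline`) are hypothetical and written for illustration; this is not the package's implementation.

```python
import re
import string


class LowercaseCleaner:
    """Hypothetical step: lowercase text and drop punctuation and digits."""

    def transform(self, docs):
        table = str.maketrans("", "", string.punctuation + string.digits)
        # Collapse the whitespace runs left behind by removed characters.
        return [re.sub(r"\s+", " ", d.lower().translate(table)).strip() for d in docs]


class WhitespaceTokenizer:
    """Hypothetical step: split each document on whitespace."""

    def transform(self, docs):
        return [d.split() for d in docs]


class SimplePipeline:
    """Apply each step's transform() in order, feeding output to the next step."""

    def __init__(self, steps):
        self.steps = steps

    def transform(self, docs):
        data = docs
        for step in self.steps:
            data = step.transform(data)
        return data


pipeline = SimplePipeline([LowercaseCleaner(), WhitespaceTokenizer()])
result = pipeline.transform(["Natural Language Processing is AMAZING!!!"])
print(result)
```

The order of steps matters: the tokenizer in this sketch expects cleaned strings, so the cleaner has to run first.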

🧩 Step-by-Step Processing (Without Pipeline)

You can also run each step manually:

from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer
)

docs = [
    "Machine learning is fun!",
    "Text preprocessing improves results."
]

# Initialize tools
cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
stopwords = StopwordFilter("english")
lemmatizer = Lemmatizer()
vectorizer = Vectorizer(method="bow")

# Process
cleaned = [cleaner.clean(d) for d in docs]
tokens = [tokenizer.tokenize(d) for d in cleaned]
filtered = [stopwords.remove(t) for t in tokens]
lemmatized = [lemmatizer.lemmatize(t) for t in filtered]

vectors = vectorizer.fit_transform(lemmatized)

print(vectors)
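For intuition about what a bag-of-words vectorizer produces, the sketch below builds a count matrix with only the standard library. The function name `bag_of_words` is hypothetical, and the token lists are hand-written stand-ins, not actual output of the package's filtering or lemmatization steps.

```python
from collections import Counter


def bag_of_words(token_lists):
    """Return (vocabulary, count_matrix) for a list of token lists."""
    # A sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


token_lists = [
    ["machine", "learning", "fun"],
    ["text", "preprocessing", "improves", "results"],
]
vocabulary, matrix = bag_of_words(token_lists)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary term, so documents of different lengths map to vectors of the same size.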

🛠️ Components Overview

Component        Description
TextCleaner      Removes noise and formats text
Tokenizer        Splits text into tokens
Normalizer       Standardizes text
StopwordFilter   Removes common filler words
Lemmatizer       Converts words to base form
Vectorizer       Converts text to numeric features
Pipeline         Chains components into a workflow
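As one illustration of how text can become numeric features, the sketch below computes TF-IDF weights with the standard library. It assumes a plain textbook variant (term frequency divided by document length, times log(N / document frequency)); smoothed variants such as scikit-learn's give different numbers, and the package's own weighting is not documented here. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(token_lists):
    """Return (vocabulary, rows) using tf = count/len(doc), idf = log(N/df)."""
    n_docs = len(token_lists)
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        rows.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocabulary
        ])
    return vocabulary, rows


docs = [["text", "cleaning"], ["text", "vectorization"]]
vocabulary, rows = tfidf(docs)
print(vocabulary)
for row in rows:
    print(row)
```

With this variant, a term that appears in every document (here "text") gets weight zero, while terms unique to one document get the highest weight.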

🧠 Deep Learning Preparation Example

For sequence models:

from bm_preprocessing import (
    TextCleaner,
    Tokenizer,
    SequencePadder,
    VocabularyBuilder
)

texts = [
    "Deep learning for NLP",
    "Transformers are powerful"
]

cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
vocab = VocabularyBuilder(max_size=10000)
padder = SequencePadder(max_length=50)

# Clean
cleaned = [cleaner.clean(t) for t in texts]

# Tokenize
tokens = [tokenizer.tokenize(t) for t in cleaned]

# Build vocabulary
vocab.fit(tokens)

# Encode
encoded = [vocab.encode(t) for t in tokens]

# Pad
padded = padder.pad(encoded)

print(padded)
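The encode-and-pad steps can be sketched without any dependencies. The example below is a minimal illustration under stated assumptions: ids 0 and 1 are reserved for padding and unknown tokens, ties in frequency are broken alphabetically, and sequences are padded on the right. The helper names (`build_vocab`, `encode`, `pad`) are hypothetical and may not match how `VocabularyBuilder` or `SequencePadder` behave.

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved ids for padding and unknown tokens


def build_vocab(token_lists, max_size=10000):
    """Map tokens to ids starting at 2, most frequent first, ties alphabetical."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))[:max_size]
    return {tok: i + 2 for i, tok in enumerate(ordered)}


def encode(tokens, vocab):
    """Replace each token with its id, or UNK if it is not in the vocabulary."""
    return [vocab.get(tok, UNK) for tok in tokens]


def pad(sequences, max_length):
    """Truncate or right-pad each sequence to exactly max_length ids."""
    return [seq[:max_length] + [PAD] * (max_length - len(seq)) for seq in sequences]


tokens = [
    ["deep", "learning", "for", "nlp"],
    ["transformers", "are", "powerful"],
]
vocab = build_vocab(tokens)
encoded = [encode(t, vocab) for t in tokens]
padded = pad(encoded, max_length=5)
print(padded)
```

Fixed-length integer sequences like these are the usual input shape for an embedding layer in a sequence model.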

📚 Requirements

  • Python 3.8+
  • nltk
  • pandas
  • scikit-learn (for vectorization)

Install dependencies automatically with:

pip install bm-preprocessing
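If you want to confirm that the listed requirements are present in your environment, a small check like the following may help. It is a sketch using only the standard library; it assumes the import names `nltk`, `pandas`, and `sklearn` (the import name of scikit-learn), and the helper `missing_modules` is hypothetical.

```python
import sys
from importlib.util import find_spec

# Import names assumed for the requirements listed above.
REQUIRED_MODULES = ["nltk", "pandas", "sklearn"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


python_ok = sys.version_info >= (3, 8)
missing = missing_modules(REQUIRED_MODULES)

print("Python 3.8+:", python_ok)
print("Missing modules:", missing if missing else "none")
```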

📂 Project Structure

bm_preprocessing/
│
├── cleaning.py
├── tokenization.py
├── normalization.py
├── filtering.py
├── lemmatization.py
├── vectorization.py
├── pipeline.py
└── __init__.py

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a new branch
  3. Commit your changes
  4. Open a pull request

📄 License

This project is licensed under the MIT License.


📬 Support

If you encounter any issues or have feature requests, please open an issue on GitHub.

