Python package providing easy-to-use evaluation metrics and utilities for Machine Learning

bm-eval-metrics

bm-eval-metrics is a Python package providing easy-to-use evaluation metrics and utilities for machine learning workflows.

Features

  • Text cleaning and normalization
  • Tokenization and stopword removal
  • Lemmatization
  • TF-IDF and Bag-of-Words vectorization
  • Pipeline-based preprocessing
  • Built on NLTK and pandas
  • Scikit-learn style API
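To make the vectorization bullet concrete, here is a minimal standard-library sketch of TF-IDF weighting. It does not use bm-eval-metrics: the function name `tfidf` is hypothetical, and the smoothed IDF formula (the one scikit-learn applies by default, shown here without the final normalization step) is an assumption for illustration. The package's own Vectorizer may compute weights differently.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {term: weight} dict per tokenized document (illustrative only)."""
    n_docs = len(tokenized_docs)

    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in df.items()
    }

    # Weight = raw term count in the document times the term's IDF.
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights.append({term: freq * idf[term] for term, freq in tf.items()})
    return weights


docs = [["text", "preprocessing", "matters"], ["text", "cleaning"]]
weights = tfidf(docs)

# "text" appears in every document, so it gets the lowest possible weight;
# "cleaning" appears in only one document and is weighted higher.
print(weights[1])
```

A Bag-of-Words vector keeps only the raw counts (the `tf` values above); TF-IDF additionally down-weights terms that occur in many documents.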

Installation

Install from PyPI:

pip install bm-eval-metrics

Quick Start

Basic Usage With Pipeline

from bm_eval_metrics import (
    TextCleaner,
    Tokenizer,
    Normalizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer,
    Pipeline,
)

# Sample documents
documents = [
    "This is an example document! It has punctuation and numbers: 123.",
    "Natural Language Processing is AMAZING!!!",
    "Preprocessing text is very important for NLP tasks.",
]

# Create preprocessing components
cleaner = TextCleaner(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
    strip_whitespace=True,
)

tokenizer = Tokenizer(method="word")

normalizer = Normalizer(
    expand_contractions=True,
    fix_unicode=True,
)

stopword_filter = StopwordFilter(language="english")
lemmatizer = Lemmatizer(method="wordnet")

vectorizer = Vectorizer(
    method="tfidf",
    max_features=5000,
    ngram_range=(1, 2),
)

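# Build pipeline
# Normalize before cleaning: contractions such as "don't" can only be
# expanded while the apostrophe is still present, and the cleaner
# removes punctuation.
preprocessing_pipeline = Pipeline(
    [
        normalizer,
        cleaner,
        tokenizer,
        stopword_filter,
        lemmatizer,
        vectorizer,
    ]
)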

# Run preprocessing
processed_data = preprocessing_pipeline.fit_transform(documents)

# Inspect output
print("Processed features shape:", processed_data.shape)
print("Sample vector:", processed_data[0])
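For readers unfamiliar with the scikit-learn style of chaining, the following is a minimal sketch of what a `fit_transform` pipeline does conceptually. Every class here (`Lowercase`, `WhitespaceTokenizer`, `CountVectors`, `MiniPipeline`) is a hypothetical stand-in written for this example; none of it is the package's actual implementation.

```python
class Lowercase:
    """Stateless step: nothing to learn in fit()."""

    def fit(self, docs):
        return self

    def transform(self, docs):
        return [d.lower() for d in docs]


class WhitespaceTokenizer:
    """Stateless step: split each document on whitespace."""

    def fit(self, docs):
        return self

    def transform(self, docs):
        return [d.split() for d in docs]


class CountVectors:
    """Stateful step: learns a vocabulary in fit(), counts terms in transform()."""

    def fit(self, docs):
        self.vocabulary_ = sorted({tok for doc in docs for tok in doc})
        return self

    def transform(self, docs):
        return [[doc.count(term) for term in self.vocabulary_] for doc in docs]


class MiniPipeline:
    """Feeds the output of each step into the next one."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, docs):
        data = docs
        for step in self.steps:
            data = step.fit(data).transform(data)
        return data

    def transform(self, docs):
        # Reuse what was learned during fit_transform (here: the vocabulary).
        data = docs
        for step in self.steps:
            data = step.transform(data)
        return data


pipeline = MiniPipeline([Lowercase(), WhitespaceTokenizer(), CountVectors()])
vectors = pipeline.fit_transform(["NLP is fun", "Fun fun NLP"])
print(vectors)  # [[1, 1, 1], [2, 0, 1]] for the vocabulary ['fun', 'is', 'nlp']
```

The split between `fit_transform` and `transform` matters for the last step: new documents should be encoded with the vocabulary learned from the training documents, so call `transform` on them rather than fitting again.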

Step-by-Step Processing Without Pipeline

from bm_eval_metrics import (
    TextCleaner,
    Tokenizer,
    StopwordFilter,
    Lemmatizer,
    Vectorizer,
)

docs = [
    "Machine learning is fun!",
    "Text preprocessing improves results.",
]

# Initialize tools
cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
stopwords = StopwordFilter("english")
lemmatizer = Lemmatizer()
vectorizer = Vectorizer(method="bow")

# Process
cleaned = [cleaner.clean(d) for d in docs]
tokens = [tokenizer.tokenize(d) for d in cleaned]
filtered = [stopwords.remove(t) for t in tokens]
lemmatized = [lemmatizer.lemmatize(t) for t in filtered]

vectors = vectorizer.fit_transform(lemmatized)
print(vectors)

Components Overview

Component        Description
TextCleaner      Removes noise and formats text
Tokenizer        Splits text into tokens
Normalizer       Standardizes text
StopwordFilter   Removes common filler words
Lemmatizer       Converts words to base form
Vectorizer       Converts text to numeric features
Pipeline         Chains components into a workflow
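As a rough illustration of what the first few components in the table do to a piece of text, here is a standard-library sketch. The helper functions and the tiny `STOPWORDS` set are assumptions made for this example, not the package's code or NLTK's stopword list. Lemmatization is left out because it needs a dictionary such as WordNet.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, far shorter than NLTK's).
STOPWORDS = {"this", "is", "an", "it", "has", "and"}


def clean(text):
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into word tokens."""
    return text.split()


def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


sample = "This is an example document! It has punctuation and numbers: 123."
cleaned = clean(sample)
tokens = remove_stopwords(tokenize(cleaned))

print(cleaned)  # this is an example document it has punctuation and numbers
print(tokens)   # ['example', 'document', 'punctuation', 'numbers']
```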

Deep Learning Preparation Example

from bm_eval_metrics import (
    TextCleaner,
    Tokenizer,
    SequencePadder,
    VocabularyBuilder,
)

texts = [
    "Deep learning for NLP",
    "Transformers are powerful",
]

cleaner = TextCleaner(lowercase=True)
tokenizer = Tokenizer()
vocab = VocabularyBuilder(max_size=10000)
padder = SequencePadder(max_length=50)

# Clean
cleaned = [cleaner.clean(t) for t in texts]

# Tokenize
tokens = [tokenizer.tokenize(t) for t in cleaned]

# Build vocabulary
vocab.fit(tokens)

# Encode
encoded = [vocab.encode(t) for t in tokens]

# Pad
padded = padder.pad(encoded)

print(padded)
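The encode and pad steps are easier to follow with the intermediate values visible. Below is a standard-library sketch of the idea. The reserved ids (`0` for padding, `1` for unknown tokens), the `<pad>` and `<unk>` names, the truncation behavior, and the helper functions are all assumptions for illustration; VocabularyBuilder and SequencePadder may use different conventions.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids (assumed convention)


def build_vocab(token_lists, max_size=10000):
    """Map the most frequent tokens to integer ids, after the reserved ones."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its id; unseen tokens map to UNK."""
    return [vocab.get(tok, UNK) for tok in tokens]


def pad(sequences, max_length):
    """Truncate long sequences and right-pad short ones with PAD."""
    return [
        seq[:max_length] + [PAD] * (max_length - len(seq))
        for seq in sequences
    ]


tokens = [
    ["deep", "learning", "for", "nlp"],
    ["transformers", "are", "powerful"],
]
vocab = build_vocab(tokens)
encoded = [encode(t, vocab) for t in tokens]
padded = pad(encoded, max_length=6)

# Every row now has the same length, so the result can be stacked into a tensor.
for row in padded:
    print(row)
```

Fixed-length rows are what most deep learning frameworks expect as input, which is why padding follows encoding.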

Requirements

  • Python 3.11+
  • nltk
  • pandas
  • scikit-learn

These dependencies are installed automatically with the package:

pip install bm-eval-metrics

Project Structure

bm-eval-metrics/
├── cleaning.py
├── tokenization.py
├── normalization.py
├── filtering.py
├── lemmatization.py
├── vectorization.py
├── pipeline.py
└── __init__.py

Contributing

Contributions are welcome.

  1. Fork the repository.
  2. Create a new branch.
  3. Commit your changes.
  4. Open a pull request.

License

This project is licensed under the MIT License.

Support

If you encounter issues or have feature requests, open an issue on GitHub.

Download files

Source Distribution

bm_eval_metrics-1.5.6.tar.gz (74.1 kB)

Uploaded Source

Built Distribution

bm_eval_metrics-1.5.6-py3-none-any.whl (90.9 kB)

Uploaded Python 3

File details

Details for the file bm_eval_metrics-1.5.6.tar.gz.

File metadata

  • Download URL: bm_eval_metrics-1.5.6.tar.gz
  • Upload date:
  • Size: 74.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.2

File hashes

Hashes for bm_eval_metrics-1.5.6.tar.gz
Algorithm     Hash digest
SHA256        17398e7bdaeed550f65f79382c7fead83c743ac1d1306d3a07ee7a0fb51b4993
MD5           5f34c61bd0489d4539cf095d3911e9dd
BLAKE2b-256   d2c87ef4e6b6bd7c1cb1d09842683e0d2ff02cf17d9878446931a71cb5d1fb01

File details

Details for the file bm_eval_metrics-1.5.6-py3-none-any.whl.

File metadata

  • Download URL: bm_eval_metrics-1.5.6-py3-none-any.whl
  • Upload date:
  • Size: 90.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.2

File hashes

Hashes for bm_eval_metrics-1.5.6-py3-none-any.whl
Algorithm     Hash digest
SHA256        00ca9a0c846e0e26b48cec5e08818cdeb1ca3f33d7e7629281c7c43c3ef57873
MD5           26b55795252d04f5ffb1d8a0fadcd9b3
BLAKE2b-256   f37b56e53011b7acc42cba159f134d73270dc577f46a0e7167d68d9a0e120a60
