A simple package for creating elegant NLP pipelines using sklearn.
Text cleaning Pipeline
Description
This code contains a pipeline for pre-processing text data for sentiment analysis. It includes steps for removing stop words, stripping HTML tags, changing letter case, and removing punctuation. Future releases are planned to add text transformations such as word embedding and word vectorization.
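To make these steps concrete, here is a minimal plain-Python sketch of the four cleaning operations. It does not use pippi and is not how the package implements them; the function names and the tiny stop-word list are assumptions chosen for illustration only.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); real pipelines use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of"}


def remove_html_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag (e.g. <p>, <br />) with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    """Drop stop words, comparing case-insensitively, and normalize whitespace."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def clean(text: str) -> str:
    """Apply the four steps in a fixed order: tags, case, punctuation, stop words."""
    text = remove_html_tags(text)
    text = text.lower()
    text = remove_punctuation(text)
    return remove_stop_words(text)


print(clean("<p>The movie was GREAT, and the cast is superb!</p>"))
# prints: movie great cast superb
```

The order matters: stripping tags before punctuation keeps the angle brackets intact long enough for the tag pattern to match.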
Example
Elegant data pipelines are a key component of any data science project. They let you automate the process of cleaning, transforming, and analyzing data. This code is a simple example of how to create a pipeline for text data using custom transformers and the sklearn Pipeline class.
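from pippi import (
    TransformLettersSize,
    RemoveStopWords,
    Lemmatize,  # exported by the package, but not used in this example
    RemovePunctuation,
    RemoveHTMLTags,
)
from sklearn.pipeline import Pipeline
import pandas as pd

# Placeholder input: substitute any DataFrame with "review" and "sentiment" text columns.
df = pd.DataFrame(
    {
        "review": ["<p>This movie was great!</p>"],
        "sentiment": ["positive"],
    }
)

# Each step is a (name, transformer) pair; steps run in the order listed.
pipeline = Pipeline(
    steps=[
        ("remove_stop_words", RemoveStopWords(columns=["review", "sentiment"])),
        ("remove_html_tags", RemoveHTMLTags(columns=df.columns.to_list())),
        ("uppercase_letters", TransformLettersSize(columns=["sentiment"], case_transform="upper")),
        ("remove_punctuation", RemovePunctuation(columns=["review"])),
    ]
)

# Run every step and rebuild a DataFrame from the result.
output = pipeline.fit_transform(df)
df = pd.DataFrame(output, columns=["review", "sentiment"])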
Pipeline Visualization:
[RemoveStopWords] -> [RemoveHTMLTags] -> [TransformLettersSize] -> [RemovePunctuation]
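For readers curious how a column-wise custom transformer can plug into the sklearn Pipeline class, the following is a minimal sketch built on sklearn's `BaseEstimator` and `TransformerMixin`. The classes `ChangeCase` and `StripPunctuation` are hypothetical stand-ins written for this illustration; they are not pippi's classes and may differ from how the package is actually implemented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ChangeCase(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: upper- or lower-case the given text columns."""

    def __init__(self, columns, case_transform="lower"):
        self.columns = columns
        self.case_transform = case_transform

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        for col in self.columns:
            if self.case_transform == "upper":
                X[col] = X[col].str.upper()
            else:
                X[col] = X[col].str.lower()
        return X


class StripPunctuation(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: remove non-word, non-space characters."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].str.replace(r"[^\w\s]", "", regex=True)
        return X


demo = pd.DataFrame({"review": ["Great film!!!"], "sentiment": ["positive"]})

sketch = Pipeline(
    steps=[
        ("uppercase_letters", ChangeCase(columns=["sentiment"], case_transform="upper")),
        ("remove_punctuation", StripPunctuation(columns=["review"])),
    ]
)

result = sketch.fit_transform(demo)
print(result)
```

Because each transformer returns a DataFrame, the next step can keep addressing columns by name, which is what makes the `columns=[...]` argument style workable across the whole pipeline.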
Hashes for pippi_lang-0.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a3f510d06413b43a8a8a5ec7d1702ce608593bbe8c355bfd018565352ca5e79b
MD5 | 60f5a0cb1d08db5922d3c0e4c24076cb
BLAKE2b-256 | 279b916682ccd746db633150d8454d62e50d4d2096464afd52ab2535d94d3e87