A package for cleaning and preprocessing text data

Project description

purifytext

purifytext is a flexible and customizable text preprocessing package designed to clean and prepare text data for natural language processing (NLP) tasks. It supports the following steps: removing HTML tags, URLs, emojis, punctuation, special characters, and numbers; lowercasing; normalizing whitespace; expanding contractions; removing stopwords; stemming; and lemmatizing.

Installation

You can install purifytext and its dependencies using pip:

pip install purifytext

This command will automatically install the required dependencies: numpy, pandas, nltk, beautifulsoup4, and contractions.

Additionally, ensure that you download the necessary NLTK data:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

Usage

Here's an example of how to use the purifytext package:

import pandas as pd
from purifytext import clean_text

# Example DataFrame
data = {
    'text_column': [
        '<p>Hello World! Visit http://example.com 😊</p>',
        'Python is awesome!!! 123456'
    ]
}

df = pd.DataFrame(data)

# Cleaning the text
df_cleaned = clean_text(df, 'text_column', remove_HTML=True, lowercase=True, remove_urls=True,
                        remove_emojis=True, remove_punctuation=True, remove_special_characters=True,
                        remove_numbers=True, remove_whitespace=True, expand_contractions=True,
                        remove_stopwords=True, stemming=False, lemmatizing=True)

print(df_cleaned)

Functions

text_lower(text): Lowercases the input text.
remove_punctuation(text): Removes punctuation from the text.
remove_number(text): Removes numbers from the text.
remove_whitespace(text): Removes extra whitespace from the text.
remove_contractions(text): Expands contractions in the text.
remove_HTML_tag(text): Removes HTML tags from the text.
remove_stopwords(text): Removes stopwords from the text.
lemmatize_words(text): Lemmatizes the words in the text.
stem_words(text): Stems the words in the text.
remove_urls(text): Removes URLs from the text.
remove_emojis(text): Removes emojis from the text.
remove_special_characters(text): Removes special characters from the text.
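
For intuition, a few of the less obvious steps listed above can be approximated with the standard library alone. This is a minimal sketch, not purifytext's actual implementation: the `sketch_` function names, the emoji code point ranges, and the ASCII-only definition of "special character" are all assumptions made for illustration, and the real package may behave differently (for example, it depends on beautifulsoup4 for HTML handling).

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def sketch_remove_html(text: str) -> str:
    """Drop tags and keep text; character references such as &amp; are decoded."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


# Assumed ranges: common pictographs, misc symbols/dingbats, regional indicators.
_EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]")


def sketch_remove_emojis(text: str) -> str:
    """Delete code points that fall in the assumed emoji ranges."""
    return _EMOJI_RE.sub("", text)


def sketch_remove_special_characters(text: str) -> str:
    """Keep ASCII letters, digits, and whitespace; this also strips accented letters."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)
```

Note that an ASCII-only rule like the one in `sketch_remove_special_characters` would also discard accented letters, so it may be too aggressive for non-English text.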

Cleaning Options

The clean_text function allows you to specify which cleaning steps to perform using the following parameters (default values are shown):

remove_HTML=True: Remove HTML tags.
lowercase=True: Lowercase the text.
remove_urls=True: Remove URLs.
remove_emojis=True: Remove emojis.
remove_punctuation=True: Remove punctuation.
remove_special_characters=True: Remove special characters.
remove_numbers=True: Remove numbers.
remove_whitespace=True: Remove extra whitespace.
expand_contractions=True: Expand contractions.
remove_stopwords=True: Remove stopwords.
stemming=False: Stem words.
lemmatizing=True: Lemmatize words.
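
One way to picture how such flags could work is as switches on an ordered list of steps. The sketch below illustrates that general pattern only and is not taken from the package's source: `clean_text_sketch`, its step order, and its regular expressions are hypothetical. It operates on a single string rather than a DataFrame column, and it covers only the steps that need no third-party data.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_sketch(
    text: str,
    *,
    remove_urls: bool = True,
    lowercase: bool = True,
    remove_punctuation: bool = True,
    remove_numbers: bool = True,
    remove_whitespace: bool = True,
) -> str:
    """Hypothetical single-string analogue of a flag-driven cleaner."""
    # (enabled, step) pairs, applied in this order. URLs are handled before
    # punctuation so that "http://..." is still recognizable as a URL.
    steps = [
        (remove_urls, lambda s: re.sub(r"(?:https?://|www\.)\S+", " ", s)),
        (lowercase, str.lower),
        (remove_punctuation, lambda s: s.translate(_PUNCT_TABLE)),
        (remove_numbers, lambda s: re.sub(r"\d+", " ", s)),
        (remove_whitespace, lambda s: " ".join(s.split())),
    ]
    for enabled, step in steps:
        if enabled:
            text = step(text)
    return text


print(clean_text_sketch("Python is awesome!!! 123456"))
print(clean_text_sketch("Python is awesome!!! 123456", remove_numbers=False))
```

The order of steps matters in a design like this. If punctuation were stripped first, `http://example.com` would collapse into `httpexamplecom`, which the URL pattern would no longer match.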

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

This package is developed by Aman Kumar Jha. For questions or suggestions, please contact vats.amankumarjha2002@gmail.com.

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

This version

0.1.0

Jul 17, 2024

Download files

Source Distribution

purifytext-0.1.0.tar.gz (4.4 kB)

Uploaded Jul 17, 2024 Source

File details

Details for the file purifytext-0.1.0.tar.gz.

File metadata

Download URL: purifytext-0.1.0.tar.gz
Upload date: Jul 17, 2024
Size: 4.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.4

File hashes

Hashes for purifytext-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0ab4e76d34d780e2bff91d30c2fe83db96cb85bc4bd8bbb4e26b332d0acbf68b`
MD5	`c7fbffb02ae6d03402eefebbc37e2432`
BLAKE2b-256	`4d74aafc549ee6bd23f7479d3f9468dafbac456294d8ebbfece44eb0fe9df88c`
