A Python package for cleaning text data by removing noise, stopwords, duplicates, and more.

PyCleanText

PyCleanText is a small Python package for cleaning and preprocessing text data. It strips noise from raw text by:

  • Lowercasing text
  • Removing URLs, punctuation, numbers, and special characters
  • Removing stopwords (common words like "the", "a", "and", etc.)
  • Stripping HTML tags
  • Removing duplicate consecutive words
  • Writing the cleaned text to an output file
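
The steps listed above can be sketched with the standard library alone. This is an illustration of the general technique, not PyCleanText's actual implementation: the function name `clean_text`, the tiny `STOPWORDS` set, the regular expressions, and the order of the steps are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "in"}


def clean_text(raw: str) -> str:
    """Illustrative cleaning pipeline (assumed design, not the package's code)."""
    text = raw.lower()                                    # lowercase everything
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation, digits, special characters

    # Remove stopwords, then collapse runs of the same word.
    words = [w for w in text.split() if w not in STOPWORDS]
    deduped = []
    for word in words:
        if not deduped or deduped[-1] != word:
            deduped.append(word)
    return " ".join(deduped)


print(clean_text("The <b>quick</b> quick brown fox visited https://example.com in 2024!!!"))
# prints: quick brown fox visited
```

Because stopwords are removed before duplicates are collapsed in this sketch, words that were separated only by a stopword also count as consecutive. Whether PyCleanText behaves the same way is not stated in its description.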

Features

  • Comprehensive cleaning: removes URLs, HTML tags, punctuation, numbers, special characters, and stopwords.
  • Normalization: converts text to lowercase so that words compare consistently during analysis.
  • Duplicate word removal: collapses consecutive repeated words (for example, "the the" becomes "the").
  • File input and output: reads raw text from a file and saves the cleaned text to a new file.
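
The file input and output feature can be pictured with a sketch like the one below. The helper `clean_file`, its `cleaner` parameter, and the `_cleaned` default suffix are assumptions for illustration only; the package description does not say where PyCleanText writes its output when no output path is given.

```python
from pathlib import Path
from typing import Callable, Optional


def clean_file(
    input_path: str,
    output_path: Optional[str] = None,
    cleaner: Callable[[str], str] = str.lower,  # stand-in for a real cleaning function
) -> Path:
    """Read raw text, apply `cleaner`, and write the result to a new file."""
    src = Path(input_path)
    # Assumed default: write next to the input file with a "_cleaned" suffix.
    if output_path:
        dst = Path(output_path)
    else:
        dst = src.with_name(f"{src.stem}_cleaned{src.suffix}")
    raw = src.read_text(encoding="utf-8")
    dst.write_text(cleaner(raw), encoding="utf-8")
    return dst
```

Returning the destination path lets the caller find the cleaned file even when the default name is used.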

Installation

You can install PyCleanText directly from the Python Package Index (PyPI):

pip install PyCleanText

Usage

from PyCleanText import PyCleanText

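file_path = 'input.txt'                  # raw text to clean
output_file_path = 'cleaned_output.txt'  # destination for the cleaned text

# Clean input.txt and save the result to cleaned_output.txt
PyCleanText(file_path, output_file_path)

Alternatively, pass only the input path:

PyCleanText(file_path)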

Download files

Source Distribution

PyCleanText-0.1.0.tar.gz (3.5 kB view details)

Uploaded Source

Built Distribution

PyCleanText-0.1.0-py3-none-any.whl (4.0 kB view details)

Uploaded Python 3

File details

Details for the file PyCleanText-0.1.0.tar.gz.

File metadata

  • Download URL: PyCleanText-0.1.0.tar.gz
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.5

File hashes

Hashes for PyCleanText-0.1.0.tar.gz
Algorithm      Hash digest
SHA256         9fc9df0c959f0504b577cd0bfb805f4364b982daa85dde1cff14a04d573f0b04
MD5            ee3597cacac3e80398627361f489edb1
BLAKE2b-256    9410079059668d9341ac91951921d06804b777297c145867068376057526100b
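
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `sha256_of` and the local file path in the commented usage lines are assumptions.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published for PyCleanText-0.1.0.tar.gz
EXPECTED = "9fc9df0c959f0504b577cd0bfb805f4364b982daa85dde1cff14a04d573f0b04"

# Usage, assuming the archive was downloaded to the current directory:
# if sha256_of("PyCleanText-0.1.0.tar.gz") != EXPECTED:
#     raise SystemExit("SHA256 mismatch: do not install this file")
```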

File details

Details for the file PyCleanText-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: PyCleanText-0.1.0-py3-none-any.whl
  • Size: 4.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.5

File hashes

Hashes for PyCleanText-0.1.0-py3-none-any.whl
Algorithm      Hash digest
SHA256         6685dbb3755fe4a271c701d2e6a3e266cb6e409b03590de55c57823ccb0ff505
MD5            bc9f7083aee231719417fb4d31a9342f
BLAKE2b-256    9df0a31d5f42959ba3ad20e05f25ee9c416c84dc76d6284c2295b38be8345596
