A Python package for cleaning text data by removing noise, stopwords, duplicates, and more.
PyCleanText
PyCleanText is a simple Python package designed to clean and preprocess text data. It removes unwanted noise from raw text by handling tasks like:
- Lowercasing text
- Removing URLs, punctuation, numbers, and special characters
- Removing stopwords (common words like "the", "a", "and", etc.)
- Stripping HTML tags
- Removing duplicate consecutive words
- Generating a cleaned text file
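The steps listed above can be sketched with the standard library alone. This is a hypothetical re-implementation for illustration, not PyCleanText's actual code: the function name `clean_text_sketch`, the tiny `STOPWORDS` set, the regular expressions, and the order of the steps are all assumptions.

```python
import html
import re

# Assumption: a tiny illustrative stopword set. The list PyCleanText
# actually uses is not documented here.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "it"}


def clean_text_sketch(raw: str) -> str:
    """Hypothetical sketch of the listed cleaning steps (not the package's code)."""
    text = raw.lower()                                  # lowercase
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = html.unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"(https?://|www\.)\S+", " ", text)   # remove URLs
    # Remove punctuation, digits, and special characters in one pass.
    # Note: this also drops non-ASCII letters, which may not suit every corpus.
    text = re.sub(r"[^a-z\s]", " ", text)

    # Remove stopwords, then collapse consecutive duplicate words.
    words = [w for w in text.split() if w not in STOPWORDS]
    deduped = []
    for word in words:
        if not deduped or deduped[-1] != word:
            deduped.append(word)
    return " ".join(deduped)


sample = "<p>Visit https://example.com NOW!!! The the cat cat sat on 3 mats &amp; slept.</p>"
print(clean_text_sketch(sample))
```

In this sketch, duplicates are collapsed after stopword removal, so "cat the cat" reduces to a single "cat". Running the deduplication first would keep both occurrences; which order the package uses is not stated.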
Features
- Comprehensive cleaning: Removes unwanted elements like URLs, special characters, and stopwords.
- Normalization: Converts text to lowercase and standardizes it for analysis.
- Duplicate word removal: Collapses consecutive repeated words (for example, "the the" becomes "the").
- File input and output: Load raw text from a file and save the cleaned text to a new file.
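The file input and output feature amounts to reading a file, transforming its contents, and writing the result elsewhere. A minimal sketch of that pattern follows. The helper `clean_file_sketch`, its `transform` parameter, and the `_cleaned` default output name are hypothetical; how PyCleanText names its output when no path is given is not documented here.

```python
from pathlib import Path
from typing import Callable, Optional, Union


def clean_file_sketch(
    input_path: Union[str, Path],
    output_path: Optional[Union[str, Path]] = None,
    transform: Callable[[str], str] = str.lower,
) -> Path:
    """Read input_path, apply transform, write the result, and return the output path."""
    src = Path(input_path)
    # Assumption: when no output path is given, write next to the input
    # using a "_cleaned" suffix. The package's real default is unknown.
    if output_path is not None:
        dst = Path(output_path)
    else:
        dst = src.with_name(src.stem + "_cleaned" + src.suffix)

    raw = src.read_text(encoding="utf-8")
    dst.write_text(transform(raw), encoding="utf-8")
    return dst
```

Passing the cleaning step as a callable keeps the file handling separate from the text processing, so either part can be swapped or tested independently.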
Installation
You can install PyCleanText directly from the Python Package Index (PyPI):
pip install PyCleanText
Usage
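from PyCleanText import PyCleanText

# Path to the raw text file, and where to save the cleaned text
file_path = 'input.txt'
output_file_path = 'cleaned_output.txt'

# Clean the file and write the result to output_file_path
PyCleanText(file_path, output_file_path)

Alternatively, pass only the input path:
PyCleanText(file_path)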
File details
Details for the file PyCleanText-0.1.0.tar.gz.
File metadata
- Download URL: PyCleanText-0.1.0.tar.gz
- Upload date:
- Size: 3.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9fc9df0c959f0504b577cd0bfb805f4364b982daa85dde1cff14a04d573f0b04 |
| MD5 | ee3597cacac3e80398627361f489edb1 |
| BLAKE2b-256 | 9410079059668d9341ac91951921d06804b777297c145867068376057526100b |
File details
Details for the file PyCleanText-0.1.0-py3-none-any.whl.
File metadata
- Download URL: PyCleanText-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6685dbb3755fe4a271c701d2e6a3e266cb6e409b03590de55c57823ccb0ff505 |
| MD5 | bc9f7083aee231719417fb4d31a9342f |
| BLAKE2b-256 | 9df0a31d5f42959ba3ad20e05f25ee9c416c84dc76d6284c2295b38be8345596 |