Text Data Cleaning

Project Description

  • This package cleans text data by removing HTML tags, URLs, NLTK stopwords, numbers, and punctuation.

Features

  • Remove URLs
  • Remove HTML Tags
  • Remove NLTK Stopwords
  • Remove Numbers
  • Remove Punctuation
  • Remove Additional Spaces
  • Convert to Lower Case

Installation

  • In a code notebook (for example, an IPYNB file), use the command below:

    !pip install py-text-data-clean
    
  • If installing from Anaconda Prompt or a CMD terminal, use the command below:

    pip install py-text-data-clean
    
  • Note:

 Check whether the installed package version is up to date. If it is not, upgrade it.

 # To check the installed version, run the command below
 !pip show py-text-data-clean

 # To upgrade the package, run the command below
 !pip install py-text-data-clean -U
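
The version check above can also be done from Python code. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Look up the distribution name used in the pip commands above.
found = installed_version("py-text-data-clean")
print(found if found is not None else "py-text-data-clean is not installed")
```

Comparing the printed value with the latest release listed on PyPI shows whether an upgrade is needed.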

Usage

Input:

 - A list of text strings. Example: ["Is the   time 12 Noon now, isn't it?", "It is a python link: https://pypi.org/"]

Output:

 - ['time noon', 'python link']

Code to clean text with a single function:

# Import the library
from pytextdataclean import textclean as tc
input_text_list = ["Is the   time 12 Noon now, isn't it?", "It is a python link: https://pypi.org/"]
result = tc.text_clean(data=input_text_list)
print(result)
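
To make the listed cleaning steps concrete, here is a standard-library sketch of a comparable pipeline. It is not this package's implementation: the function names, the regular expressions, the order of the steps, and the tiny `STOPWORDS` set (a stand-in for the NLTK English stopword list) are all assumptions chosen for illustration.

```python
import re
import string

# Hypothetical stand-in for the NLTK English stopword list (assumption).
STOPWORDS = {"is", "the", "now", "isn't", "it", "a"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
DIGIT_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_one(text):
    """Clean a single string with the steps named in the feature list."""
    text = URL_RE.sub(" ", text)   # remove URLs
    text = TAG_RE.sub(" ", text)   # remove HTML tags
    text = text.lower()            # convert to lower case
    # Trim punctuation from word edges so "now," matches "now",
    # then drop stopwords.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    text = " ".join(words)
    text = DIGIT_RE.sub(" ", text)       # remove numbers
    text = text.translate(PUNCT_TABLE)   # remove remaining punctuation
    return " ".join(text.split())        # collapse additional spaces


def clean_all(texts):
    """Apply clean_one to every string in a list."""
    return [clean_one(t) for t in texts]


sample = ["Is the   time 12 Noon now, isn't it?", "It is a python link: https://pypi.org/"]
print(clean_all(sample))  # ['time noon', 'python link']
```

With the real NLTK list the set of dropped words is larger, so results on other inputs can differ from this sketch.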

Code to use each available feature individually:

# Pass the list of text

# Example list:
input_text_list = ["Is the   time 12 Noon now, isn't it?", "It is a python link: https://pypi.org/"]

# Import the library
from pytextdataclean import textclean as tc

# To remove HTML tags
tc.remove_html_tags(data=input_text_list)

# To remove NLTK stop words
tc.remove_nltk_stopwords(data=input_text_list)

# To remove URLs
tc.remove_url(data=input_text_list)

# To remove punctuation
tc.remove_punctuation(data=input_text_list)

# To remove numerical digits
tc.remove_digits(data=input_text_list)

# To remove foreign languages
tc.remove_foreign_languages(data=input_text_list)

# To remove additional spaces
tc.remove_spaces(data=input_text_list)
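
When only some of the steps are needed, they can be chained so that the output of one becomes the input of the next. The sketch below shows that pattern with hypothetical stand-in functions (`strip_urls`, `strip_digits`, `squeeze_spaces`) and a hypothetical helper `apply_steps`. It assumes each step takes a list of strings and returns a list of strings, which the calls above suggest but do not state for this package's functions.

```python
import re
from functools import reduce


def apply_steps(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, data)


# Hypothetical stand-ins with a list-in, list-out shape (assumption).
def strip_urls(data):
    return [re.sub(r"https?://\S+", "", t) for t in data]


def strip_digits(data):
    return [re.sub(r"\d+", "", t) for t in data]


def squeeze_spaces(data):
    return [" ".join(t.split()) for t in data]


sample = ["Is the   time 12 Noon now, isn't it?", "It is a python link: https://pypi.org/"]
cleaned = apply_steps(sample, [strip_urls, strip_digits, squeeze_spaces])
print(cleaned)  # ["Is the time Noon now, isn't it?", 'It is a python link:']
```

If the package's functions do return lists, the same helper could in principle take them in place of the stand-ins, for example by wrapping each as `lambda d: tc.remove_url(data=d)`.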

Download files

Source Distribution

py_text_data_clean-0.0.4.tar.gz (3.9 kB)

Uploaded Source

Built Distribution

py_text_data_clean-0.0.4-py3-none-any.whl (4.3 kB)

Uploaded Python 3

File details

Details for the file py_text_data_clean-0.0.4.tar.gz.

File metadata

  • Download URL: py_text_data_clean-0.0.4.tar.gz
  • Upload date:
  • Size: 3.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for py_text_data_clean-0.0.4.tar.gz
Algorithm     Hash digest
SHA256        f0001391b8a54e090ec28ff34ac9d8ec3e9656af5f508e90edc0a427baeb0dd4
MD5           3e3f3af7a6d007ea6c436e6766d06e18
BLAKE2b-256   c6955d0c985f0c306ce5704e8bf839a0b3b03f585c582bdb4778aa1cf63b2d45

File details

Details for the file py_text_data_clean-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: py_text_data_clean-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 4.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for py_text_data_clean-0.0.4-py3-none-any.whl
Algorithm     Hash digest
SHA256        47736b1510b5407d2babb3bc2ee43e2a9db716d402b48798935a67da761a70d3
MD5           e3b0fe4aa2d75f60f6a47b45f40322ab
BLAKE2b-256   7a09ac355a74ebb8866029fdf0e4edb28ab03e0b5d1eb85297a6843afde5aace
