

Project description

Data Preprocessors

An easy-to-use tool for Data Preprocessing especially for Text Preprocessing


Installation

Install the latest stable release
For Windows

pip install -U data-preprocessors

For Linux/WSL2

pip3 install -U data-preprocessors
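
To verify the installation, you can import the package and call one of the functions documented below; this is only a minimal check, and the expected output follows from the Remove Punctuation example.

from data_preprocessors import text_preprocessor as tp

# If the import succeeds and the punctuation is stripped, the install works.
print(tp.remove_punc("Hello, world!"))  # expected output: Hello world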

Quick Start

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

>> bla bla bla bla

Features

Split Textfile

This function splits your text file into train, test, and validation sets, written to three separate text files. By changing the shuffle and seed values, you can randomly shuffle the lines of your text file before splitting.

from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)

# Total lines:  500
# Train set size:  300
# Validation set size:  100
# Test set size:  100

Separate Parallel Corpus

This function separates a parallel corpus file (src_tgt_file) into a separate source file (src_file) and target file (tgt_file), splitting each line on the given separator.

from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
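
For example, a call with concrete file names might look like the following (the file names are hypothetical, and each line of corpus.txt is assumed to contain a source and a target segment joined by |||):

from data_preprocessors import text_preprocessor as tp

# Hypothetical input: each line of corpus.txt looks like
# "source sentence ||| target sentence"
tp.separate_parallel_corpus(
    src_tgt_file="corpus.txt",
    separator="|||",
    src_file="corpus.src",
    tgt_file="corpus.tgt"
)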

Decontracting Words from Sentence

This function expands contracted words in a sentence (for example, forms such as "don't").

tp.decontracting_words(sentence)
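
A minimal usage sketch; the exact expansions produced depend on the library's internal mapping, so the output shown here is approximate:

from data_preprocessors import text_preprocessor as tp

sentence = "I can't go because I don't have time."
print(tp.decontracting_words(sentence))
# Approximate output: I can not go because I do not have time.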

Remove Punctuation

By using this function, you will be able to remove the punctuation from a single line of a text file.

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla

Space Punctuation

By using this function, you will be able to add a space on both sides of each punctuation mark so that the sentence is easier to tokenize. It applies to a single line of a text file, but if you want, you can apply it to a full text file (see apply_whole below).

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)

# bla ! bla - ? bla ? bla .

Text File to List

Converts any text file into a list.

mylist = tp.text2list(myfile_path="myfile.txt")

List to Text File

Converts any list into a text file (filename.txt).

tp.list2text(mylist=mylist, myfile_path="myfile.txt")
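
The two functions can be combined to read, clean, and rewrite a file. A minimal sketch (file names are hypothetical):

from data_preprocessors import text_preprocessor as tp

# Read the file into a list, remove punctuation from every line, write it back out.
lines = tp.text2list(myfile_path="myfile.txt")
cleaned = [tp.remove_punc(line) for line in lines]
tp.list2text(mylist=cleaned, myfile_path="myfile_cleaned.txt")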

Count Characters of a Sentence

This function counts the total number of characters in the given text.

tp.count_chars(myfile="file.txt")

Convert Excel to Multiple Text Files

This function converts the columns of an Excel file into multiple text files.

tp.excel2multitext(
    excel_file_path="",
    column_names=None,
    src_file="",
    tgt_file="",
    aligns_file="",
    separator="|||",
    src_tgt_file=""
)
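
A concrete call might look like the following; the file names and column names here are hypothetical, and column_names is assumed to take the Excel column headers to export:

from data_preprocessors import text_preprocessor as tp

# Hypothetical Excel file with "source" and "target" columns.
tp.excel2multitext(
    excel_file_path="parallel_data.xlsx",
    column_names=["source", "target"],
    src_file="data.src",
    tgt_file="data.tgt",
    aligns_file="data.align",
    separator="|||",
    src_tgt_file="data.src-tgt"
)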

Apply a Function to a Whole Text File

In place of function_name, you can pass any function, and it will be applied to the whole text file, with the result written to modified_file_path.

from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name, 
    myfile_path="myfile.txt", 
    modified_file_path="modified_file.txt"
)
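
For example, to remove punctuation from every line of a file (assuming the function is passed as a callable, as the placeholder above suggests):

from data_preprocessors import text_preprocessor as tp

# Apply remove_punc to the whole file and write the result to a new file.
tp.apply_whole(
    tp.remove_punc,
    myfile_path="myfile.txt",
    modified_file_path="modified_file.txt"
)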


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data_preprocessors-0.41.0.tar.gz (9.1 kB)

Uploaded Source

Built Distribution

data_preprocessors-0.41.0-py3-none-any.whl (8.6 kB)

Uploaded Python 3

File details

Details for the file data_preprocessors-0.41.0.tar.gz.

File metadata

  • Download URL: data_preprocessors-0.41.0.tar.gz
  • Upload date:
  • Size: 9.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.14 Linux/6.5.0-1022-azure

File hashes

Hashes for data_preprocessors-0.41.0.tar.gz

  • SHA256: cfcbc5c75dd9b40bad6fe1e228ead410ac26561d4208e3f782dbb1d675a6a7ad
  • MD5: 2932c29869e62fd8c33791c9f3d4fafe
  • BLAKE2b-256: 71e3d8212edd3d3e3e3929a33dab84169c7ead10eb3124bd0a59c91996a97230

See more details on using hashes here.

File details

Details for the file data_preprocessors-0.41.0-py3-none-any.whl.

File metadata

File hashes

Hashes for data_preprocessors-0.41.0-py3-none-any.whl

  • SHA256: a2b557e416405e9998b84752aa75bc7c3ec344d0252a27fec3a0f3fe1f24db69
  • MD5: 0fa63d2c8d280dc0da5be2b6179b5b9f
  • BLAKE2b-256: 896922db1a2cfd131ff09de9a8a3ebcd9aa0c9507c5dd64c249738a7ce90bec4

See more details on using hashes here.
