An easy-to-use tool for data preprocessing, especially text preprocessing

Data Preprocessors

Installation

Install the latest stable release
For Windows

pip install -U data-preprocessors

For Linux/WSL2

pip3 install -U data-preprocessors

Quick Start

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla

Features

Split Textfile

This function splits a text file into three separate files: train, validation, and test. Set shuffle to True to shuffle the lines before splitting; a fixed seed value makes the shuffle reproducible.

from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)

# Total lines:  500
# Train set size:  300
# Validation set size:  100
# Test set size:  100
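To make the size arithmetic concrete, here is a minimal sketch of a seeded three-way split in plain Python. It is not the library's implementation: the helper name split_lines is hypothetical, and the rounding used by split_textfile may differ.

```python
import random

def split_lines(lines, train_size=0.6, val_size=0.2, shuffle=True, seed=42):
    """Hypothetical helper: split a list of lines into train/val/test parts."""
    lines = list(lines)  # copy so the caller's list is not modified
    if shuffle:
        # A dedicated, seeded generator keeps the order reproducible
        random.Random(seed).shuffle(lines)
    n_train = round(len(lines) * train_size)
    n_val = round(len(lines) * val_size)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train, val, test = split_lines([f"line {i}" for i in range(500)])
print(len(train), len(val), len(test))  # 300 100 100
```

Giving the remainder to the test set means that every input line lands in exactly one of the three parts, even when the fractions do not divide the line count evenly.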

Separate Parallel Corpus

This function separates a combined src_tgt_file into a src_file and a tgt_file, using the given separator.

from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
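As a rough illustration of the idea, the sketch below splits in-memory lines on the separator. It assumes each line holds a source segment, the separator, and a target segment; the helper name separate_pairs and the sample sentences are hypothetical, and the library function works on files rather than lists.

```python
def separate_pairs(lines, separator="|||"):
    """Hypothetical helper: split 'source ||| target' lines into two lists."""
    src_lines, tgt_lines = [], []
    for line in lines:
        # Split on the first separator only, so the target may contain it too
        src, _, tgt = line.partition(separator)
        src_lines.append(src.strip())
        tgt_lines.append(tgt.strip())
    return src_lines, tgt_lines

src, tgt = separate_pairs(["hello ||| bonjour", "thank you ||| merci"])
print(src)  # ['hello', 'thank you']
print(tgt)  # ['bonjour', 'merci']
```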

Decontracting Words from Sentence

tp.decontracting_words(sentence)
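Decontracting generally means expanding contracted forms such as "can't" into "cannot". The sketch below shows that general idea with a small hand-written rule table; the rules and the helper name expand_contractions are illustrative assumptions, and the rules used by decontracting_words may differ.

```python
import re

# A small, illustrative rule table (an assumption, not the library's list).
# Irregular forms come first so the general rules do not match them.
RULES = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(sentence):
    """Hypothetical helper: apply each rewrite rule in order."""
    for pattern, replacement in RULES:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

print(expand_contractions("I can't go, but they're sure we'll manage."))
# I cannot go, but they are sure we will manage.
```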

Remove Punctuation

This function removes the punctuation from a single line of text.

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla
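For readers curious how such a step can be done with the standard library alone, here is a minimal sketch. The helper name strip_punctuation is hypothetical, and it only handles the ASCII characters in string.punctuation, which may not match what remove_punc covers.

```python
import string

def strip_punctuation(sentence):
    """Hypothetical helper: delete every ASCII punctuation character."""
    # A translation table whose third argument lists characters to delete
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table)

print(strip_punctuation("bla! bla- ?bla ?bla."))
# bla bla bla bla
```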

Space Punctuation

This function adds one space on both sides of each punctuation mark so that the sentence is easier to tokenize. It operates on a single line of text, but it can also be applied to a whole text file.

from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)

# Prints the sentence with a space on each side of every punctuation mark
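The effect on tokenization is easiest to see with a small regular-expression sketch. This is an assumption about the general technique, not the library's code: the helper name pad_punctuation is hypothetical, and it treats any character that is neither a word character nor whitespace as punctuation.

```python
import re

def pad_punctuation(sentence):
    """Hypothetical helper: put one space on each side of punctuation."""
    # Surround every character that is neither a word character nor whitespace
    padded = re.sub(r"([^\w\s])", r" \1 ", sentence)
    # Collapse the runs of spaces created above and trim the ends
    return re.sub(r"\s+", " ", padded).strip()

spaced = pad_punctuation("bla! bla- ?bla ?bla.")
print(spaced)          # bla ! bla - ? bla ? bla .
print(spaced.split())  # each punctuation mark is now its own token
```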

Text File to List

Convert any text file into a list.

mylist = tp.text2list(myfile_path="myfile.txt")

List to Text File

Convert any list into a text file (filename.txt)

tp.list2text(mylist=mylist, myfile_path="myfile.txt")
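The two conversions are natural inverses of each other. The sketch below shows a round trip in plain Python, assuming one list item per line; the helper names read_lines and write_lines are hypothetical stand-ins, not the library functions.

```python
import os
import tempfile

def read_lines(path):
    """Hypothetical helper: read a text file into a list, one item per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def write_lines(items, path):
    """Hypothetical helper: write each list item on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(f"{item}\n" for item in items)

# Use a temporary directory so the example leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
write_lines(["first line", "second line"], path)
print(read_lines(path))  # ['first line', 'second line']
```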

Count Characters of a Sentence

This function counts the total number of characters in the given text. The call below passes a file path through the myfile parameter.

tp.count_chars(myfile="file.txt")

Convert Excel to Multiple Text Files

This function converts the columns of an Excel file into multiple text files.

tp.excel2multitext(
    excel_file_path="",
    column_names=None,
    src_file="",
    tgt_file="",
    aligns_file="",
    separator="|||",
    src_tgt_file=""
)

Apply a Function to a Whole Text File

Pass any function in place of function_name, and that function will be applied to the whole text file.

from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name, 
    myfile_path="myfile.txt", 
    modified_file_path="modified_file.txt"
)
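Conceptually, this is a line-by-line map over a file. The following sketch assumes the function is called once per line and that its result is written to the output file; the helper name apply_to_file is hypothetical, and apply_whole may behave differently.

```python
import os
import tempfile

def apply_to_file(func, in_path, out_path):
    """Hypothetical helper: write func(line) for every line of in_path."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Strip the newline before calling func, then add it back
            dst.write(func(line.rstrip("\n")) + "\n")

# Build a small input file in a temporary directory
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "myfile.txt")
out_path = os.path.join(workdir, "modified_file.txt")
with open(in_path, "w", encoding="utf-8") as f:
    f.write("Hello\nWorld\n")

apply_to_file(str.upper, in_path, out_path)
with open(out_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

Any single-sentence function, such as a punctuation remover, can be passed in the same way as str.upper here.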
