An easy-to-use tool for data preprocessing, especially text preprocessing
Table of Contents
- Installation
- Quick Start
- Features
- Split Textfile
- Build Parallel Corpus
- Separate Parallel Corpus
- Decontracting Words from Sentence
- Remove Punctuation
- Space Punctuation
- Text File to List
- Text File to Dataframe
- List to Text File
- Remove File
- Count Characters of a Sentence
- Count Words of Sentence
- Count No of Lines in a Text File
- Convert Excel to Multiple Text Files
- Merge Multiple Text Files
- Apply Any Function in a Full Text File
Installation
Install the latest stable release
For Windows
pip install -U data-preprocessors
For Linux/WSL2
pip3 install -U data-preprocessors
Quick Start
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
# bla bla bla bla
Features
Split Textfile
This function splits your text file into three separate text files: train, validation, and test. By changing the shuffle and seed values, you can randomly shuffle the lines of your text file.
from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)
# Total lines: 500
# Train set size: 300
# Validation set size: 100
# Test set size: 100
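To make the role of shuffle and seed concrete, here is a minimal standard-library sketch of a seeded train/validation/test split. It is not the library's implementation; the helper name split_lines is hypothetical and it works on an in-memory list rather than on files.

```python
import random


def split_lines(lines, train_size=0.6, val_size=0.2, test_size=0.2,
                shuffle=True, seed=42):
    """Hypothetical helper: split a list of lines into train/val/test lists."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("split sizes must sum to 1")
    lines = list(lines)  # copy so the caller's list is not reordered
    if shuffle:
        # A dedicated seeded generator gives the same order on every run.
        random.Random(seed).shuffle(lines)
    n_train = round(len(lines) * train_size)
    n_val = round(len(lines) * val_size)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]  # the remainder, so no line is dropped
    return train, val, test


lines = [f"line {i}" for i in range(500)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))
```

Keeping the same seed reproduces the same split; changing it produces a different shuffle of the same lines.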
Separate Parallel Corpus
This function lets you easily separate src_tgt_file into a separate src_file and tgt_file.
from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
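As an illustration of what separating a parallel corpus involves, the sketch below splits "source ||| target" lines into two lists. It assumes each line holds one pair joined by the separator; the helper name separate_pairs is hypothetical, and the library's handling of malformed lines may differ.

```python
def separate_pairs(pair_lines, separator="|||"):
    """Hypothetical helper: split 'src ||| tgt' lines into two lists."""
    src_lines, tgt_lines = [], []
    for line in pair_lines:
        src, sep, tgt = line.rstrip("\n").partition(separator)
        if not sep:
            continue  # assumption: lines without the separator are skipped
        src_lines.append(src.strip())
        tgt_lines.append(tgt.strip())
    return src_lines, tgt_lines


pairs = ["hello ||| bonjour\n", "thank you ||| merci\n", "no separator here\n"]
src, tgt = separate_pairs(pairs)
print(src)
print(tgt)
```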
Decontracting Words from Sentence
tp.decontracting_words(sentence)
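The README does not describe this function, so the sketch below rests on an assumption: that "decontracting" means expanding English contractions such as "can't" into "cannot". The helper name decontract and its rule list are hypothetical, cover only a few lowercase forms, and are not taken from the library.

```python
import re

# Specific irregular forms come first, then general suffix rules.
_RULES = [
    (r"won['’]t", "will not"),
    (r"can['’]t", "cannot"),
    (r"n['’]t", " not"),
    (r"['’]re", " are"),
    (r"['’]ll", " will"),
    (r"['’]ve", " have"),
    (r"['’]m", " am"),
]


def decontract(sentence):
    """Hypothetical helper: expand a few common English contractions."""
    for pattern, replacement in _RULES:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence


print(decontract("I can't go, they won't wait and we're late"))
```

Ambiguous endings such as 's and 'd are left untouched here because they cannot be expanded reliably without context.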
Remove Punctuation
This function removes the punctuation from a single line of a text file.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
# bla bla bla bla
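For readers curious how punctuation removal can be done, here is a minimal standard-library sketch. It assumes "punctuation" means the ASCII characters in string.punctuation; the helper name strip_punctuation is hypothetical and the library may treat other characters differently.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence):
    """Hypothetical helper: drop ASCII punctuation from a sentence."""
    return sentence.translate(_DELETE_PUNCT)


print(strip_punctuation("bla! bla- ?bla ?bla."))
# bla bla bla bla
```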
Space Punctuation
This function adds one space on both sides of each punctuation mark so that the sentence is easier to tokenize. It applies to a single line of a text file, but it can also be applied to a full text file.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)
# bla ! bla - ? bla ? bla .  (expected shape; exact spacing may differ)
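The idea of spacing out punctuation can be sketched with a regular expression. This version assumes ASCII punctuation only and collapses repeated spaces afterwards, which the library may not do; the helper name pad_punctuation is hypothetical.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
_PUNCT = re.compile("([" + re.escape(string.punctuation) + "])")


def pad_punctuation(sentence):
    """Hypothetical helper: surround punctuation with single spaces."""
    padded = _PUNCT.sub(r" \1 ", sentence)
    # Collapse runs of whitespace created by the padding step.
    return " ".join(padded.split())


print(pad_punctuation("bla! bla- ?bla ?bla."))
# bla ! bla - ? bla ? bla .
```

After this step a plain whitespace split yields words and punctuation marks as separate tokens.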
Text File to List
Convert any text file into a list.
mylist = tp.text2list(myfile_path="myfile.txt")
List to Text File
Convert any list into a text file (for example, myfile.txt).
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
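The two conversions mirror each other, which a small round trip shows. This is a standard-library sketch that assumes one list item per line; the helper names read_lines and write_lines are hypothetical and stand in for the general idea, not for the library's code.

```python
import os
import tempfile


def read_lines(path):
    """Hypothetical helper: read a text file into a list, one item per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(items, path):
    """Hypothetical helper: write each list item on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(f"{item}\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    write_lines(["first line", "second line"], path)
    round_trip = read_lines(path)

print(round_trip)
```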
Count Characters of a Sentence
This function counts the total number of characters in a sentence.
tp.count_chars(myfile="file.txt")
Convert Excel to Multiple Text Files
This function converts an Excel file's columns into multiple text files.
tp.excel2multitext(
    excel_file_path="",
    column_names=None,
    src_file="",
    tgt_file="",
    aligns_file="",
    separator="|||",
    src_tgt_file="",
)
Apply Any Function in a Full Text File
In place of function_name, you can pass any function, and that function will be applied to the whole text file.
from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name,
    myfile_path="myfile.txt",
    modified_file_path="modified_file.txt"
)
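To show the pattern of applying one function to every line of a file, here is a minimal standard-library sketch. It assumes the function takes a single line (without its newline) and returns the modified line; the helper name apply_to_file is hypothetical and the library's behavior may differ in details.

```python
import os
import tempfile


def apply_to_file(func, in_path, out_path):
    """Hypothetical helper: write func(line) for every line of in_path."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(func(line.rstrip("\n")) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "myfile.txt")
    out_path = os.path.join(tmp, "modified_file.txt")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("hello\nworld\n")

    apply_to_file(str.upper, in_path, out_path)

    with open(out_path, encoding="utf-8") as f:
        result = f.read().splitlines()

print(result)
```

Because the file is processed line by line, the whole file never has to fit in memory.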
Hashes for data_preprocessors-0.58.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 47cc200da7a7e0428c94193f02a9a005fff686723e02e23e458c8f217a22e483
MD5 | a1224838c8642c9d643ec462b5c71492
BLAKE2b-256 | 6162a1b31149cb6e27a9c79ead9c234c04e873fcad837615f4876051573fe982
Hashes for data_preprocessors-0.58.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0e87feaa660ff5388fdadeaf4c142d42d96389d0df45c460b1d042786cf65cb9
MD5 | ae589fa9a023da3834e68c1b8c2aa1d9
BLAKE2b-256 | 932e8ee9098af9bb8e92c3d769074739bcb27e63f7b74e24a635a7f669d54d39