A set of NLP preprocessing techniques that abstracts data preparation processes, including validation and cleaning of datasets.
aies-nlp-preprocessing-tk
Project description
CSV Format Documentation
Single Label Format
The single label format is designed for datasets where each document is associated with a single class. The CSV file must adhere to the following structure:
tag | text
---|---
CLASS | Text of document 1
CLASS | Text of document 2
CLASS | Text of document 3
- CLASS: Represents the label or category of the document.
- Text of document: The actual text content of the document.
Each row in the CSV file corresponds to a single document and its associated class. The CSV must contain exactly two columns: the first column for the class and the second column for the text of the document.
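As an illustration (not part of the package), the single-label format can be read into `(tag, text)` pairs with Python's standard `csv` module:

```python
import csv
import io

# Hypothetical single-label CSV content in the documented two-column format.
raw_csv = '''greeting,"Hello!"
greeting,"Good morning!"
question,"How are you?"'''

# Each row must have exactly two columns: the class and the document text.
rows = list(csv.reader(io.StringIO(raw_csv)))
assert all(len(row) == 2 for row in rows)

tags = [row[0] for row in rows]
texts = [row[1] for row in rows]
print(tags)   # ['greeting', 'greeting', 'question']
print(texts)  # ['Hello!', 'Good morning!', 'How are you?']
```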
Multi Label Format
The multi label format is intended for datasets where each document can be associated with multiple classes. The CSV file must follow this structure:
tag | text
---|---
CLASS | Text of document 1
CLASS|CLASS|CLASS | Text of document 2
CLASS|CLASS | Text of document 3
- CLASS: Represents a single label or category of the document.
- CLASS|CLASS|CLASS: Represents multiple labels or categories separated by the `|` character.
- Text of document: The actual text content of the document.

Each row in the CSV file corresponds to a single document and its associated classes. The CSV must contain exactly two columns: the first column for the classes and the second column for the text of the document. For multi-label documents, multiple classes must be separated by the `|` character, with no leading or trailing `|` characters.
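A minimal sketch (hypothetical, not package code) of parsing the multi-label tag column by splitting on `|`:

```python
import csv
import io

# Hypothetical multi-label CSV in the documented format.
raw_csv = '''greeting,"Hello!"
greeting|question,"Hi, how are you?"'''

rows = list(csv.reader(io.StringIO(raw_csv)))

# The tag column may hold several classes separated by '|'.
labels = [row[0].split("|") for row in rows]
print(labels)  # [['greeting'], ['greeting', 'question']]
```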
Example CSV Files
Single Label Example
```
greeting,"Hello!"
greeting,"Good morning!"
question,"How are you?"
```
Multi Label Example
```
greeting,"Hello!"
greeting|question,"Hi, how are you?"
question|feedback,"What do you think of this service?"
```
Validation Rules
- Single Label: The `tag` column must contain only one class per row. No `|` character should be present.
- Multi Label: The `tag` column can contain multiple classes separated by `|`. Ensure there are no empty classes and no leading or trailing `|` characters.
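The rules above can be checked with a small helper. This is a hypothetical sketch, not a function provided by the package:

```python
def validate_tag(tag: str, multi_label: bool) -> bool:
    """Check one tag cell against the documented validation rules."""
    if not multi_label:
        # Single label: exactly one non-empty class, no '|' allowed.
        return bool(tag) and "|" not in tag
    # Multi label: splitting on '|' must yield no empty classes, which also
    # rules out leading and trailing '|' characters.
    return all(part != "" for part in tag.split("|"))

print(validate_tag("greeting", multi_label=False))           # True
print(validate_tag("greeting|question", multi_label=False))  # False
print(validate_tag("greeting|question", multi_label=True))   # True
print(validate_tag("|greeting", multi_label=True))           # False
```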
raw_tokenization
Tokenizes text data and prepares it for training a neural network model.
Text Tokenization:
The input textual data is tokenized, i.e., split into individual words or tokens. This is done using the spaCy library, which provides robust linguistic annotations.
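The package uses spaCy for this step; as a rough stand-in to show what tokenization produces, a regex-based sketch (hypothetical, far simpler than spaCy's rules):

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # Rough illustration of tokenization: split into word runs and
    # individual punctuation marks. spaCy applies much richer rules.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hi, how are you?"))
# ['Hi', ',', 'how', 'are', 'you', '?']
```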
Text Cleaning:
- Removing Punctuation: If specified, punctuation marks are removed from the text. This helps in simplifying the text and reducing noise in the data.
- Removing Stop Words: Optionally, common stop words (e.g., 'and', 'the', 'is') can be removed from the text. Stop words often carry little semantic meaning and can be safely excluded from the analysis.
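Both cleaning steps can be sketched in plain Python. The stop-word list here is a hypothetical stand-in; the package derives its list from spaCy's language models:

```python
import string

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"and", "the", "is", "are", "how"}

def clean_tokens(tokens, remove_punctuation=True, remove_stop_words=True):
    if remove_punctuation:
        # Drop tokens that are single punctuation marks.
        tokens = [t for t in tokens if t not in string.punctuation]
    if remove_stop_words:
        # Drop common words that carry little semantic meaning.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

print(clean_tokens(["How", "are", "you", "?"]))  # ['you']
```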
Padding Sequences:
Text sequences are padded to ensure uniform length. This is necessary for feeding the data into a neural network, which typically requires fixed-size inputs. Padding is done using the Keras `pad_sequences` function.
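Conceptually, `pad_sequences` turns variable-length lists of token ids into a fixed-width matrix. A minimal pure-Python sketch of its default behavior (padding and truncating at the front):

```python
def pad_sequences_sketch(sequences, maxlen, value=0):
    # Mimics Keras pad_sequences defaults: pad with `value` at the front,
    # and truncate from the front (keeping the tail) when too long.
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_sequences_sketch([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
# [[0, 0, 1, 2], [4, 5, 6, 7]]
```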
Label Encoding:
If the labels are categorical, they are encoded using either `LabelEncoder` or `MultiLabelBinarizer` from the scikit-learn library. This step converts textual labels into numerical representations, which are easier for the neural network to process.
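The two encodings can be sketched in plain Python to show what the scikit-learn classes produce (hypothetical helpers, for illustration only):

```python
def encode_single_labels(labels):
    # What LabelEncoder does: map each class name to an integer index.
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels], classes

def binarize_multi_labels(label_lists):
    # What MultiLabelBinarizer does: a 0/1 indicator column per class.
    classes = sorted({c for labels in label_lists for c in labels})
    return [[int(c in labels) for c in classes] for labels in label_lists], classes

encoded, classes = encode_single_labels(["greeting", "question", "greeting"])
print(encoded, classes)  # [0, 1, 0] ['greeting', 'question']

matrix, classes = binarize_multi_labels([["greeting"], ["greeting", "question"]])
print(matrix)  # [[1, 0], [1, 1]]
```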
Data Splitting:
Optionally, the preprocessed data can be split into training and testing sets using `train_test_split` from scikit-learn. This facilitates model evaluation by providing a separate dataset for testing.
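A pure-Python sketch of what such a split does (shuffle, then slice off a test portion); the package itself relies on scikit-learn's `train_test_split`:

```python
import random

def split_sketch(X, y, test_size=0.25, seed=42):
    # Shuffle indices deterministically, then slice off the test set.
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = [[1], [2], [3], [4]]
y = [0, 1, 0, 1]
X_train, X_test, y_train, y_test = split_sketch(X, y, test_size=0.25)
print(len(X_train), len(X_test))  # 3 1
```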
Parameters
- `data_frame` (pandas.DataFrame): The input DataFrame containing 'text' and 'tag' columns.
- `max_length` (int): Maximum length of sequences after padding. Important: this is a critical parameter; low values can truncate sequences and disrupt tokenization, so values above 100 are recommended.
- `split_test_size` (float, optional): Size of the test dataset if splitting is needed. Defaults to None.
- `remove_stop_words` (bool, optional): Whether to remove stop words. Defaults to False.
- `remove_punctuation` (bool, optional): Whether to remove punctuation. Defaults to False.
- `language` (str, optional): Language to be used for tokenization. Defaults to "portuguese".
Returns
`tuple`: The contents depend on `split_test_size`:
- If `split_test_size` is provided, the data is split into training and test sets and the function returns `X_train, X_test, y_train, y_test, word_index`.
- If `split_test_size` is not provided, the function returns the tokenized texts, the encoded labels, and the word index: `X, y, word_index`.
File details
Details for the file aies_nlp_preprocessing_tk-1.0.1.tar.gz.
File metadata
- Download URL: aies_nlp_preprocessing_tk-1.0.1.tar.gz
- Upload date:
- Size: 5.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8bb6cb76350279a833f55e7c9c5f7424c4c96cc189daab59016a6bea9383eb6b
MD5 | ff85a2b78b759ce7f3198a48c65fa595
BLAKE2b-256 | 92c3ab87233501e1db9525bf4c0712bc1013ccc8a7dd35f499bb83f49eb30566
File details
Details for the file aies_nlp_preprocessing_tk-1.0.1-py3-none-any.whl.
File metadata
- Download URL: aies_nlp_preprocessing_tk-1.0.1-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9d4313d376eb12c5f08623c4104eda05d322f275591d58cf83d3a651ca61715d
MD5 | 3b793ac52f7c706766067bcff2140b06
BLAKE2b-256 | 64113555ff012c87b3abddb503ca0ae7a8e9fda807363d1c3b6c4d24e248028b