Project description
Data-Pipeline
Motivation
Hate speech detection faces challenges due to the diverse manifestations of abusive language across different tasks and languages. There is no universal model, as existing solutions target specific phenomena like racial discrimination or abusive language individually. With the rise of foundation models, there is a growing need for a unified dataset that integrates various hate speech datasets to support a comprehensive solution. Additionally, the lack of multilingual data, especially for low-resource languages, further complicates model development. A flexible, scalable data processing pipeline is essential to address these challenges, streamline dataset integration, and support future model advancements in hate speech detection across languages and tasks.
Dataset-to-SQLite Pipeline
The dataset-to-SQLite pipeline is composed of modular components, each responsible for a distinct phase of the data management workflow. This design ensures flexibility, maintainability, and ease of extension across stages like configuration, data insertion, validation, and querying.
`config` Module
The `config` module simplifies importing data files (e.g., CSV, TSV) whose columns may not match the target database schema. A configuration file maps source file columns to the correct database tables, ensuring smooth integration. The module is built on a base class with an inheritance structure, allowing easy adaptation to future schema changes without breaking compatibility with the validator.
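For illustration, a mapping configuration for a single CSV dataset might look roughly like the sketch below. The class and field names (`BaseDatasetConfig`, `column_map`, and so on) are assumptions for this example, not the package's actual API.

```python
# Hypothetical sketch of a column-mapping configuration; class and field
# names are illustrative, not the package's actual interface.
from dataclasses import dataclass, field


@dataclass
class BaseDatasetConfig:
    """Base class mapping source-file columns to database tables."""
    dataset_name: str
    files: list[str] = field(default_factory=list)
    # source column name -> (target table, target column)
    column_map: dict[str, tuple[str, str]] = field(default_factory=dict)


@dataclass
class ExampleCsvConfig(BaseDatasetConfig):
    """Subclass adapting a CSV file whose headers differ from the schema."""
    delimiter: str = ","


config = ExampleCsvConfig(
    dataset_name="example_corpus",
    files=["example_corpus.csv"],
    column_map={
        "tweet": ("texts", "text"),
        "class": ("labels", "label"),
        "lang": ("texts", "language"),
    },
)
```

Keeping the per-dataset mapping in a subclass of a shared base class is what lets new datasets or schema revisions be added without breaking the validator.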
`loader` Module
The `loader` module is responsible for validating, formatting, and loading datasets into the database. It operates in a structured, phase-based manner (a sketch of the flow follows the list below):
- Validator: Ensures the integrity of the incoming datasets by checking that all required files and columns (as specified in the configuration file) are present. This prevents incomplete or corrupted data from entering the pipeline.
- Formatter: Breaks down validated datasets into multiple dataframes, formatting them to match the target database schema. This step improves clarity and efficiency in the loading process.
- Loader: Manages the data insertion process, handling both single and multi-file datasets. It ensures data integrity by controlling commit and rollback operations on a per-dataset basis.
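Conceptually, the three phases fit together as in the following sketch for a single-file dataset. The function names and the `config` object are assumptions carried over from the configuration example above, not the package's actual interface; pandas and the standard-library `sqlite3` module stand in for whatever the loader uses internally.

```python
import sqlite3

import pandas as pd

# Illustrative phase-based flow for one single-file dataset; function names
# are assumptions, not the package's actual API.

def validate(config, df: pd.DataFrame) -> None:
    """Validator: ensure every column named in the config is present."""
    missing = [col for col in config.column_map if col not in df.columns]
    if missing:
        raise ValueError(f"dataset is missing required columns: {missing}")


def format_frames(config, df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Formatter: split the validated data into per-table dataframes."""
    tables: dict[str, dict[str, pd.Series]] = {}
    for source_col, (table, column) in config.column_map.items():
        tables.setdefault(table, {})[column] = df[source_col]
    return {table: pd.DataFrame(cols) for table, cols in tables.items()}


def load(tables: dict[str, pd.DataFrame], db_path: str) -> None:
    """Loader: insert all tables for one dataset in a single transaction."""
    conn = sqlite3.connect(db_path)
    try:
        for table, frame in tables.items():
            frame.to_sql(table, conn, if_exists="append", index=False)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

Committing or rolling back per dataset, as in `load` above, is what keeps a failure in one file from leaving partial rows behind.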
`database` Module
The `database` module manages schema setup and data querying to ensure smooth integration and retrieval:
- Setup: Creates all database tables in the correct order, maintaining foreign key constraints. It also offers a reset function to clear tables when needed, simplifying schema management.
- Querying: Provides two main interfaces: one for displaying dataset-text-label information (with optional source-language details) and another for executing queries from external SQL files (sketched below). Both include a `show_lines` parameter for previewing rows and support exporting query results to CSV or TSV files.
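As a rough illustration of the second interface's behaviour, the helper below runs a query from an external .sql file, previews the first `show_lines` rows, and optionally exports the result. It is a standalone sketch using pandas and `sqlite3` with an assumed signature, not the package's own implementation.

```python
import sqlite3
from pathlib import Path

import pandas as pd

# Standalone sketch of the described behaviour; not the package's actual code.
def run_sql_file(db_path: str, sql_path: str, show_lines: int = 5,
                 export_path: str | None = None) -> pd.DataFrame:
    query = Path(sql_path).read_text(encoding="utf-8")
    with sqlite3.connect(db_path) as conn:
        result = pd.read_sql_query(query, conn)
    print(result.head(show_lines))  # preview the first `show_lines` rows
    if export_path is not None:
        sep = "\t" if export_path.endswith(".tsv") else ","
        result.to_csv(export_path, sep=sep, index=False)  # CSV or TSV export
    return result
```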
`utils` Module
The `utils` module includes a set of helpful tools for data analysis and selection during the dataset preparation phase (conceptual examples follow the list):
- Distribute Tool: Analyzes the distribution of one column relative to another, helping users identify balanced or imbalanced data points, useful for dataset selection.
- Fuzzysearch Tool: Allows approximate matching within the dataset, helping locate relevant data, such as label definitions or metadata, without requiring exact queries.
- Sampling Tool: Provides three pre-configured sampling strategies to ensure balanced and representative data subsets for experimental setups.
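The pandas and standard-library snippet below gives a rough sense of what the distribute, fuzzy-search, and sampling steps accomplish; it is a conceptual stand-in, not the package's own functions or signatures.

```python
import difflib

import pandas as pd

# Toy data standing in for a loaded dataset.
df = pd.DataFrame({
    "dataset": ["A", "A", "A", "B", "B", "B"],
    "label":   ["hate", "none", "none", "hate", "hate", "none"],
})

# Distribute: how one column (label) is distributed relative to another (dataset).
print(df.groupby("dataset")["label"].value_counts(normalize=True))

# Fuzzy search: approximate matching, e.g. locating a label definition despite
# a typo in the query (difflib stands in for the fuzzysearch tool here).
print(difflib.get_close_matches("rascism", ["racism", "sexism", "none"], n=1))

# Sampling: a simple balanced subset with one row per (dataset, label) pair.
print(df.groupby(["dataset", "label"]).sample(n=1, random_state=0))
```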
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Download files
Source Distribution
STITCHED-0.1.0.tar.gz (22.1 kB)
Built Distribution
STITCHED-0.1.0-py3-none-any.whl (20.9 kB)
File details
Details for the file STITCHED-0.1.0.tar.gz.
File metadata
- Download URL: STITCHED-0.1.0.tar.gz
- Upload date:
- Size: 22.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | c310f2f28784f8512c8f65bd6228bc2bd29e5ea8ee2e73fe321cd4f7ee759b1f
MD5 | 21e92654fe75f5c0f800ba5d7f794c16
BLAKE2b-256 | 8ce360350805f4c68d074d81c27f9988807e2df9929ae8962a1bb664a8cef2e7
File details
Details for the file STITCHED-0.1.0-py3-none-any.whl.
File metadata
- Download URL: STITCHED-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 045e68e88d17cd122f614959e36fff4b0d36619c74ea68a609a19db73fd885cc
MD5 | f6c4300ba941b00dd857f30fd2b5ae36
BLAKE2b-256 | 2f07936715b497694e2c14aa0806c8ec5f81a6f84bcb7997f65d0304845f3d81