DataLabX v0.1.0b10 Pre-Release - First Public Release with Class-Based API & Flexible Data Loading.

Project description

A diagnosis-first data quality and preparation framework for real-world data.

DataLabX is a Python library designed to help you understand, diagnose, and safely prepare messy datasets - before analysis or modeling.

Most data failures don't happen during modeling. They happen earlier: during data understanding, cleaning, and unsafe transformations.

DataLabX exists to fix that.

What is DataLabX?

DataLabX is a structured framework for working with messy, real-world data.

It is designed for datasets where:

  • Values are inconsistent, invalid, or misleading

  • Missing data appears in many hidden forms

  • Column types are unclear or mixed

  • Blind automation is risky

Instead of guessing or silently coercing data, datalabx focuses on:

  • Clarity

  • Control

  • Explainability

datalabx helps you understand what your data is doing before deciding what to do with it.

Who is DataLabX for?

DataLabX is built for:

  • Analysts & Data Scientists working with messy, real-world datasets

  • Researchers & Engineers needing structured data diagnostics

  • Beginners who want safe, guided workflows

  • Advanced users who want transparency instead of black boxes

If you care about well-understood data, DataLabX is for you.

Core Philosophy

Diagnosis-first, not automation-first.

DataLabX assumes that your data is dirty by default.

Instead of hiding problems, it:

  • detects them

  • explains them

  • lets you decide what to do

DataLabX is built around a simple idea:

Different data types need different thinking

DataLabX separates workflows by data type:

  • Numerical

  • Categorical

  • Text

  • Datetime

  • (Graph data coming soon)

This keeps workflows:

  • clear

  • safe

  • reproducible

What makes DataLabX different?

  • Designed for extremely messy datasets (≈77-90% invalid or inconsistent values)

  • Tested on datasets with 5-10 million rows

  • Type-aware diagnosis and cleaning

  • Regex-based detection of hidden issues

  • Structured, beginner-safe APIs

  • Human-friendly documentation

DataLabX combines:

  • power for advanced users

  • safety and clarity for beginners
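To make "regex-based detection of hidden issues" concrete, here is a minimal sketch in plain pandas. It is not DataLabX's API: the pattern, the token list, and the `hidden_missing_report` helper are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical placeholder tokens that often stand in for missing values.
# The list is an assumption for illustration; extend it for your own data.
HIDDEN_MISSING_PATTERN = r"^\s*(?:-+|\?+|n/?a|nan|null|none|missing|unknown)?\s*$"


def hidden_missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count true NaNs and placeholder-style missing values per column."""
    rows = []
    for col in df.columns:
        series = df[col]
        true_missing = int(series.isna().sum())
        # Match only non-null values, compared as text, ignoring case.
        as_text = series.dropna().astype(str)
        hidden = int(as_text.str.match(HIDDEN_MISSING_PATTERN, case=False).sum())
        rows.append(
            {"column": col, "true_missing": true_missing, "hidden_missing": hidden}
        )
    return pd.DataFrame(rows).set_index("column")


df = pd.DataFrame(
    {
        "age": ["34", "N/A", "?", None, "41"],
        "city": ["Pune", "", "unknown", "Delhi", "--"],
    }
)
report = hidden_missing_report(df)
print(report)
```

The report only counts; it does not replace anything. That keeps the decision about what to do with each placeholder in your hands.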

How DataLabX Works

With DataLabX, you typically:

  • Load data

  • Diagnose structure, types, and issues

  • Analyze missingness and inconsistencies

  • Apply type-specific cleaning & preprocessing

  • Compute statistics and distributions

  • Visualize behavior and patterns

Each step is explicit, modular, and explainable.
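As a rough illustration of what the "diagnose" step can look like, the sketch below gathers a few structural facts with plain pandas. The `quick_diagnosis` function is hypothetical and is not part of DataLabX; it only shows the kind of read-only summary the workflow describes.

```python
import pandas as pd


def quick_diagnosis(df: pd.DataFrame) -> dict:
    """Collect basic structural facts about a DataFrame without modifying it."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "cardinality": df.nunique(dropna=True).to_dict(),
    }


sample = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "score": [9.5, 8.0, 8.0, None],
    }
)
summary = quick_diagnosis(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Nothing here changes the data. Cleaning would be a separate, explicit step taken after reading the summary.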

Current Version: v0.1 (Pre-Release)

Focus in v0.1

Tabular data workflows, including:

  • Data loading (CSV, Excel, JSON, Parquet)

  • Data diagnosis & dirty data detection

  • Missingness analysis & visualization

  • Numerical & categorical workflows

  • Cleaning & preprocessing

  • Statistical computations

  • Matplotlib-based visualizations

  • Beginner-friendly documentation & workflow guides

Pandas is fully supported. Polars is used internally for performance in selected components.
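For the data loading item above, one common way to support several formats is to choose a reader from the file extension. The sketch below does that with standard pandas readers; the `load_table` helper and the `READERS` mapping are assumptions for illustration and may not match how DataLabX detects file types.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Map file extensions to pandas readers (these are standard pandas functions).
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_table(path, **kwargs) -> pd.DataFrame:
    """Pick a pandas reader from the file extension; fail loudly if unknown."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return reader(path, **kwargs)


# Small demonstration with a temporary CSV file.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    csv_path.write_text("id,value\n1,a\n2,b\n")
    loaded = load_table(csv_path)

print(loaded)
```

Raising an error for unknown extensions, rather than guessing a format, fits the diagnosis-first approach described in this README.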

Installation (v0.1 Pre-Release)

DataLabX is available on TestPyPI for early testing and feedback.

Install the datalabx pre-release using pip:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple datalabx-pre-release==0.1.0b10

Why this long command?

That is because DataLabX itself is downloaded from TestPyPI, while required dependencies (such as pandas) are downloaded from PyPI.

Importing DataLabX

import datalabx
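Note that the distribution name used with pip (datalabx-pre-release) differs from the import name (datalabx). If you want to confirm which version is installed, the standard library's `importlib.metadata` can look it up by distribution name. The `installed_version` helper below is a hypothetical convenience wrapper, not something DataLabX provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None if the package is not installed.
print(installed_version("datalabx-pre-release"))
```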

Installation Video

Installation and Getting Started Video

👉 https://youtu.be/RC4SzXxRSHk

Updating to the Latest TestPyPI Version

If you already installed an earlier pre-release version of datalabx from TestPyPI, you can upgrade to the latest test version using:

pip install --upgrade --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple datalabx-pre-release

This ensures you always get the most recent pre-release version available on TestPyPI.

⚠️ Note:

This is a pre-release version and is not yet intended for production use.

Project Structure:

datalabx/
│
├── datalabx/                # Main Python package
│   ├── tabular/
│   │   ├── data_loader/
│   │   ├── data_diagnosis/
│   │   ├── data_cleaning/
│   │   ├── data_preprocessing/
│   │   ├── computations/
│   │   ├── data_visualization/
│   │   ├── data_analysis/         # (To be added in v0.2)
│   │   └── utils/
│   │
│   └── graph/              # (To be added in v0.3)
│
├── docs/                 # API documentation
├── foundations/          # datalabx Foundational concepts
├── guides/               # API Usage & Workflow Guide notebooks for each step
├── assets/               # Images, logos, diagrams
│   └── datalabx_logo.png
├── DataLabX_API_RETURN_TYPES.md     # Public API Return Types Reference
├── DataLabX_DATA_HANDLING_POLICY.md # DataLabX's policy on data handling
├── DataLabX_DATA_HANDLING_REPORT.md # DataLabX's current report on data handling
├── CHANGELOG.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── LICENSE
├── pyproject.toml
├── requirements.txt
├── MANIFEST.in
└── README.md

Features in v0.1:

✔️ 1. Data Loading: CSV, Excel, JSON, and Parquet, with automatic file type detection.

✔️ 2. Data Diagnosis: Shape, columns, dtypes, memory usage, duplicates, cardinality, numerical and categorical diagnosis, dirty data diagnosis.

✔️ 3. Missingness Diagnosis and Visualization: Missing data stats, pattern analysis, missing data plots (via missingno).

✔️ 4. Cleaning & Preprocessing: Numerical and categorical workflows, missing data handling.

✔️ 5. Computation: Descriptive stats, distributions, outlier detection, correlation.

✔️ 6. Visualization: Histograms, boxplots, KDE, QQ plots, categorical plots, missingness plots (using missingno).

✔️ 7. Documentation & Workflow Guides: Friendly documentation, visual examples, workflow guides explaining why, not just how.
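The feature list mentions outlier detection without naming a method. As one common possibility, the sketch below uses the interquartile range (IQR) rule in plain pandas. Both the choice of IQR and the `iqr_outliers` helper are assumptions for illustration, not a description of what DataLabX implements.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


prices = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outliers(prices)
print(prices[mask])
```

The function returns a mask rather than dropping rows, so flagged values can be inspected before any decision is made.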

🧭 Roadmap:

v0.1 - Tabular data foundations

v0.2 - Text workflows & advanced analysis

v0.3 - Graph data workflows

v0.4 - Machine learning workflows

v0.5 - API review & stabilization

Why would I even use datalabx?

Because most data problems don't come from bad models - they come from poor data understanding.

DataLabX is built to feel like:

Someone sitting next to you, explaining what your data is doing and why.

🤝 Contributions

DataLabX is in early development. Ideas, feedback, and contributions are absolutely welcome!

If you'd like to contribute, please follow our contribution guidelines:

  • Read the contributing guide: CONTRIBUTING.md explains DataLabX's philosophy, workflow, and how to make meaningful contributions.
  • Report a bug: Use the bug report template to submit any issues or unexpected behavior.
  • Request a feature: Use the feature request template to propose new functionality.

Following these steps helps ensure your contributions align with datalabx's diagnosis-first philosophy and saves time for both you and the maintainers.

✉️ Contact & Support

For questions, suggestions, feedback, or issues related to DataLabX, you can reach us at:

Email: DataLabX@protonmail.com

We aim to respond within 72 hours.

⚠️ AI Usage Disclosure

AI tools were used selectively to:

  • clarify concepts

  • explore edge cases

  • generate realistic messy datasets for testing

All core design, implementation, documentation, and decisions were made by the author.

AI was used as a support and learning tool - not as a replacement for thinking, understanding, authorship, or ownership.

Download files

Source Distribution

datalabx_pre_release-0.1.0b10.tar.gz (43.7 kB)

Uploaded Source

Built Distribution

datalabx_pre_release-0.1.0b10-py3-none-any.whl (61.0 kB)

Uploaded Python 3

File details

Details for the file datalabx_pre_release-0.1.0b10.tar.gz.

File metadata

  • Download URL: datalabx_pre_release-0.1.0b10.tar.gz
  • Upload date:
  • Size: 43.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.13

File hashes

Hashes for datalabx_pre_release-0.1.0b10.tar.gz

Algorithm    Hash digest
SHA256       413883bd2c2a8c38ebc82f7cef97562274b45467a92ba3f37d83479387db2c9f
MD5          6f37544a9b7bf427483404fc725e5be0
BLAKE2b-256  5ba0cd093756ef60a86579da588b2dea29173e3bc56c45291bb740e034f2a915

File details

Details for the file datalabx_pre_release-0.1.0b10-py3-none-any.whl.

File hashes

Hashes for datalabx_pre_release-0.1.0b10-py3-none-any.whl

Algorithm    Hash digest
SHA256       72cd42be08c1dcb3ccb9d28a345739bb8ec1c8d9594b2f392b0443b22db77ca6
MD5          d4ba2bc2bae86089f31a62a826ee0ecb
BLAKE2b-256  1208363e4f84542347932256fb792addef6bba31ebf8ad8c6830e9b7eb87081c
