
Lightweight Intelligent Data Automation Engine — plug-and-play pipelines for everyone.

DataPilot 🚀 — Your Partner in Data Automation

"The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency." — Bill Gates

Welcome to DataPilot. I'm here to guide you from raw, messy datasets to production-ready signals in just one line of code. DataPilot isn't just a library; it's an intelligent engine designed to handle the heavy lifting of data engineering so you can focus on the insight.


🎓 The DataPilot Way (Mentoring Guide)

As your guide, I recommend starting with the "Simple API." It's designed to give you professional-grade results without the complexity of manual boilerplate.

1. The "Master Brain": auto_pipeline (Adaptive)

The most powerful command in DataPilot. It analyzes your data, detects quality issues, and dynamically constructs a custom pipeline without any manual configuration. It also prints a beautiful Automation Decision Report explaining its choices.

import datapilot

# One command to rule them all
result = datapilot.auto_pipeline("messy_raw_data.csv")

2. General Data Cleaning: auto_clean

The "Gold Standard" for standard tabular data. It normalizes your column names, infers data types, fills missing values (0 for ints, "null" for strings), and removes duplicates.

# Perfect for daily reporting and BI
datapilot.auto_clean("sales_raw.xlsx", "sales_final.csv")

3. Machine Learning Ready: auto_ml_prep

Preparing data for a model? This command does everything auto_clean does, plus Outlier Clipping, Categorical Encoding, and MinMax Scaling.

# From raw data to 'model.fit()' ready
datapilot.auto_ml_prep("users.csv", "training_data.csv")

4. Specialized: auto_analytics & auto_text_prep

  • Use auto_analytics for time-series and BI reports.
  • Use auto_text_prep for LLM and RAG workflows (it handles text cleaning and chunking). Example calls for both are shown below.
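
Assuming they follow the same (source, output) call pattern as auto_clean (the file names here are placeholders):

# Time-series / BI preparation
datapilot.auto_analytics("events_raw.csv", "events_report.csv")

# Document preparation for embedding workflows
datapilot.auto_text_prep("docs.csv", "chunks.csv")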

🧠 Intelligence Advisor

Before you clean, you might want to understand what's wrong. Run the Intelligence Advisor to get a proactive report on your data health:

datapilot.analyze_dataset("mysterious_data.csv")

🛠 Becoming a Pro: The Pipeline Class

For those who need granular control, the Pipeline class is your cockpit. You can mix and match templates or define custom steps.

from datapilot import Pipeline

# Craft a custom journey
pipe = Pipeline(template="ml_preprocess")
pipe.set_source("raw.csv")
pipe.set_output("ready.csv")

# Execute with performance tracking
result = pipe.run()
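
The template above covers the standard path. To sketch what a custom step could look like, the add_step hook below is a hypothetical name, not a confirmed DataPilot API; check the Pipeline reference for the actual extension point:

# Hypothetical: add_step is an illustrative name, not a documented DataPilot method.
# The idea: a custom step is just a callable that takes and returns a DataFrame.
pipe.add_step(lambda df: df[df["age"].between(0, 120)])  # e.g. drop impossible ages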

📦 Installation

pip install aidatapilot

Note: the package is published on PyPI as aidatapilot, while the import name in code is datapilot.

🏗 Why DataPilot?

  • Production-Ready: Built with registry patterns and robust error handling.
  • Memory Safe: Designed to handle large datasets without crashing your environment.
  • Intelligent: Heuristic-based suggestions that improve over time.

Happy Automating! Feel free to reach out if you need help navigating your data pipelines.


🔍 Deep Dive: Understanding the Operations

Every "auto" command in DataPilot is carefully designed to handle specific business and data needs. Here is exactly what happens under the hood:

🚀 auto_pipeline (The Smart Choice)

  • Components: IntelligenceAdvisor + Recommended Template.
  • What it does: Dynamically analyzes data patterns (null counts, skewness, text length, dates) and selects the best cleaning path.
  • Why it's useful: Eliminates guesswork. Ideal for unknown or messy data when you don't know where to start. It's the "set it and forget it" tool.
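
To make the heuristics concrete, here is a purely illustrative sketch in plain pandas of how such a recommendation could be derived. It is not DataPilot's actual decision code; the thresholds are arbitrary, and the template names other than "ml_preprocess" are guesses:

import pandas as pd

def recommend_template(df: pd.DataFrame) -> str:
    text_cols = df.select_dtypes(include="object")
    # Long free-text columns suggest an LLM/RAG workload
    for col in text_cols:
        if text_cols[col].dropna().astype(str).str.len().mean() > 200:
            return "text_prep"
    # Columns that mostly parse as dates suggest a BI/time-series path
    for col in text_cols:
        parsed = pd.to_datetime(text_cols[col], errors="coerce")
        if parsed.notna().mean() > 0.8:
            return "analytics"
    # Heavily skewed numeric columns call for ML-style clipping and scaling
    numeric = df.select_dtypes(include="number")
    if not numeric.empty and (numeric.skew().abs() > 2).any():
        return "ml_preprocess"
    return "clean"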

✨ auto_clean (The Gold Standard)

  • Components: normalize_columns, infer_types, handle_missing_data, deduplicate.
  • What it does: Cleans column names, casts types, fills missing values (interpolating ID columns, 0 for other integers, "null" for strings), and removes duplicates.
  • Why it's useful: The perfect daily cleaning tool. Ensures your data is tidy and error-free for most general tasks.
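
In plain pandas, those four steps map roughly onto the following (a sketch for intuition, not DataPilot's implementation; the ID-interpolation detail is omitted):

import pandas as pd

df = pd.read_csv("sales_raw.csv")

# normalize_columns: trimmed, lowercase, snake_case headers
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# infer_types: let pandas re-infer tighter dtypes
df = df.convert_dtypes()

# handle_missing_data: 0 for numeric columns, the string "null" elsewhere
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(0)
    else:
        df[col] = df[col].fillna("null")

# deduplicate: drop exact duplicate rows
df = df.drop_duplicates()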

🤖 auto_ml_prep (Model Readiness)

  • Components: auto_clean steps + outlier_detection, encode_categorical, scale_numeric.
  • What it does: Beyond basic cleaning, it handles numeric outliers (via clipping), encodes text categories to numbers, and scales values (MinMax 0–1).
  • Why it's useful: High-speed preparation for model training. Most ML models (Scikit-Learn, PyTorch) require numeric, scaled data with no missing values.
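
The three extra steps correspond roughly to these pandas operations (again a sketch with arbitrary percentile bounds, not the library's exact behavior):

import pandas as pd

df = pd.read_csv("users.csv")
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# outlier_detection (clipping): cap values at the 1st/99th percentiles
for col in num_cols:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)

# encode_categorical: map each category to an integer code
for col in cat_cols:
    df[col] = df[col].astype("category").cat.codes

# scale_numeric: MinMax scaling into the 0-1 range
for col in num_cols:
    rng = df[col].max() - df[col].min()
    if rng > 0:  # skip constant or empty columns
        df[col] = (df[col] - df[col].min()) / rng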

📊 auto_analytics (BI & Reporting)

  • Components: normalize_columns, format_date, handle_missing_data, deduplicate, basic_aggregation.
  • What it does: Special focus on Universal Date Parsing and deduplication. Includes optional aggregation for quick reporting.
  • Why it's useful: Best for time-series data and business dashboards, where date consistency and low redundancy are critical.
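
A rough pandas equivalent of the analytics path (the column names order_date and revenue are illustrative placeholders):

import pandas as pd

df = pd.read_csv("events_raw.csv")

# format_date: coerce mixed date formats into a single datetime dtype
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# deduplicate before aggregating so repeated rows don't inflate totals
df = df.drop_duplicates()

# basic_aggregation: e.g. daily revenue for a dashboard
daily = df.groupby(df["order_date"].dt.date)["revenue"].sum().reset_index()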

📄 auto_text_prep (LLM & RAG)

  • Components: normalize_columns, clean_text, generate_metadata, chunk_text.
  • What it does: Cleans document text, calculates word counts/lengths, and splits long text into overlapping chunks.
  • Why it's useful: Essential for AI applications. Prepares documents for embedding and storage in Vector Databases (like Pinecone or Chroma).
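
The chunking idea is worth seeing in miniature: consecutive chunks share an overlap so that content falling on a boundary appears intact in at least one chunk. A minimal character-based sketch (DataPilot's real chunk_text and its parameters may differ, e.g. it may split on token or sentence boundaries):

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of `size` characters forward by (size - overlap) each step
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Example: a 1,000-character document yields chunks starting at 0, 450, and 900
chunks = chunk_text("x" * 1000)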

