Advanced Data Analytics Package with cleaning, statistics, and visualization

DataX - Advanced Data Analytics Package

DataX is a Python package for data analytics that provides tools for data cleaning, statistical analysis, and visualization. It can be used either as a Python library or through a command-line interface.

🚀 Features

Core Functionality

  • Advanced Data Cleaning: Missing value handling, outlier detection, data validation, type conversion
  • Comprehensive Statistics: Descriptive statistics, correlation analysis, hypothesis testing, regression analysis
  • Rich Visualizations: Statistical plots, interactive charts, customizable themes, export capabilities
  • Command Line Interface: Full CLI support with interactive mode and batch processing
  • High Performance: Optimized for large datasets with efficient memory usage

Advanced Features

  • Interactive Mode: Jupyter notebook integration and interactive plotting
  • Statistical Modeling: Linear regression, ANOVA, normality testing
  • Data Validation: Custom rule-based validation with comprehensive reporting
  • Export Capabilities: Multiple output formats (CSV, Excel, JSON, Parquet)
  • Extensible Architecture: Plugin system for custom analyzers and visualizers

📦 Installation

From PyPI (Recommended)

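# The distribution on PyPI is named datax-analytics; the import name is datax
pip install datax-analytics

From Source

git clone https://github.com/datax/datax.git
cd datax
pip install -e .

With Optional Dependencies

# Quote the extras so shells such as zsh do not expand the square brackets

# For development
pip install "datax-analytics[dev]"

# For documentation
pip install "datax-analytics[docs]"

# For Jupyter integration
pip install "datax-analytics[jupyter]"

# All optional dependencies
pip install "datax-analytics[all]"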

🎯 Quick Start

Python API

import pandas as pd
from datax import DataCleaner, DataAnalyzer, DataVisualizer

# Load your data
df = pd.read_csv('your_data.csv')

# Data Cleaning
cleaner = DataCleaner(df)
cleaner.handle_missing_values(method='auto')
cleaner.remove_duplicates()
cleaner.handle_outliers(method='iqr', action='cap')
cleaned_data = cleaner.data

# Statistical Analysis
analyzer = DataAnalyzer(cleaned_data)
desc_stats = analyzer.get_descriptive_stats()
correlation = analyzer.get_correlation_matrix()
regression = analyzer.regression_analysis('target_column', ['feature1', 'feature2'])

# Visualization
visualizer = DataVisualizer(cleaned_data)
visualizer.plot_distribution('column_name')
visualizer.plot_correlation_heatmap()
visualizer.plot_multiple_distributions(['col1', 'col2', 'col3'])

Command Line Interface

# Load data and get information
datax load data.csv info

# Clean data with auto missing value handling
datax load data.csv clean --missing auto --remove-duplicates

# Perform statistical analysis
datax load data.csv stats --descriptive --correlation

# Create visualizations
datax load data.csv viz --distributions --correlation-heatmap

# Interactive mode
datax interactive --file data.csv

📊 Examples

Data Cleaning Pipeline

from datax import DataCleaner
import pandas as pd

# Load data
df = pd.read_csv('messy_data.csv')

# Initialize cleaner
cleaner = DataCleaner(df)

# Comprehensive cleaning pipeline (each step returns the cleaner, so calls can be chained)
(
    cleaner.handle_missing_values(method='auto')
    .remove_duplicates()
    .handle_outliers(method='iqr', action='cap')
    .convert_data_types(auto_convert=True)
    .validate_data()
)

# Get cleaning summary
summary = cleaner.get_cleaning_summary()
print(f"Original shape: {summary['original_shape']}")
print(f"Final shape: {summary['current_shape']}")

# Save cleaned data
cleaner.save_cleaned_data('cleaned_data.csv')
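
This README does not spell out what handle_outliers(method='iqr', action='cap') does internally. A common reading of those options is Tukey's rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are clipped to the nearest bound. The sketch below shows that technique in plain pandas, under that assumption; cap_outliers_iqr and the multiplier k are hypothetical names for illustration and are not part of the DataX API.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # clip() replaces out-of-range values with the bound instead of dropping rows
    return series.clip(lower=lower, upper=upper)


# Illustrative data: five ordinary values and one obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

Capping keeps the row count unchanged, which matters when other columns in the same row are still valid; dropping the row instead would discard them.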

Statistical Analysis

import pandas as pd
from datax import DataAnalyzer

# Load data
df = pd.read_csv('your_data.csv')

analyzer = DataAnalyzer(df)

# Descriptive statistics
desc_stats = analyzer.get_descriptive_stats()

# Correlation analysis
correlation = analyzer.get_correlation_matrix(method='pearson')

# Hypothesis testing
ttest_result = analyzer.hypothesis_test(
    'ttest',
    column1='group1',
    column2='group2',
)

# Regression analysis
regression = analyzer.regression_analysis(
    'target',
    ['feature1', 'feature2', 'feature3'],
)

# ANOVA analysis
anova = analyzer.anova_analysis('value_column', 'group_column')

# Export results
analyzer.export_results('analysis_results.json')
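
The README lists linear regression among the features but does not document what regression_analysis fits or returns. Assuming it performs ordinary least squares, the underlying computation can be sketched with NumPy alone. The synthetic data, the variable names, and the choice of np.linalg.lstsq are illustrative assumptions, not DataX internals.

```python
import numpy as np

# Synthetic data generated from y = 1 + 2*x1 - 3*x2 with no noise,
# so the coefficients the fit should recover are known in advance.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot
residuals = y - design @ coef
r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

print("intercept and slopes:", coef)
print("R^2:", r_squared)
```

With real data the residuals are not zero, and R^2 reports the share of the variance in the target that the features explain.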

Advanced Visualizations

import pandas as pd
from datax import DataVisualizer

# Load data
df = pd.read_csv('your_data.csv')

visualizer = DataVisualizer(df, style='colorful')

# Distribution plots
visualizer.plot_distribution('numeric_column', plot_type='histogram', kde=True)

# Correlation heatmap
visualizer.plot_correlation_heatmap(annot=True)

# Multiple distributions
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
visualizer.plot_multiple_distributions(numeric_cols[:6])

# Interactive plots
interactive_fig = visualizer.create_interactive_plot('scatter',
                                                   x_column='x',
                                                   y_column='y',
                                                   color_column='category')

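# Save plots (this assumes the plotting methods return a figure object)
fig = visualizer.plot_correlation_heatmap(annot=True)
visualizer.save_plot(fig, 'output.png', format='png', dpi=300)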

🛠️ CLI Usage

Basic Commands

# Show help
datax --help

# Load and analyze data
datax load data.csv info
datax load data.csv clean --missing auto
datax load data.csv stats --descriptive --correlation
datax load data.csv viz --distributions --correlation-heatmap

# Interactive mode
datax interactive --file data.csv

Advanced CLI Features

# Batch processing
datax batch config.json

# Custom output formats
datax load data.csv clean --output cleaned_data.xlsx --format excel

# Verbose output
datax load data.csv stats --descriptive --verbose

# Save plots
datax load data.csv viz --distributions --save-plots ./plots/

📈 Performance

DataX is optimized for performance with large datasets:

  • Memory Efficient: Uses pandas' efficient data structures
  • Vectorized Operations: Leverages NumPy and pandas vectorization
  • Lazy Evaluation: Computes statistics only when needed
  • Parallel Processing: Supports multiprocessing for large datasets
  • Caching: Intelligent caching of computed results
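
The lazy evaluation and caching items above describe a general pattern: an expensive result is computed on first access and then reused. The README does not show how DataX implements it, so the following is only a minimal standard-library sketch of the idea; the LazyStats class and its compute_count attribute are hypothetical.

```python
from functools import cached_property
from statistics import mean, stdev


class LazyStats:
    """Hypothetical helper: computes summary statistics on first access only."""

    def __init__(self, values):
        self._values = list(values)
        self.compute_count = 0  # counts how often the expensive step ran

    @cached_property
    def summary(self):
        # Runs once; cached_property stores the result on the instance.
        self.compute_count += 1
        return {"mean": mean(self._values), "stdev": stdev(self._values)}


stats = LazyStats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
count_before = stats.compute_count  # nothing has been computed yet
first = stats.summary               # computed here
second = stats.summary              # served from the cache
print(count_before, stats.compute_count, first["mean"])
```

A cache like this is only valid while the underlying data is unchanged; any in-place modification would need to invalidate it.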

🔧 Configuration

Custom Themes and Styles

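from datax import DataVisualizer

# Choose a visualization style at construction time (df is a pandas DataFrame)
visualizer = DataVisualizer(df, style='dark')

# Or switch the style on an existing visualizer
visualizer.set_style('minimal')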

# Custom color palettes
import seaborn as sns
sns.set_palette("Set2")

Advanced Configuration

# Custom validation rules
validation_rules = {
    "age_range": {
        "type": "range",
        "column": "age",
        "min": 0,
        "max": 120
    },
    "unique_id": {
        "type": "unique",
        "column": "id"
    }
}

# Apply the rules to an existing DataCleaner instance
cleaner.validate_data(rules=validation_rules, strict=True)
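
How validate_data interprets these rules is not documented here. One plausible reading is that a "range" rule flags values outside [min, max] and a "unique" rule flags repeated values. The sketch below applies that interpretation with plain pandas; check_rules, the sample DataFrame, and the shape of the returned report are assumptions for illustration, not the DataX implementation.

```python
import pandas as pd


def check_rules(df: pd.DataFrame, rules: dict) -> dict:
    """Return {rule_name: number_of_violating_rows} for 'range' and 'unique' rules."""
    violations = {}
    for name, rule in rules.items():
        column = df[rule["column"]]
        if rule["type"] == "range":
            # between() is inclusive on both ends; missing values count as violations
            ok = column.between(rule["min"], rule["max"])
            violations[name] = int((~ok).sum())
        elif rule["type"] == "unique":
            # keep=False marks every row that shares its value with another row
            violations[name] = int(column.duplicated(keep=False).sum())
        else:
            raise ValueError(f"unknown rule type: {rule['type']!r}")
    return violations


rules = {
    "age_range": {"type": "range", "column": "age", "min": 0, "max": 120},
    "unique_id": {"type": "unique", "column": "id"},
}
people = pd.DataFrame({"id": [1, 2, 2, 4], "age": [34, -5, 61, 130]})
report = check_rules(people, rules)
print(report)
```

A strict mode, as suggested by strict=True above, could raise an exception whenever any count in such a report is non-zero instead of only reporting it.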

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/datax/datax.git
cd datax
pip install -e ".[dev]"
pre-commit install
pytest

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=datax --cov-report=html

# Run specific test categories
pytest -m "not slow"
pytest -m integration

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🗺️ Roadmap

  • Machine learning integration
  • Time series analysis
  • Geospatial data support
  • Web dashboard interface
  • Real-time data processing
  • Cloud deployment support

DataX - Making data analytics accessible, powerful, and enjoyable! 🚀
