DataX - Advanced Data Analytics Package


DataX is a comprehensive Python package for data analytics that provides powerful tools for data cleaning, statistical analysis, and visualization. Built with modern Python practices, it offers both programmatic and command-line interfaces for maximum flexibility.

🚀 Features

Core Functionality

  • Advanced Data Cleaning: Missing value handling, outlier detection, data validation, type conversion
  • Comprehensive Statistics: Descriptive statistics, correlation analysis, hypothesis testing, regression analysis
  • Rich Visualizations: Statistical plots, interactive charts, customizable themes, export capabilities
  • Command Line Interface: Full CLI support with interactive mode and batch processing
  • High Performance: Optimized for large datasets with efficient memory usage
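
The IQR-based outlier handling mentioned above follows a standard rule: values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are clipped to the boundary ("capped") rather than dropped. A minimal plain-pandas sketch of the idea (an illustration, not DataX internals):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the bounds."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

values = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier
capped = cap_outliers_iqr(values)
```

Capping preserves the row count, which matters when downstream analysis expects every record to survive cleaning.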

Advanced Features

  • Interactive Mode: Jupyter notebook integration and interactive plotting
  • Statistical Modeling: Linear regression, ANOVA, normality testing
  • Data Validation: Custom rule-based validation with comprehensive reporting
  • Export Capabilities: Multiple output formats (CSV, Excel, JSON, Parquet)
  • Extensible Architecture: Plugin system for custom analyzers and visualizers
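
A plugin system like the one described is commonly built around a registry that maps names to callables. A hypothetical sketch of that pattern (names such as `register_analyzer` are illustrative, not the actual DataX API):

```python
import pandas as pd

# Hypothetical registry: maps analyzer names to functions.
ANALYZERS = {}

def register_analyzer(name):
    """Decorator that registers a custom analyzer under a name."""
    def decorator(func):
        ANALYZERS[name] = func
        return func
    return decorator

@register_analyzer("row_count")
def row_count(df: pd.DataFrame) -> int:
    return len(df)

df = pd.DataFrame({"a": [1, 2, 3]})
result = ANALYZERS["row_count"](df)
```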

📦 Installation

From PyPI (Recommended)

pip install datax-py

From Source

git clone https://github.com/amirbekazimov/datax-py.git
cd datax-py
pip install -e .

With Optional Dependencies

# For development
pip install datax-py[dev]

# For documentation
pip install datax-py[docs]

# For Jupyter integration
pip install datax-py[jupyter]

# All optional dependencies
pip install datax-py[all]

🎯 Quick Start

Python API

import pandas as pd
from datax import DataCleaner, DataAnalyzer, DataVisualizer

# Load your data
df = pd.read_csv('your_data.csv')

# Data Cleaning
cleaner = DataCleaner(df)
cleaner.handle_missing_values(method='auto')
cleaner.remove_duplicates()
cleaner.handle_outliers(method='iqr', action='cap')
cleaned_data = cleaner.data

# Statistical Analysis
analyzer = DataAnalyzer(cleaned_data)
desc_stats = analyzer.get_descriptive_stats()
correlation = analyzer.get_correlation_matrix()
regression = analyzer.regression_analysis('target_column', ['feature1', 'feature2'])

# Visualization
visualizer = DataVisualizer(cleaned_data)
visualizer.plot_distribution('column_name')
visualizer.plot_correlation_heatmap()
visualizer.plot_multiple_distributions(['col1', 'col2', 'col3'])

Command Line Interface

# Load data and get information
datax load data.csv info

# Clean data with auto missing value handling
datax load data.csv clean --missing auto --remove-duplicates

# Perform statistical analysis
datax load data.csv stats --descriptive --correlation

# Create visualizations
datax load data.csv viz --distributions --correlation-heatmap

# Interactive mode
datax interactive --file data.csv

📊 Examples

Data Cleaning Pipeline

from datax import DataCleaner
import pandas as pd

# Load data
df = pd.read_csv('messy_data.csv')

# Initialize cleaner
cleaner = DataCleaner(df)

# Comprehensive cleaning pipeline
cleaner.handle_missing_values(method='auto') \
       .remove_duplicates() \
       .handle_outliers(method='iqr', action='cap') \
       .convert_data_types(auto_convert=True) \
       .validate_data()

# Get cleaning summary
summary = cleaner.get_cleaning_summary()
print(f"Original shape: {summary['original_shape']}")
print(f"Final shape: {summary['current_shape']}")

# Save cleaned data
cleaner.save_cleaned_data('cleaned_data.csv')

Statistical Analysis

from datax import DataAnalyzer

analyzer = DataAnalyzer(df)

# Descriptive statistics
desc_stats = analyzer.get_descriptive_stats()

# Correlation analysis
correlation = analyzer.get_correlation_matrix(method='pearson')

# Hypothesis testing
ttest_result = analyzer.hypothesis_test('ttest', 
                                       column1='group1', 
                                       column2='group2')

# Regression analysis
regression = analyzer.regression_analysis('target', 
                                        ['feature1', 'feature2', 'feature3'])

# ANOVA analysis
anova = analyzer.anova_analysis('value_column', 'group_column')

# Export results
analyzer.export_results('analysis_results.json')
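
As a sanity check, the Pearson coefficients from `get_correlation_matrix(method='pearson')` should agree with what pandas computes directly:

```python
import pandas as pd

# y is an exact linear function of x, so the Pearson correlation is 1.0.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})
corr = df.corr(method="pearson")
```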

Advanced Visualizations

from datax import DataVisualizer

visualizer = DataVisualizer(df, style='colorful')

# Distribution plots
visualizer.plot_distribution('numeric_column', plot_type='histogram', kde=True)

# Correlation heatmap
visualizer.plot_correlation_heatmap(annot=True)

# Multiple distributions
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
visualizer.plot_multiple_distributions(numeric_cols[:6])

# Interactive plots
interactive_fig = visualizer.create_interactive_plot('scatter',
                                                   x_column='x',
                                                   y_column='y',
                                                   color_column='category')

# Save plots
visualizer.save_plot(interactive_fig, 'output.png', format='png', dpi=300)

🛠️ CLI Usage

Basic Commands

# Show help
datax --help

# Load and analyze data
datax load data.csv info
datax load data.csv clean --missing auto
datax load data.csv stats --descriptive --correlation
datax load data.csv viz --distributions --correlation-heatmap

# Interactive mode
datax interactive --file data.csv

Advanced CLI Features

# Batch processing
datax batch config.json

# Custom output formats
datax load data.csv clean --output cleaned_data.xlsx --format excel

# Verbose output
datax load data.csv stats --descriptive --verbose

# Save plots
datax load data.csv viz --distributions --save-plots ./plots/

📈 Performance

DataX is optimized for performance with large datasets:

  • Memory Efficient: Uses pandas' efficient data structures
  • Vectorized Operations: Leverages NumPy and pandas vectorization
  • Lazy Evaluation: Computes statistics only when needed
  • Parallel Processing: Supports multiprocessing for large datasets
  • Caching: Intelligent caching of computed results
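
For datasets too large to fit in memory, the usual pandas approach is chunked reading, which is the kind of pattern the points above refer to (a generic example, not DataX internals):

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk: the "value" column holds 0..9.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# Process the file in fixed-size chunks so only one chunk is in memory at a time.
running_sum = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    running_sum += chunk["value"].sum()
```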

🔧 Configuration

Custom Themes and Styles

# Set custom visualization style
visualizer = DataVisualizer(df, style='dark')
visualizer.set_style('minimal')

# Custom color palettes
import seaborn as sns
sns.set_palette("Set2")

Advanced Configuration

# Custom validation rules
validation_rules = {
    "age_range": {
        "type": "range",
        "column": "age",
        "min": 0,
        "max": 120
    },
    "unique_id": {
        "type": "unique",
        "column": "id"
    }
}

cleaner.validate_data(rules=validation_rules, strict=True)
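
To make the rule schema above concrete, here is a minimal standalone validator sketch in plain pandas (illustrative only, not the DataX implementation):

```python
import pandas as pd

def validate(df: pd.DataFrame, rules: dict) -> dict:
    """Return {rule_name: passed} for the 'range'/'unique' rule schema."""
    results = {}
    for name, rule in rules.items():
        col = df[rule["column"]]
        if rule["type"] == "range":
            results[name] = bool(col.between(rule["min"], rule["max"]).all())
        elif rule["type"] == "unique":
            results[name] = col.is_unique
    return results

# Both rules fail here: age 130 is out of range, and id 2 repeats.
df = pd.DataFrame({"age": [25, 40, 130], "id": [1, 2, 2]})
report = validate(df, {
    "age_range": {"type": "range", "column": "age", "min": 0, "max": 120},
    "unique_id": {"type": "unique", "column": "id"},
})
```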

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/amirbekazimov/datax-py.git
cd datax-py
pip install -e ".[dev]"
pre-commit install
pytest

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=datax --cov-report=html

# Run specific test categories
pytest -m "not slow"
pytest -m integration

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🗺️ Roadmap

  • Machine learning integration
  • Time series analysis
  • Geospatial data support
  • Web dashboard interface
  • Real-time data processing
  • Cloud deployment support

DataX - Making data analytics accessible, powerful, and enjoyable! 🚀
