Advanced Data Analytics Package with cleaning, statistics, and visualization
DataX - Advanced Data Analytics Package
DataX is a comprehensive Python package for data analytics that provides powerful tools for data cleaning, statistical analysis, and visualization. Built with modern Python practices, it offers both programmatic and command-line interfaces for maximum flexibility.
🚀 Features
Core Functionality
- Advanced Data Cleaning: Missing value handling, outlier detection, data validation, type conversion
- Comprehensive Statistics: Descriptive statistics, correlation analysis, hypothesis testing, regression analysis
- Rich Visualizations: Statistical plots, interactive charts, customizable themes, export capabilities
- Command Line Interface: Full CLI support with interactive mode and batch processing
- High Performance: Optimized for large datasets with efficient memory usage
Advanced Features
- Interactive Mode: Jupyter notebook integration and interactive plotting
- Statistical Modeling: Linear regression, ANOVA, normality testing
- Data Validation: Custom rule-based validation with comprehensive reporting
- Export Capabilities: Multiple output formats (CSV, Excel, JSON, Parquet)
- Extensible Architecture: Plugin system for custom analyzers and visualizers
📦 Installation
From PyPI (Recommended)
pip install datax-py
From Source
git clone https://github.com/amirbekazimov/datax-py.git
cd datax-py
pip install -e .
With Optional Dependencies
# For development
pip install "datax-py[dev]"
# For documentation
pip install "datax-py[docs]"
# For Jupyter integration
pip install "datax-py[jupyter]"
# All optional dependencies
pip install "datax-py[all]"
🎯 Quick Start
Python API
import pandas as pd
from datax import DataCleaner, DataAnalyzer, DataVisualizer
# Load your data
df = pd.read_csv('your_data.csv')
# Data Cleaning
cleaner = DataCleaner(df)
cleaner.handle_missing_values(method='auto')
cleaner.remove_duplicates()
cleaner.handle_outliers(method='iqr', action='cap')
cleaned_data = cleaner.data
# Statistical Analysis
analyzer = DataAnalyzer(cleaned_data)
desc_stats = analyzer.get_descriptive_stats()
correlation = analyzer.get_correlation_matrix()
regression = analyzer.regression_analysis('target_column', ['feature1', 'feature2'])
# Visualization
visualizer = DataVisualizer(cleaned_data)
visualizer.plot_distribution('column_name')
visualizer.plot_correlation_heatmap()
visualizer.plot_multiple_distributions(['col1', 'col2', 'col3'])
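The README does not say what `method='auto'` does for missing values. A minimal sketch of one plausible strategy, assuming numeric columns are filled with the median and all other columns with the most frequent value, is shown below. It uses only the standard library, and `fill_missing` is a hypothetical helper, not part of DataX.

```python
from statistics import median, mode


def fill_missing(rows, column):
    """Fill None in `column`: median for numeric columns, mode otherwise.

    Hypothetical helper; the strategy is an assumption, not DataX's documented behavior.
    """
    present = [r[column] for r in rows if r[column] is not None]
    if not present:
        return rows  # nothing to infer a fill value from
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    fill = median(present) if numeric else mode(present)
    # Build new dicts so the input rows are left untouched.
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


rows = [
    {"age": 30, "city": "Oslo"},
    {"age": None, "city": "Oslo"},
    {"age": 50, "city": None},
    {"age": 40, "city": "Bergen"},
]
filled = fill_missing(fill_missing(rows, "age"), "city")
print(filled)
```

Returning new rows rather than mutating in place keeps the original data available for a before/after comparison.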
Command Line Interface
# Load data and get information
datax load data.csv info
# Clean data with auto missing value handling
datax load data.csv clean --missing auto --remove-duplicates
# Perform statistical analysis
datax load data.csv stats --descriptive --correlation
# Create visualizations
datax load data.csv viz --distributions --correlation-heatmap
# Interactive mode
datax interactive --file data.csv
📊 Examples
Data Cleaning Pipeline
from datax import DataCleaner
import pandas as pd
# Load data
df = pd.read_csv('messy_data.csv')
# Initialize cleaner
cleaner = DataCleaner(df)
# Comprehensive cleaning pipeline
(
    cleaner.handle_missing_values(method='auto')
    .remove_duplicates()
    .handle_outliers(method='iqr', action='cap')
    .convert_data_types(auto_convert=True)
    .validate_data()
)
# Get cleaning summary
summary = cleaner.get_cleaning_summary()
print(f"Original shape: {summary['original_shape']}")
print(f"Final shape: {summary['current_shape']}")
# Save cleaned data
cleaner.save_cleaned_data('cleaned_data.csv')
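To make the `method='iqr', action='cap'` step concrete, here is a dependency-free sketch of the usual IQR rule: values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` are clipped to the nearest fence. The function name `cap_outliers_iqr` and the multiplier `k=1.5` are assumptions for illustration; DataX's own implementation may differ.

```python
from statistics import quantiles


def cap_outliers_iqr(values, k=1.5):
    """Clip values to the IQR fences (hypothetical helper, not DataX code)."""
    # method="inclusive" uses linear interpolation between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


data = [10, 12, 11, 13, 12, 11, 95]
capped = cap_outliers_iqr(data)
print(capped)  # the last value is pulled down to the upper fence
```

Capping keeps the row count unchanged, which is why the summary shapes can stay equal even when outliers were handled.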
Statistical Analysis
import pandas as pd
from datax import DataAnalyzer
# Load data
df = pd.read_csv('your_data.csv')
analyzer = DataAnalyzer(df)
# Descriptive statistics
desc_stats = analyzer.get_descriptive_stats()
# Correlation analysis
correlation = analyzer.get_correlation_matrix(method='pearson')
# Hypothesis testing
ttest_result = analyzer.hypothesis_test(
    'ttest',
    column1='group1',
    column2='group2',
)
# Regression analysis
regression = analyzer.regression_analysis(
    'target',
    ['feature1', 'feature2', 'feature3'],
)
# ANOVA analysis
anova = analyzer.anova_analysis('value_column', 'group_column')
# Export results
analyzer.export_results('analysis_results.json')
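For readers who want to see what a two-sample t-test computes, the sketch below derives Welch's t statistic and its approximate degrees of freedom with the standard library. Whether `hypothesis_test('ttest', ...)` uses Welch's or Student's variant is not stated in this README, so treat the choice here as an assumption. `welch_t` is a hypothetical helper, and no p-value is computed because the standard library has no t-distribution CDF.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean.
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


group1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group2 = [4.1, 4.4, 4.0, 4.6, 4.3]
t_stat, dof = welch_t(group1, group2)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```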
Advanced Visualizations
import pandas as pd
from datax import DataVisualizer
# Load data
df = pd.read_csv('your_data.csv')
visualizer = DataVisualizer(df, style='colorful')
# Distribution plots
visualizer.plot_distribution('numeric_column', plot_type='histogram', kde=True)
# Correlation heatmap
visualizer.plot_correlation_heatmap(annot=True)
# Multiple distributions
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
visualizer.plot_multiple_distributions(numeric_cols[:6])
# Interactive plots
interactive_fig = visualizer.create_interactive_plot(
    'scatter',
    x_column='x',
    y_column='y',
    color_column='category',
)
# Save plots (`fig` is assumed to be a figure object returned by one of the plotting calls above)
visualizer.save_plot(fig, 'output.png', format='png', dpi=300)
🛠️ CLI Usage
Basic Commands
# Show help
datax --help
# Load and analyze data
datax load data.csv info
datax load data.csv clean --missing auto
datax load data.csv stats --descriptive --correlation
datax load data.csv viz --distributions --correlation-heatmap
# Interactive mode
datax interactive --file data.csv
Advanced CLI Features
# Batch processing
datax batch config.json
# Custom output formats
datax load data.csv clean --output cleaned_data.xlsx --format excel
# Verbose output
datax load data.csv stats --descriptive --verbose
# Save plots
datax load data.csv viz --distributions --save-plots ./plots/
📈 Performance
DataX is optimized for performance with large datasets:
- Memory Efficient: Uses pandas' efficient data structures
- Vectorized Operations: Leverages NumPy and pandas vectorization
- Lazy Evaluation: Computes statistics only when needed
- Parallel Processing: Supports multiprocessing for large datasets
- Caching: Intelligent caching of computed results
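The lazy evaluation and caching items above can be illustrated with `functools.cached_property`: the statistic is computed on first access and reused afterwards. This is only a sketch of the general pattern; the `LazyStats` class and its `computations` counter are hypothetical and say nothing about how DataX implements caching internally.

```python
from functools import cached_property
from statistics import mean, stdev


class LazyStats:
    """Compute summary statistics on first access, then reuse the result."""

    def __init__(self, values):
        self._values = list(values)
        self.computations = 0  # counts how often the summary is actually computed

    @cached_property
    def summary(self):
        self.computations += 1
        return {"mean": mean(self._values), "std": stdev(self._values)}


stats = LazyStats([2, 4, 4, 4, 5, 5, 7, 9])
first = stats.summary   # computed here
second = stats.summary  # served from the cache
print(stats.computations, first)
```

A cache like this has to be invalidated when the underlying data changes, for example by deleting the attribute with `del stats.summary`.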
🔧 Configuration
Custom Themes and Styles
# Set custom visualization style
visualizer = DataVisualizer(df, style='dark')
visualizer.set_style('minimal')
# Custom color palettes
import seaborn as sns
sns.set_palette("Set2")
Advanced Configuration
# Custom validation rules
validation_rules = {
"age_range": {
"type": "range",
"column": "age",
"min": 0,
"max": 120
},
"unique_id": {
"type": "unique",
"column": "id"
}
}
cleaner.validate_data(rules=validation_rules, strict=True)
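To show how rules of this shape could be evaluated, here is a small standalone sketch that applies `range` and `unique` rules to a list of row dictionaries and reports the offending row indexes. The `validate` function and the report format are assumptions made for illustration; they are not DataX's API, and the effect of `strict=True` is not modeled.

```python
def validate(rows, rules):
    """Return {rule_name: [indexes of violating rows]} (hypothetical helper)."""
    report = {}
    for name, rule in rules.items():
        column = rule["column"]
        bad = []
        if rule["type"] == "range":
            for i, row in enumerate(rows):
                if not rule["min"] <= row[column] <= rule["max"]:
                    bad.append(i)
        elif rule["type"] == "unique":
            seen = set()
            for i, row in enumerate(rows):
                if row[column] in seen:
                    bad.append(i)  # flag every repeat after the first occurrence
                seen.add(row[column])
        else:
            raise ValueError(f"unknown rule type: {rule['type']}")
        report[name] = bad
    return report


validation_rules = {
    "age_range": {"type": "range", "column": "age", "min": 0, "max": 120},
    "unique_id": {"type": "unique", "column": "id"},
}
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 150},
    {"id": 2, "age": 28},
]
report = validate(rows, validation_rules)
print(report)
```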
📚 Documentation
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
git clone https://github.com/amirbekazimov/datax-py.git
cd datax-py
pip install -e ".[dev]"
pre-commit install
pytest
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=datax --cov-report=html
# Run specific test categories
pytest -m "not slow"
pytest -m integration
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built on top of the amazing pandas library
- Visualization powered by matplotlib and seaborn
- Statistical functions from scipy and scikit-learn
- Interactive plots with plotly
📞 Support
- Documentation: https://datax-py.readthedocs.io
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: amirbekazimov7@gmail.com
🗺️ Roadmap
- Machine learning integration
- Time series analysis
- Geospatial data support
- Web dashboard interface
- Real-time data processing
- Cloud deployment support
DataX - Making data analytics accessible, powerful, and enjoyable! 🚀
File details
Details for the file datax_py-1.0.0.tar.gz.
File metadata
- Download URL: datax_py-1.0.0.tar.gz
- Upload date:
- Size: 47.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c86914d4acae662f4b52dc9655eff1432c7c930caaaa3ef8f50f4e6cbd8ecd0 |
| MD5 | 39d76039b745f70fbfbc844890bb84e2 |
| BLAKE2b-256 | 01c208d7c956d2c1cf5c1719a4186c2009b47de73b8258a6c2e6be0758dac533 |
File details
Details for the file datax_py-1.0.0-py3-none-any.whl.
File metadata
- Download URL: datax_py-1.0.0-py3-none-any.whl
- Upload date:
- Size: 25.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0724eca340cef1a07b7c15539993338e1d7b18f524b5bf211648abd0247c90a1 |
| MD5 | 84e5863bf8241c6e84e140486ec44e85 |
| BLAKE2b-256 | 92e0be011c4d823d8df48816c4deee65d06a3aae49b62e5945c093b2b4d6ad77 |