Toolbox for easy and effective data exploration

EasyExplore

Description:

Toolbox for easy and effective data exploration in Python. It is designed especially for Jupyter notebooks, but it can also be used in any Python module.

Table of Contents:

  1. Installation
  2. Requirements
  3. Introduction
    • Practical Usage
    • Utilities
      • DataImporter
      • DataExporter
    • DataExplorer
    • DataVisualizer
    • TextMiner
  4. Examples

1. Installation:

You can install EasyExplore on any operating system via pip install easyexplore.

2. Requirements:

  • dask>=2.23.0
  • emoji>=0.5.4
  • geojson>=2.5.0
  • googletrans>=3.0.0
  • ipywidgets>=0.5.1
  • joblib>=0.14.1
  • networkx>=2.2
  • nltk>=3.5
  • numpy>=1.18.1
  • pandas>=1.1.0
  • plotly>=4.5.4
  • pyod>=0.7.7.1
  • psutil>=5.5.1
  • scipy>=1.4.1
  • spacy>=2.3.2
  • scikit-learn>=0.23.1
  • sqlalchemy>=1.3.15
  • statsmodels>=0.9.0
  • wheel>=0.35.1
  • xlrd>=1.2.0

3. Introduction:

  • Practical Usage:

EasyExplore is designed as a wrapper that helps data scientists explore data more conveniently and efficiently.

  • Data Importer:

You can easily import data sets from several file formats as well as databases into a Pandas or Dask DataFrame.

  • Data Exporter:

You can easily export data sets from a Pandas DataFrame or other data objects to several file formats or databases.
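
As a rough illustration of the import/export idea, the sketch below reads records from CSV text, writes them to an in-memory SQLite database and reads them back. It uses only the Python standard library and is not the DataImporter or DataExporter API; the helper names read_csv_records, export_to_sqlite and import_from_sqlite are hypothetical.

```python
import csv
import io
import sqlite3

CSV_TEXT = "id,city,price\n1,Berlin,10.5\n2,Paris,12.0\n"


def read_csv_records(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def export_to_sqlite(conn, table, records):
    """Write records into a new table; column names come from the first record."""
    columns = list(records[0].keys())
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in records],
    )
    conn.commit()


def import_from_sqlite(conn, table):
    """Read a whole table back as a list of dicts."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


records = read_csv_records(CSV_TEXT)
connection = sqlite3.connect(":memory:")
export_to_sqlite(connection, "prices", records)
round_trip = import_from_sqlite(connection, "prices")
```

The round trip returns the same records that were parsed from the CSV text, which is the property any importer/exporter pair should preserve.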

  • Data Explorer:

Explore your data set quickly and efficiently using the DataExplorer:

-- Data Typing:

    Check whether the data types reported by Pandas match the actual data types occurring in the data
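
One way to picture such a check: values stored as strings may actually be integers, floats or dates. The sketch below is a standard-library approximation of that idea, not EasyExplore's implementation; infer_real_type and the fixed date format are assumptions made for illustration.

```python
from datetime import datetime


def infer_real_type(values):
    """Guess the type that string values actually represent.

    Returns "integer", "float", "date" or "text".
    Empty strings and None are treated as missing and ignored.
    """
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "text"

    def all_parse(parser):
        # True only if every present value can be parsed without error.
        for value in present:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumption: dates use the ISO-like format YYYY-MM-DD.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


columns = {
    "age": ["23", "41", "", "35"],
    "price": ["10.5", "12", "9.99"],
    "signup": ["2020-01-31", "2020-02-15"],
    "name": ["Ann", "Bob", "42"],
}
real_types = {name: infer_real_type(values) for name, values in columns.items()}
```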

-- Data Health Check:

    Check the health of the data set in order to detect, describe and visualize ...
        ... the amount of missing or invalid data vs. valid observations
        ... the amount of duplicated data
        ... the amount of invariant data
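
The three checks listed above can be sketched in a few lines. This is a minimal standard-library version under stated assumptions (rows as dicts, a fixed set of missing markers); health_check is a hypothetical helper, not part of EasyExplore.

```python
def health_check(rows, missing_markers=(None, "", "NA")):
    """Summarise missing values, duplicated rows and invariant columns."""
    columns = list(rows[0].keys())

    # Share of missing entries per column.
    missing_share = {
        col: sum(row[col] in missing_markers for row in rows) / len(rows)
        for col in columns
    }

    # Rows that repeat an earlier row exactly.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Columns whose non-missing values never vary.
    invariant = [
        col
        for col in columns
        if len({row[col] for row in rows if row[col] not in missing_markers}) <= 1
    ]
    return {
        "missing_share": missing_share,
        "duplicates": duplicates,
        "invariant": invariant,
    }


rows = [
    {"id": 1, "country": "DE", "score": 0.5},
    {"id": 2, "country": "DE", "score": None},
    {"id": 2, "country": "DE", "score": None},
    {"id": 3, "country": "DE", "score": 0.9},
]
report = health_check(rows)
```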

-- Data Distribution:

    Describe and visualize the statistical distribution of ...
        ... categorical features
        ... continuous features
        ... date features

-- Outlier Detection:

    Analyze outliers or anomalies of continuous features using univariate and multivariate methods:
        a) Univariate: Examines outlier values for each feature separately using the interquartile range (IQR)
        b) Multivariate: Examines outliers for each possible feature pair using a range of machine learning algorithms. For further information, see the documentation of the PyOD package, which is used under the hood.
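
For the univariate case, the IQR rule can be sketched as follows. This is a generic standard-library version of the technique, assuming the common fence factor of 1.5; iqr_outliers is a hypothetical helper, and the multivariate PyOD-based methods are not shown.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


feature = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(feature)
```

Here only the value 95 lies outside the fences, so it is the single flagged observation.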

-- Categorical Breakdown Statistics:

    Descriptive statistics of continuous features grouped by the values of each categorical feature in the data set.
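
A breakdown of this kind amounts to a group-by followed by summary statistics. The sketch below shows the idea with the standard library only; breakdown is a hypothetical helper and the chosen statistics are an assumption.

```python
import statistics
from collections import defaultdict


def breakdown(rows, category, feature):
    """Descriptive statistics of `feature` for each value of `category`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row[feature])
    return {
        level: {
            "count": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
        for level, values in groups.items()
    }


rows = [
    {"segment": "A", "revenue": 10.0},
    {"segment": "A", "revenue": 14.0},
    {"segment": "B", "revenue": 3.0},
]
stats = breakdown(rows, "segment", "revenue")
```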


-- Correlation:

    Correlation analysis of continuous features. A partial correlation method is implemented for analyzing multicollinearity. The differences between marginal and partial correlation coefficients are also visualized in a heat map.
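
To make the marginal vs. partial distinction concrete, here is a minimal sketch of a first-order partial correlation (two features, one control variable) using the textbook formula. It is not EasyExplore's implementation; pearson and partial_correlation are hypothetical helpers and the data is made up.

```python
import math


def pearson(x, y):
    """Pearson (marginal) correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# x and y both follow z closely, so their marginal correlation is high.
z = [1, 2, 3, 4, 5, 6]
x = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]
y = [1.1, 2.1, 2.8, 4.2, 4.9, 5.9]

marginal = pearson(x, y)
partial = partial_correlation(x, y, z)
difference = marginal - partial  # the kind of value a difference heat map would show
```

In this toy data the marginal coefficient is close to 1 because both features track z, while the partial coefficient is very different once z is controlled for. Large gaps of this kind are what point to multicollinearity.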

-- Geo Statistics:

    Descriptive statistics of continuous features grouped by the values of each geo feature in the data set. Additionally, a geo map (OpenStreetMap) is generated to visualize the statistical distribution.

-- Text Analyzer:

    Analyze potential text features and generate various numerical features from them

  • Data Visualizer:

Let's make data visualization great again! Visualize your data set easily using Plotly, an interactive visualization library, under the hood. The DataVisualizer is an efficient wrapper that abstracts the most important elements for data exploration:

-- Table Chart:
    Visualize matrix (Pandas DataFrame) as an interactive table

-- Heat Map:
    Visualize value range of continuous features as heat map

-- Geo Map:
    Visualize statistics of categorical and continuous features as interactive OpenStreetMap

-- Contour Chart:
    Visualize value ranges of at least two continuous features as contours

-- Pie Chart:
    Visualize occurrences of values of categorical features as an interactive pie chart

-- Bar Chart:
    Visualize occurrences of values of categorical features as an interactive bar chart

-- Histogram:
    Visualize distribution of continuous features as an interactive histogram

-- Box-Whisker-Plot:
    Visualize descriptive statistics of continuous features as an interactive box-whisker-plot

-- Violin Chart:
    Visualize descriptive statistics of continuous features as an interactive violin chart

-- Parallel Category Chart:
    Visualize relationships between categorical features interactively. With brushing, it can also be used for mixed relations between values of categorical and continuous features.

-- Parallel Coordinate Chart:
    Visualize relationships between ranges of continuous features interactively. It can also be used for mixed relations between values of categorical features and ranges of continuous features.

-- Scatter Chart:
    Visualize values of continuous features interactively.

-- Scatter3D Chart:
    Visualize values of three continuous features in one chart interactively.

-- Joint Distribution Chart:
    Visualize values of two continuous features interactively, including contours and histogram for each continuous feature.

-- Ridgeline Chart:
    Visualize changes in distribution of continuous features on certain time steps separately.

-- Line Chart:
    Visualize distribution after certain time steps as an interactive line chart.

-- Candlestick Chart:
    Visualize descriptive statistics for each time step as an interactive candlestick chart.

-- Dendrogram:
    Visualize hierarchical clusters.

-- Silhouette Chart:
    Visualize partitioned clusters.

  • TextMiner:

Explore text data (natural language) by generating various numerical features describing the text

-- Segmentation:

    Categorize potential text features into the following segments ...
        -> Web features
            1) URL
            2) EMail
        -> Enumerated features
        -> Natural language (original text features)
        -> Identifier (original id features)
        -> Unknown
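
How such a segmentation might work can be sketched with a few crude heuristics. The rules, regular expressions and the name segment_feature below are assumptions made for illustration; the actual TextMiner logic may differ considerably.

```python
import re

URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def segment_feature(values, separator=","):
    """Assign a text column to one coarse segment based on its values."""
    present = [v.strip() for v in values if v and v.strip()]
    if not present:
        return "unknown"
    if all(URL_PATTERN.match(v) for v in present):
        return "url"
    if all(EMAIL_PATTERN.match(v) for v in present):
        return "email"

    # Enumeration: most values are separator-delimited lists of single words.
    items = [item.strip() for v in present for item in v.split(separator)]
    share_with_separator = sum(separator in v for v in present) / len(present)
    if share_with_separator > 0.5 and not any(" " in item for item in items):
        return "enumeration"

    has_whitespace = any(" " in v for v in present)
    # Identifier: unique values without whitespace.
    if not has_whitespace and len(set(present)) == len(present):
        return "id"
    # Natural language: at least some values contain several words.
    if has_whitespace:
        return "natural_language"
    return "unknown"


features = {
    "website": ["https://example.com", "www.example.org/path"],
    "contact": ["ann@example.com", "bob@example.org"],
    "tags": ["red,green", "blue", "green,blue,red"],
    "customer_id": ["A-001", "A-002", "A-003"],
    "review": ["Great product, would buy again.", "Arrived late"],
    "misc": ["", None],
}
segments = {name: segment_feature(values) for name, values in features.items()}
```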

-- Simple text processing:
    Apply simple processing methods to text features
        -> Merge two text features by given separator
        -> Replace occurrences
        -> Subset data set or feature list by given string
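
The three operations above are simple enough to sketch in plain Python. The function names below are hypothetical and only illustrate the behaviour described; they are not the TextMiner method names.

```python
def merge_features(left, right, separator=" "):
    """Merge two text features element-wise using the given separator."""
    return [f"{a}{separator}{b}" for a, b in zip(left, right)]


def replace_occurrences(values, old, new):
    """Replace every occurrence of `old` with `new` in each value."""
    return [v.replace(old, new) for v in values]


def subset_by_string(names, needle):
    """Keep only the feature names that contain the given string."""
    return [n for n in names if needle in n]


full_names = merge_features(["Ann", "Bob"], ["Lee", "Ray"], separator=", ")
cleaned = replace_occurrences(["a-b", "c-d-e"], "-", "_")
text_columns = subset_by_string(["review_text", "price", "title_text"], "_text")
```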

-- Language methods:
    Apply methods to ...
        -> ... detect language in text
        -> ... translate using Google Translate under the hood

-- Generate linguistic features:
    Apply semantic text processing to generate numeric features
        -> Clean text counter (text after removing stop words, punctuation and special characters, and lemmatizing)
        -> Part-of-Speech Tagging counter & labels
        -> Named Entity Recognition counter & labels
        -> Dependencies counter & labels (Tree based / Noun Chunks)
        -> Emoji counter & labels

-- Generate similarity / clustering features:
    Apply similarity methods to generate continuous features using word embeddings
        -> TF-IDF
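
As a minimal sketch of how TF-IDF weights turn text into continuous similarity features, the example below computes the weights by hand and compares documents with cosine similarity. It assumes whitespace tokenisation, tf = count / document length and idf = log(N / document frequency); library implementations typically use smoothed variants, and tfidf and cosine_similarity are hypothetical helper names.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def cosine_similarity(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "fast delivery great price",
    "great price slow delivery",
    "broken screen bad support",
]
vectors = tfidf(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents sharing terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents with no shared terms
```

The first two documents share three terms and therefore get a positive similarity, while the first and third share none and score zero.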

4. Examples:

Check the Jupyter notebook for examples. Happy exploration :)
