Skip to main content

A tookit for exploratoriy data analysis.

Project description

Python Support Build Status Coverage Status

edapy is a first resource to analyze a new dataset.

Installation

$ pip install git+https://github.com/MartinThoma/edapy.git

For the pdf part, you also need pdftotext:

$ sudo apt-get install poppler-utils

Usage

$ edapy --help
Usage: edapy [OPTIONS] COMMAND [ARGS]...

  edapy is a tool for exploratory data analysis with Python.

  You can use it to get a first idea what a CSV is about or to get an
  overview over a directory of PDF files.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  csv     Analyze CSV files.
  images  Analyze image files.
  pdf     Analyze PDF files.

The workflow is as follows:

  • edapy pdf find --path . --output results.csv creates a results.csv for you. This results.csv contains meta data about all PDF files in the path directory.
  • edapy csv predict --csv_path my-new.csv --types types.yaml will start / resume a process in which the user is lead through a series of questions. In those questions, the user has to decide which delimiter, quotechar is used and which types the columns have.
  • edapy generates a types.yaml file which can be used to load the CSV in other applications with df = edapy.load_csv(csv_path, yaml_path).

Example types.yaml

For the Titanic Dataset, the resulting types.yaml looks as follows:

columns:
- dtype: other
  name: Name
- dtype: int
  name: Parch
- dtype: float
  name: Age
- dtype: other
  name: Ticket
- dtype: float
  name: Fare
- dtype: int
  name: PassengerId
- dtype: other
  name: Cabin
- dtype: other
  name: Embarked
- dtype: int
  name: Pclass
- dtype: int
  name: Survived
- dtype: other
  name: Sex
- dtype: int
  name: SibSp
csv_meta:
  delimiter: ','

A sample run then would look like this:

$ edapy csv predict --types types_titanik.yaml --csv_path train.csv
Number of datapoints: 891
2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'
2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'
2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'
2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'
2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'
2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). You might want to make it a 'category'

## Integer Columns
Column name: Non-nan  mean   std   min   25%   50%   75%   max
PassengerId:     891  446.00  257.35     1   224   446   668   891
Survived   :     891  0.38  0.49     0     0     0     1     1
Pclass     :     891  2.31  0.84     1     2     3     3     3
SibSp      :     891  0.52  1.10     0     0     0     1     8
Parch      :     891  0.38  0.81     0     0     0     0     6

## Float Columns
Column name: Non-nan   mean    std    min    25%    50%    75%    max
Age        :     714  29.70  14.53   0.42  20.12  28.00  38.00  80.00
Fare       :     891  32.20  49.69   0.00   7.91  14.45  31.00  512.33

## Other Columns
Column name: Non-nan   unique   top (count)
Name       :     891      891   Goldschmidt, Mr. George B (1)
Sex        :     891        2   male (577)
Ticket     :     891      681   347082 (7)
Cabin      :     204      148   C23 C25 C27 (4)
Embarked   :     889        4   S (644)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
edapy-0.2.3-py3-none-any.whl (14.2 kB) Copy SHA256 hash SHA256 Wheel py3 Jun 9, 2018
edapy-0.2.3.tar.gz (13.3 kB) Copy SHA256 hash SHA256 Source None Jun 9, 2018

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page