Skip to main content

A tookit for exploratoriy data analysis.

Project description

PyPI version Python Support Build Status Coverage Status Code style: black GitHub last commit GitHub commits since latest release (by SemVer) CodeFactor

edapy is a first resource to analyze a new dataset.

Installation

$ pip install git+https://github.com/MartinThoma/edapy.git

For the pdf part, you also need pdftotext:

$ sudo apt-get install poppler-utils

Usage

$ edapy --help
Usage: edapy [OPTIONS] COMMAND [ARGS]...

  edapy is a tool for exploratory data analysis with Python.

  You can use it to get a first idea what a CSV is about or to get an
  overview over a directory of PDF files.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  csv     Analyze CSV files.
  images  Analyze image files.
  pdf     Analyze PDF files.

The workflow is as follows:

  • edapy pdf find --path . --output results.csv creates a results.csv for you. This results.csv contains meta data about all PDF files in the path directory.
  • edapy csv predict --csv_path my-new.csv --types types.yaml will start / resume a process in which the user is lead through a series of questions. In those questions, the user has to decide which delimiter, quotechar is used and which types the columns have.
  • edapy generates a types.yaml file which can be used to load the CSV in other applications with df = edapy.load_csv(csv_path, yaml_path).

Example types.yaml

For the Titanic Dataset, the resulting types.yaml looks as follows:

columns:
- dtype: other
  name: Name
- dtype: int
  name: Parch
- dtype: float
  name: Age
- dtype: other
  name: Ticket
- dtype: float
  name: Fare
- dtype: int
  name: PassengerId
- dtype: other
  name: Cabin
- dtype: other
  name: Embarked
- dtype: int
  name: Pclass
- dtype: int
  name: Survived
- dtype: other
  name: Sex
- dtype: int
  name: SibSp
csv_meta:
  delimiter: ','

A sample run then would look like this:

$ edapy csv predict --types types_titanik.yaml --csv_path train.csv
Number of datapoints: 891
2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'
2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'
2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'
2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'
2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'
2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). You might want to make it a 'category'

## Integer Columns
Column name: Non-nan  mean   std   min   25%   50%   75%   max
PassengerId:     891  446.00  257.35     1   224   446   668   891
Survived   :     891  0.38  0.49     0     0     0     1     1
Pclass     :     891  2.31  0.84     1     2     3     3     3
SibSp      :     891  0.52  1.10     0     0     0     1     8
Parch      :     891  0.38  0.81     0     0     0     0     6

## Float Columns
Column name: Non-nan   mean    std    min    25%    50%    75%    max
Age        :     714  29.70  14.53   0.42  20.12  28.00  38.00  80.00
Fare       :     891  32.20  49.69   0.00   7.91  14.45  31.00  512.33

## Other Columns
Column name: Non-nan   unique   top (count)
Name       :     891      891   Goldschmidt, Mr. George B (1)
Sex        :     891        2   male (577)
Ticket     :     891      681   347082 (7)
Cabin      :     204      148   C23 C25 C27 (4)
Embarked   :     889        4   S (644)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

edapy-0.4.1.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

edapy-0.4.1-py3-none-any.whl (16.6 kB view details)

Uploaded Python 3

File details

Details for the file edapy-0.4.1.tar.gz.

File metadata

  • Download URL: edapy-0.4.1.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10

File hashes

Hashes for edapy-0.4.1.tar.gz
Algorithm Hash digest
SHA256 3aa31bcb3081d758c57683c7da4b1790a4f3b12966945ce3689cd62394e3fe9a
MD5 e41b2c672739c56f073a93da29848085
BLAKE2b-256 3a982ae9fb5287f210e9f97f50c7a5455f21459892b8fe055aaac8b9ff1c1bc5

See more details on using hashes here.

File details

Details for the file edapy-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: edapy-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10

File hashes

Hashes for edapy-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 9f1b01e9722175d53e8db08c695c280ab8739486ef96fe73263b29620a16095c
MD5 2a4c86532649963c5b30ef8f7a22f608
BLAKE2b-256 6a37e8dda30a0cd9f4f7186ec32a81073b611ac8146ce9e3ea181c6a999cfb09

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page