A toolkit for exploratory data analysis.
edapy is a first resource for analyzing a new dataset.
Installation
$ pip install git+https://github.com/MartinThoma/edapy.git
For the PDF part, you also need pdftotext:
$ sudo apt-get install poppler-utils
Usage
$ edapy --help
Usage: edapy [OPTIONS] COMMAND [ARGS]...
edapy is a tool for exploratory data analysis with Python.
You can use it to get a first idea what a CSV is about or to get an
overview over a directory of PDF files.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
csv Analyze CSV files.
images Analyze image files.
pdf Analyze PDF files.
The workflow is as follows:
edapy pdf find --path . --output results.csv
creates a results.csv for you. This file contains metadata about all PDF files in the path directory.
edapy csv predict --csv_path my-new.csv --types types.yaml
starts or resumes a process in which the user is led through a series of questions. In those questions, the user decides which delimiter and quotechar are used and which types the columns have. edapy then generates a types.yaml file, which can be used to load the CSV in other applications with df = edapy.load_csv(csv_path, yaml_path).
Example types.yaml
For the Titanic dataset, the resulting types.yaml looks as follows:
columns:
- dtype: other
  name: Name
- dtype: int
  name: Parch
- dtype: float
  name: Age
- dtype: other
  name: Ticket
- dtype: float
  name: Fare
- dtype: int
  name: PassengerId
- dtype: other
  name: Cabin
- dtype: other
  name: Embarked
- dtype: int
  name: Pclass
- dtype: int
  name: Survived
- dtype: other
  name: Sex
- dtype: int
  name: SibSp
csv_meta:
  delimiter: ','
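To illustrate how a types file like the one above might be applied when reading a CSV, here is a minimal sketch using only the Python standard library. It is not edapy's implementation: the function load_typed_rows, the CASTS mapping, and the handling of empty cells are assumptions made for this example, and the types are written as a Python dict that mirrors part of the YAML so that no YAML parser is needed.

```python
import csv
import io

# Mirrors part of the types.yaml structure shown above, written as a Python
# dict so that this sketch needs no YAML parser.
TYPES = {
    "columns": [
        {"dtype": "int", "name": "PassengerId"},
        {"dtype": "float", "name": "Age"},
        {"dtype": "other", "name": "Sex"},
    ],
    "csv_meta": {"delimiter": ","},
}

# Hypothetical mapping from dtype names to converters; "other" stays a string.
CASTS = {"int": int, "float": float, "other": str}


def load_typed_rows(handle, types):
    """Read CSV rows and cast each listed column; empty cells become None."""
    casts = {col["name"]: CASTS[col["dtype"]] for col in types["columns"]}
    reader = csv.DictReader(handle, delimiter=types["csv_meta"]["delimiter"])
    rows = []
    for raw in reader:
        row = {}
        for name, cast in casts.items():
            value = raw[name]
            row[name] = cast(value) if value != "" else None
        rows.append(row)
    return rows


# Two made-up rows, only for demonstration.
sample = io.StringIO("PassengerId,Age,Sex\n1,22,male\n2,,female\n")
rows = load_typed_rows(sample, TYPES)
print(rows)
```

In practice, the documented edapy.load_csv(csv_path, yaml_path) call is the intended way to load the data; the sketch only shows the idea of driving column conversion from a declarative type list.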
A sample run would then look like this:
$ edapy csv predict --types types_titanik.yaml --csv_path train.csv
Number of datapoints: 891
2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'
2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'
2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'
2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'
2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'
2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). You might want to make it a 'category'
## Integer Columns
Column name: Non-nan    mean     std  min  25%  50%  75%  max
PassengerId:     891  446.00  257.35    1  224  446  668  891
Survived   :     891    0.38    0.49    0    0    0    1    1
Pclass     :     891    2.31    0.84    1    2    3    3    3
SibSp      :     891    0.52    1.10    0    0    0    1    8
Parch      :     891    0.38    0.81    0    0    0    0    6
## Float Columns
Column name: Non-nan   mean    std   min    25%    50%    75%     max
Age        :     714  29.70  14.53  0.42  20.12  28.00  38.00   80.00
Fare       :     891  32.20  49.69  0.00   7.91  14.45  31.00  512.33
## Other Columns
Column name: Non-nan  unique  top (count)
Name       :     891     891  Goldschmidt, Mr. George B (1)
Sex        :     891       2  male (577)
Ticket     :     891     681  347082 (7)
Cabin      :     204     148  C23 C25 C27 (4)
Embarked   :     889       4  S (644)
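The 'category' warnings in the sample run flag columns that have only a few distinct values. A check of that kind could be sketched as follows. This is not edapy's actual code: the function name low_cardinality_columns and the max_unique threshold are assumptions, and the rows are toy data made up for the example.

```python
def low_cardinality_columns(rows, max_unique=10):
    """Return {column: distinct values in first-seen order} for every column
    that has at most max_unique distinct values."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            # A dict keeps insertion order, so values stay in first-seen order.
            seen.setdefault(name, {})[value] = None
    return {
        name: list(values)
        for name, values in seen.items()
        if len(values) <= max_unique
    }


# Toy rows, not taken from the real dataset.
rows = [
    {"Survived": 0, "Pclass": 3, "Fare": 7.25},
    {"Survived": 1, "Pclass": 1, "Fare": 71.28},
    {"Survived": 1, "Pclass": 3, "Fare": 7.92},
    {"Survived": 0, "Pclass": 2, "Fare": 8.05},
]
candidates = low_cardinality_columns(rows, max_unique=3)
for name, values in candidates.items():
    print(f"Column '{name}' has only {len(values)} different values ({values}).")
```

With the toy rows, Survived and Pclass are reported while Fare is not, because all four of its values differ.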