A python data exploration and analytics package
Project description
rosebud
rosebud
is an open-source tool for pulling CSVs into python, and immediately extracting basic statistics (mean, median, mode, quartile
data, etc.) into variables, and has the capability of plotting these preliminary statistics via seaborn pairplots. rosebud also uses the visual capabilities of missingno to show a visual representation of the missing data within the CSVs.
rosebud is currently a single-handed project, and welcomes the community for contributions.
Installation
Current implementation will require user to download the script into the working directory, and import into script by using
from rosebud import *
rosebud will be installable via the pip package manager in the future.
Required packages:
- os
- numpy
- pandas
- matplotlib.pyplot
- seaborn
- missingno
- warnings
Usage
tablesandstats()
tablesandstats(filepath, show_plots = 'all')
rosebud's tablesandstats()
function is the libraries main function that does the following:
- Import the .csv file into your workspace, and turn the file into a DataFrame.
- Create the statistic variables of the DataFrame, following the pattern name
columnName_indexName
, derived from python'sdf.summary()
function. - Generate a correlation matrix heatmap of the DataFrame.
- Generate a "missing value ratio" grid for your columns, which shows what percentage of the data are null/NaN values.
- Creates a visualization of the missing valuves via missingno's
missingo.matrix()
function
*for the following example, we will be using a file called 'Future_500'
Example:
calling:
tablesandvars("C:/Users/YourName/.../Future_500.csv")
returns:
Rosebud is creating tables and statistics from Future 500...
Measures of Center and Basic Descriptive Statistics of Future 500:
ID Inception Employees Profit
count 500.000000 499.000000 498.000000 4.980000e+02
mean 250.500000 2010.174349 148.610442 6.539474e+06
std 144.481833 3.228211 397.353657 3.869934e+06
min 1.000000 1999.000000 1.000000 1.243400e+04
25% 125.750000 2009.000000 27.250000 3.272074e+06
50% 250.500000 2011.000000 56.000000 6.513366e+06
75% 375.250000 2012.000000 126.000000 9.303951e+06
max 500.000000 2014.000000 7125.000000 1.962453e+07
Feature Data Types of Future 500:
ID int64
Name object
Industry object
Inception float64
Employees float64
State object
City object
Revenue object
Expenses object
Profit float64
Growth object
Feature Corrrelations of Future 500:
Dataset completeness of Future 500:
- Future 500 missing value ratio (percentage):
ID 0.0
Name 0.0
Industry 0.4
Inception 0.2
Employees 0.4
State 0.8
City 0.0
Revenue 0.4
Expenses 0.6
Profit 0.4
Growth 0.2
- Visual representation of missing value ratio:
Tables created:
* Future_500
* Future_500_Normalized
variable example from above process:
print(NAICS_count)
231985.0
tablesandstats() parameters:
-
filepath = the directory of the .csv file
-
show_plots = you can choose which specific plots are shown on screen. takes the following values:
- 'all': show all charts
- 'none': show no charts
- 'heatmap': show correlation grid only
- 'completeness': show missing data visualization only
processfolder()
processfolder(folderpath, show_plots = 'all')
processfolder()
is a wrapper for tablesandstats()
which allows you to perform the tablesandstats()
function on all .csv files in a folder. This function takes in the same parameters as tablesandstats()
, but expects a folder path containing the .csv files, instead of the individual file path.
survey()
survey(filepath, filter_by = 'all', regress = False)
survey()
takes in the file path, normalizes data, and performs pair plotting of the features as determined by correlation grid, stratified by levels of correlation.
The function also prints out the pairwise correlations stratification of the features.
*for the following example, we will be using a file called 'data_numsOnly'
Example:
calling:
survey("C:/Users/YourName/.../data_numsOnly.csv", filter_by = 'strong_pos')
returns:
Rosebud is surveying out the data in data numsOnly (note: graphical scale is derived from a normalized data set)
!! NOTE: data numsOnly contains NaN values, which may affect true correlation value !!
Strong positive correlations:
[['Establishments' 'Average Employment']
['Establishments' 'Total Wage']
['Average Employment' 'Total Wage']]
Weak positive correlations:
[['NAICS' 'Year']]
Features with no correlations:
[['NAICS' 'Establishments']
['NAICS' 'Average Employment']
['NAICS' 'Total Wage']
['NAICS' 'Annual Average Salary']
['Year' 'Establishments']
['Year' 'Average Employment']
['Year' 'Total Wage']
['Year' 'Annual Average Salary']
['Establishments' 'Annual Average Salary']
['Establishments' 'Years Active']
['Average Employment' 'Annual Average Salary']
['Average Employment' 'Years Active']
['Total Wage' 'Annual Average Salary']
['Total Wage' 'Years Active']
['Annual Average Salary' 'Years Active']]
Weak Negative correlations:
[['NAICS' 'Years Active']]
Strong Negative correlations:
[['Year' 'Years Active']]
Pairwise relationship graphs of strong positive correlation features:
survey() parameters:
-
filepath = the directory of the .csv file
-
filter_by = select the stata of correlation you want plotted:
- 'all' = plot all correlations (note: large datasets will take a long time for visualization)
- 'strong_pos' = strong positive correlation
- 'weak_pos' = weak positive correlation
- 'no_corr' No correlation
- 'weak_neg' = weak negative correlation
- 'strong_neg' = strong negative correlation
-
regress = include a best-fit line to the pairplots
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file rosebud-0.1.tar.gz
.
File metadata
- Download URL: rosebud-0.1.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.8
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3e7ea33b393779c309d313d628180240e2392866dbba8c526731f9b3aa392c13 |
|
MD5 | 25845401c2b605885f278d2f59fbb306 |
|
BLAKE2b-256 | c988510d41244ea4a4f91026b9cb8299bd1ef0be9c73b833d14caf05216e6f93 |
File details
Details for the file rosebud-0.1-py3-none-any.whl
.
File metadata
- Download URL: rosebud-0.1-py3-none-any.whl
- Upload date:
- Size: 8.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.8
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 68b9512aad5bcb32de3dcbdc2a199a23bacd9ba2496e4f72b85beae4ad67a413 |
|
MD5 | e83a4e427a63c13484eea98350af1d5b |
|
BLAKE2b-256 | 0ca5ae60bc40f79cf22d4fd709d521b2c6e5d978755efa7cc08226f4552bc4fc |