Python support for 'The Art and Science of Data Analytics'
AdvancedAnalytics
A collection of Python modules, classes, and methods that simplify building machine learning solutions. It was developed to simplify learning Python, and it accompanies the book The Art and Science of Data Analytics.
Description
Machine learning applications progress through three stages:
Data Preprocessing
Modeling or Analytics
Postprocessing
The classes and methods in AdvancedAnalytics primarily support the first and last stages of machine learning applications.
Surprisingly, data scientists report that they typically spend 80% of their total effort on data preprocessing and postprocessing. The first stage prepares the data for analysis, which includes:
identifying and correcting outliers,
imputing missing values, and
encoding data.
The last stage, solution postprocessing, involves displaying and graphing solution summaries as well as metrics and graphics used to evaluate the quality of the solution.
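The three preprocessing steps listed above can be illustrated with a minimal sketch in plain Python. This is not the AdvancedAnalytics API: the function names, the sample data, and the choice of mean and mode imputation are assumptions made only for illustration.

```python
from collections import Counter
from statistics import mean


def replace_outliers(values, low, high):
    """Mark numeric values outside [low, high] as missing (None)."""
    return [v if v is not None and low <= v <= high else None for v in values]


def impute_mean(values):
    """Fill missing (None) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Fill missing (None) entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 indicator list per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical sample data: 410 is an implausible age, None marks missing values.
ages = [34, 29, None, 410, 42]
colors = ["red", None, "blue", "red", "red"]

clean_ages = impute_mean(replace_outliers(ages, low=0, high=120))
clean_colors = impute_mode(colors)
encoded = one_hot(clean_colors)

print(clean_ages)
print(clean_colors)
print(encoded)
```

Treating an outlier as a missing value first, and imputing afterwards, keeps the outlier from distorting the mean used for imputation.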
Usage
Currently, the most common use is supporting solutions developed with these popular machine learning packages:
scikit-learn
StatsModels
NLTK
Current Modules and Classes
- ReplaceImputeEncode
- Classes for Data Preprocessing
DT defines new data types used in the data dictionary
ReplaceImputeEncode a class for data preprocessing
- Regression
- Classes for Linear and Logistic Regression
linreg support for linear regression
logreg support for logistic regression
stepwise a variable selection class
- Tree
- Classes for Decision Tree Solutions
tree_regressor support for regressor decision trees
tree_classifier support for classification decision trees
- Forest
- Classes for Random Forests
forest_regressor support for regressor random forests
forest_classifier support for classification random forests
- NeuralNetwork
- Classes for Neural Networks
nn_regressor support for regressor neural networks
nn_classifier support for classification neural networks
- TextAnalytics
- Classes for Text Analytics
text_analysis support for topic analysis
sentiment_analysis support for sentiment analysis
- Internet
- Classes for Internet Applications
scrape support for web scraping
metrics a class for solution metrics
Documentation and Examples
The API and documentation for all classes and examples are available at https://github.com/tandonneur/AdvancedAnalytics .
Installation and Dependencies
AdvancedAnalytics is designed to work on any operating system running Python 3. It can be installed using pip or conda.
pip install AdvancedAnalytics
# or
conda install -c conda-forge AdvancedAnalytics
- General Dependencies
Most classes import one or more modules from scikit-learn (referenced as sklearn in module imports) and StatsModels. Both are installed with current versions of Anaconda, a popular distribution for coding Python solutions.
- Decision Tree and Random Forest Dependencies
The Tree and Forest modules plot decision trees and importance metrics using the pydotplus and graphviz packages. If these are not installed and you plan to use the Tree or Forest modules, they can be installed using the following commands.
conda install -c conda-forge pydotplus
conda install -c conda-forge graphviz
pip install graphviz
Note that the second conda install does not complete the installation of the graphviz package. To complete it, run the pip install after the conda graphviz install.
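Before using the Tree or Forest modules, it can help to confirm that these optional packages are visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and checking for the `dot` executable assumes that Graphviz exposes its command-line tools on the PATH.

```python
import shutil
from importlib.util import find_spec

# Optional packages used for plotting decision trees and importance metrics.
TREE_PLOT_MODULES = ("pydotplus", "graphviz")


def missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(TREE_PLOT_MODULES)
if missing:
    print("Missing Python packages:", ", ".join(missing))
else:
    print("pydotplus and graphviz are importable.")

# The Python graphviz package also relies on the Graphviz executables.
if shutil.which("dot") is None:
    print("The Graphviz 'dot' executable was not found on the PATH.")
```

An importable package does not guarantee that plotting will work, which is why the sketch also looks for the `dot` executable.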
- Text Analytics Dependencies
The TextAnalytics module is based on the NLTK and scikit-learn text analytics packages. Both are installed with the current version of Anaconda.
However, TextAnalytics includes options to produce word clouds, which are graphic displays of the word collections associated with topic or data clusters. The wordcloud package is used to produce these graphs. If you are using the TextAnalytics module, you can install the wordcloud package with the following command.
conda install -c conda-forge wordcloud
In addition, data used by the NLTK package is not automatically installed with this package. These data include the text dictionary and other data tables.
The following nltk.download commands should be run before using TextAnalytics. However, it is only necessary to run these once to download and install the data NLTK uses for text analytics.
# The following NLTK commands should be run once to
# download and install NLTK data.
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")
nltk.download("wordnet")
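The one-time download can also be wrapped in a small helper that reports which resources were fetched successfully. This is only a sketch: the function name `download_nltk_data` and its `downloader` parameter are hypothetical, and the resource names assume the standard NLTK identifiers (including the spelling `averaged_perceptron_tagger`).

```python
# Data sets NLTK needs for tokenizing, tagging, stop words, and lemmatizing.
NLTK_RESOURCES = ("punkt", "averaged_perceptron_tagger", "stopwords", "wordnet")


def download_nltk_data(resources=NLTK_RESOURCES, downloader=None):
    """Download each resource and return {resource_name: success_flag}.

    downloader defaults to nltk.download; passing another callable makes
    the helper usable without NLTK or a network connection.
    """
    if downloader is None:
        import nltk  # imported lazily so this function can be defined without NLTK

        downloader = nltk.download
    return {name: bool(downloader(name)) for name in resources}
```

Calling `download_nltk_data()` with no arguments runs the four downloads through `nltk.download`; any resource mapped to `False` in the returned dictionary did not download.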
Code of Conduct
Everyone interacting in the AdvancedAnalytics project’s codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct: https://www.pypa.io/en/latest/code-of-conduct/ .
Hashes for AdvancedAnalytics-0.3.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7dc13daa1793cad2c2c1e88b3c8e42d8aed6c366a4e930f03e10fabe5ce07e32
MD5 | 791621f5492b2b37f0f6528b845b3edf
BLAKE2b-256 | c08268e2437b203ef31b22d3446233c56609e8cd7f19d9662fd993ab7d76f1da