Automatically generate a PDF report containing feature importance, baseline modelling, spurious correlation detection, and more, from a single command line input
The Data FIBS Reporter
The Data (Feature Importance, Baseline-modeller and Spurious correlation) Reporter (FIBS) is open-source software that automatically generates a PDF report highlighting and visualising potential sources of spurious correlation within any given tabular or audio dataset stored as a Comma Separated Values (CSV) file. FIBS is run through a single command line command; all of the calculations, model training, and report generation happen automatically.
All that is required as input on the command line is the path to the CSV file containing the data, and the name of the output (dependent) variable within the dataset. The toolkit will automatically determine whether the task is regression or classification. Optionally, the toolkit can process and extract audio data, provided the name of the variable within the CSV that contains the audio file path for each observation is specified.
Key features that are generated automatically:
- A traffic light score for potential spurious correlations within the dataset
- Calculation of four different feature importance metrics to highlight the most important features within the given dataset
- Training and evaluation of two baseline models, including visualisation of model results
- Visuals of the most important features, with different visuals depending on the variable types
- Automatic determination of regression or classification task, resulting in different baseline models, feature extraction methods, and visualisations
- Principal Component Analysis calculation and baseline model to estimate complexity within the dataset
- (Optionally) extract audio data features and run the above on these features
- Output all of the above in a PDF report with accompanying dynamic textual explanations
Installation
The easiest way to install FIBS is by running the pip command:
pip install fibs-reporter
Note that FIBS requires Python 3.7 or later.
Usage
It is assumed that the data to be analysed is contained within a CSV file, where each column contains a variable, and each row contains an observation.
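As a minimal sketch of this layout, the following Python snippet writes a small CSV with one column per variable and one row per observation. The column names (`age`, `height_cm`, `target`), the values, and the file name `example.csv` are made up for illustration and are not taken from FIBS.

```python
import csv

# Hypothetical example data: each dict is one observation (row),
# each key is one variable (column). "target" is the output variable.
observations = [
    {"age": 34, "height_cm": 172.5, "target": "yes"},
    {"age": 51, "height_cm": 160.0, "target": "no"},
    {"age": 27, "height_cm": 181.2, "target": "yes"},
]


def write_csv(path, rows):
    """Write rows to a CSV file with a header line of variable names."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


write_csv("example.csv", observations)
```

The resulting file has a header row followed by three observations, and `target` would be the output variable name passed on the command line.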
If the CSV file path is path_to_my_csv_file and the output variable is called target then the following command line command will output a report called data_report.pdf in the current working directory:
fibs path_to_my_csv_file target --report_name data_report
If you wish to use an audio dataset, then a column in the CSV file must contain the file path to the raw WAV audio recording for each observation. If this column is called filename then the following command line command would be used, where the other inputs are as above:
fibs path_to_my_csv_file target -audio filename --report_name data_report
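Before running FIBS on an audio dataset, it may be worth checking that every path in the audio column points at an existing WAV file. The sketch below is a user-side pre-check and is not part of FIBS; the helper name `find_bad_audio_paths` is hypothetical, and the assumption that valid recordings end in `.wav` is ours rather than something the documentation states.

```python
import csv
import os


def find_bad_audio_paths(csv_path, audio_col="filename"):
    """Return (row_number, path) pairs whose audio path is missing or not a .wav file.

    Hypothetical helper: row numbers count observations, starting at 1
    for the first row after the header.
    """
    problems = []
    with open(csv_path, newline="") as handle:
        for row_number, row in enumerate(csv.DictReader(handle), start=1):
            path = row[audio_col]
            # Assumption: recordings use a .wav extension and exist on disk.
            if not path.lower().endswith(".wav") or not os.path.isfile(path):
                problems.append((row_number, path))
    return problems
```

An empty list means every observation has a usable path; otherwise the returned pairs show which rows to fix.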
Another command line argument, --test, can be used to specify an additional CSV file which is treated as the test dataset. This CSV file must contain the same variables as the other CSV file, and is used to evaluate baseline model performance. Feature importance scores are calculated on the original CSV (training set) only.
For example, if the test CSV is stored at path_to_test_csv, and the other arguments are as above in the audio case, then the following command line input would be used:
fibs path_to_train_csv target -audio filename --test path_to_test_csv --report_name data_report
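If FIBS is to be launched from a Python script rather than typed by hand, the commands above can be assembled as an argument list. This is a sketch only: `build_fibs_command` is a hypothetical helper, not part of FIBS, and it uses only the flags shown in the examples above.

```python
def build_fibs_command(csv_path, target, report_name, audio_col=None, test_csv=None):
    """Assemble the argument list for a fibs call (hypothetical helper)."""
    command = ["fibs", csv_path, target]
    if audio_col is not None:
        command += ["-audio", audio_col]
    if test_csv is not None:
        command += ["--test", test_csv]
    command += ["--report_name", report_name]
    return command


command = build_fibs_command(
    "path_to_train_csv",
    "target",
    "data_report",
    audio_col="filename",
    test_csv="path_to_test_csv",
)
print(" ".join(command))

# To actually run it (requires fibs to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems when paths contain spaces.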
The full list of optional command line arguments is as follows:
- --help: displays a full list and description of required and optional command line arguments.
- --audio_file_col_name AUDIO_COLUMN_NAME or -audio AUDIO_COLUMN_NAME: used to specify the name of the column in the CSV file containing the filepaths to raw audio WAV recordings for each observation.
- --audio_size AUDIO_SIZE: AUDIO_SIZE must be one of either "small" or "large". Must be used in conjunction with the -audio command. Used to control the number of audio features extracted by openSMILE. By default, this is equal to "large", meaning that over 6000 audio features are extracted from each audio file. If "small" is specified, only 64 features will be extracted from each audio file.
- --test TEST_CSV_FILE_PATH: used to specify the filepath to a test CSV file. Test CSV file must contain identical column names to the original CSV file. Data in the test CSV file will be used to evaluate baseline model performance. If not provided, the original CSV is shuffled and split 80/20 for a train/test split.
- --report_name REPORT_NAME: used to name the PDF file output by the software. For example, --report_name data_report would name the file 'data_report.pdf'.
- --max_categories MAX_CATEGORIES or -max_cat MAX_CATEGORIES: MAX_CATEGORIES must be a positive integer, and is used to specify the maximum number of categories in integer categorical variables. By default, any integer variable that has more than 50 unique values will be assumed to be a continuous variable.
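Two of the defaults described in this list, the 80/20 shuffled split used when no test CSV is given and the 50-unique-values rule for integer variables, can be illustrated with a short sketch. This is not the FIBS implementation: the function names, the fixed random seed, and the rounding of the test set size are assumptions made for the example.

```python
import random


def shuffle_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # Assumption: a fixed seed is used here only to make the sketch repeatable.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def treat_as_categorical(values, max_categories=50):
    """Return True if an integer column has at most max_categories unique values."""
    return len(set(values)) <= max_categories


train, test = shuffle_split(range(100))
print(len(train), len(test))                   # 80 20
print(treat_as_categorical([0, 1, 2] * 10))    # True: 3 unique values
print(treat_as_categorical(list(range(51))))   # False: 51 unique values
```

Under this reading, raising MAX_CATEGORIES makes more integer columns count as categorical, while lowering it pushes them towards being treated as continuous.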
Acknowledgments
FIBS makes use of the following open-source Python packages:
- audiofile for extracting sample rate of audio recordings
- fpdf2 for building the PDF report
- matplotlib & seaborn for creating graphs for the report
- numpy for general data handling
- openSMILE for extracting complex audio features
- pandas for loading the CSV files and general data handling
- pillow for converting the graphs to images to use in the report
- pydub for extracting loudness and duration audio features
- scipy for calculating test statistics for feature ranking
- scikit-learn for variable encoding, model training and feature ranking, model evaluation, and PCA
- xgboost for XGBoost modelling and feature ranking
License
The software has been released under a license that is free for non-commercial use. For details, please consult the LICENSE.txt file.