Cassava is a package for reading, plotting and quality-checking CSV files.
Project description
cassava
Cassava is a package for reading, plotting and quality-checking CSV files. Its primary purpose is for giving a quick, first assessment of a CSV file, highlighting common quality issues such as wholly empty columns or rows, differing column counts and basic outlier detection. The package can be integrated as part of a larger workflow, or used directly from the command line with a simple but functional command line interface (CLI).
Install
The package can be installed from PyPI (note the package distribution name):
$ pip install cassava-csv
From the command line
The cassava CLI runs in a number of modes. The main commands are plot
, to visually inspect a file, and print
, to print its findings to stdout
. For each of these commands, there are two subcommands; qc
for producing a QC plot or report, and stats
for producing a summary statistics plot or report. There are then many options to specify how to read and process the CSV file, e.g., whether it has a header row, which column to use for the x-axis in the plot, which columns to use for the y-axis in the plot etc.
Synopsis
The general usage is to call the cassava package main, followed by any options, then the command (one of plot
, print
), a subcommand (one of qc
, stats
) and finally the input CSV file. Note that the subcommand is optional and if not supplied, will default to qc
.
$ python -m cassava [opts] command [subcommand] input.csv
Specifying the --help
option, will print the CLI usage and quit. To get help on a given command, specify the --help
option after that command. For example:
$ python -m cassava plot --help
Note that the options are global to all modes (commands and subcommands), even when some only really make sense for a given mode. This is by design, and to make for an uncomplicated CLI, which lends itself well to just pressing "up arrow" in the shell and using the same options but in a different mode.
Options
-h, --help show this help message and exit
-e ENCODING, --encoding ENCODING
character encoding of the input file
-H HEADER_ROW, --header-row HEADER_ROW
row containing the header
-i FIRST_DATA_ROW, --first-data-row FIRST_DATA_ROW
first row containing data to plot
-c COMMENT, --comment COMMENT
comment character that introduces a file header
section that is to be skipped, then set header row and
first data row to follow this
-C, --common-header-row
shorthand for -H 0 -i 1, as this is such a commonplace
configuration
-x XCOL, --x-column XCOL
column containing values for the x-axis
-y YCOL, --y-column YCOL
column containing values for the y-axis (specify
multiple columns separated by commas and/or as ranges
to plot multiple curves on y-axis)
-d, --x-as-datetime treat the x-axis values as datetimes
-f DATETIME_FORMAT, --datetime-format DATETIME_FORMAT
datetime format specification
-m MISSING_VALUE, --missing-value MISSING_VALUE
value to be treated as missing data
-l DELIMITER, --delimiter DELIMITER
alternative delimiter
-s, --skip-initial-space
ignore whitespace immediately following the delimiter
-F, --forgive be forgiving when parsing numeric data
-N NCOLS, --plot-in-n-columns NCOLS
number of columns for a multi-plot grid
-k K, --tukey-fence-factor K
factor to multiply IQR by in Tukey's rule
-O, --hide-outliers don't show outliers on stats plots
-P PLOT_OPTS, --plot-options PLOT_OPTS
options for the plot, specified as a simple JSON
object
-S, --scatter-plot set plot options (see -P) to produce a scatter plot
-v, --verbose emit verbose messages
-V, --version show program's version number and exit
Examples
Given the following CSV file:
Datetime,Temperature,Relative_Humidity,Sea_Level_Pressure,Wind_Speed,,,
1999-12-20T00:00:00,-10,40,990,,,,
1999-12-21T00:00:00,-11,51,971,,,,
1999-12-22T00:00:00,-10,62,952,17,,,
1999-12-23T00:00:00,-10,56,956,,,,
1999-12-24T00:00:00,-12,78,,,,,
1999-12-25T00:00:00,-11,77,995,,,,
1999-12-26T00:00:00,-11,70,986,,,,
1999-12-27T00:00:00,-12,47,966,22,,,
1999-12-28T00:00:00,-11,48,990,230,,,
1999-12-29T00:00:00,-11,57,967,25,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,
,,,,,,,,,,,
we can see a number of issues. It's likely that this was exported from a spreadsheet, as it has a number of trailing empty columns, and trailing empty rows. In addition, there's been some subsequent "finger trouble", as the last two empty rows have differing column counts. There are wholly empty columns, as well as some sparse, but not wholly empty columns, and at least one highly probable outlier.
Often a quick plot will tell us a lot about our data, so let's try that. If our CSV file contained only numeric columns, with no header, then we could simply do:
$ python -m cassava plot data.csv
Failed to convert column 0 at row 0 with float: ['Datetime', 'Temperature', 'Relative_Humidity', 'Sea_Level_Pressure', 'Wind_Speed', '', '', '']. Cause: could not convert string to float: 'Datetime'
however in this case, the default y-axis column (0) contains ISO datetime strings, and the first row is a header row, so cassava will raise the above exception.
Note that by default, cassava will return focused, colour-coded exception messages, with the data context where appropriate. To see the full, chained exceptions including the tracebacks (as per normal Python exception reporting), run cassava in verbose mode (-v
).
At this point, we could replace the plot
command with print
, and get some useful QC information about the file, as the print
command doesn't need to interpret the data as much as the plot
command does.
A key feature of cassava is that you can run it in "forgive" mode (-F
), where it will simply replace any invalid numeric values with a placeholder value (the default is NaN). By default the forgive mode is False
, otherwise it would mask quality issues with the data, and so defeat the central point of cassava! However, in cases where we just want to get a first look at the data, the forgive mode is invaluable. So let's do that:
$ python -m cassava -F plot data.csv
This produces a plot, but it's empty! This is because the default y-axis column contains datetime strings, and so the forgive mode has skipped over all of them. Let's specify the y-axis data (-y
) as column 1:
$ python -m cassava -F -y 1 plot data.csv
No handles with labels found to put in legend.
That works, and produces a line plot of the Temperature
data. This is a minimal working command line for plotting this file, but we can do better!
Let's provide a few more options for processing. First we can address the notification that matplotlib
emitted about no labels for the legend. We can tell cassava that the first row (row 0 - all cassava coordinates have origin zero) is a header row (-H 0
) and the first data row is the next row (-i 1
). Having these as two separate options gives us flexibility for cases where the CSV file may have a complex structured header section. However, as this is such a commonplace header configuration, there is a shorthand option (-C
), which is identical to -H 0 -i 1
. Furthermore, we note that this is a timeseries, so we can provide options to cassava so that it can treat it as such. We specify the x-axis data as column 0 (-x 0
), and tell cassava to treat the x-axis as a datetime column (-d
). The datetime format is ISO 8601, which is the cassava default, so we don't need to specify the datetime format. Trying this also raises an exception:
$ python -m cassava -H 0 -i 1 -x 0 -d -y 1 plot qc data.csv
Failed to convert column 0 at row 11 with strptime: ['', '', '', '', '', '', '', '']. Cause: time data '' does not match format '%Y-%m-%dT%H:%M:%S'
This is valuable QC information, as it tells us exactly the row that failed to parse as a datetime (again, we can run in verbose mode to see the chained tracebacks). Particularly useful if the file is large. At this point we could address the issue directly, by editing the CSV file accordingly (in this example, by removing the empty trailing rows), or just run cassava without specifying the x-axis data (cassava then defaults to integer indices). Let's do the latter:
$ python -m cassava -H 0 -i 1 -y 1 plot qc data.csv
Failed to convert column 1 at row 11 with float: ['', '', '', '', '', '', '', '']. Cause: could not convert string to float: ''
More valuable QC information! The empty rows at the bottom of the file cause this exception. Let's reintroduce the forgive option. This allows us to get on with evaluating the data, but of course eventually, we will remove those empty rows.
$ python -m cassava -H 0 -i 1 -y 1 -F plot qc data.csv
This gives us a nice working command line. Now let's plot all the numeric columns. We can specify multiple columns for the y-axis by giving a comma-separated list and/or an inclusive range - the following are all equivalent:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F plot qc data.csv
$ python -m cassava -H 0 -i 1 -y 1,2,3-4 -F plot qc data.csv
$ python -m cassava -H 0 -i 1 -y 1,2-4 -F plot qc data.csv
$ python -m cassava -H 0 -i 1 -y 1-4 -F plot qc data.csv
This works, but as the Sea_Level_Pressure
values are far greater than the other columns, it's not easy to pick out the detail. We could drop the Sea_Level_Pressure
column from the y-axis list (-y 1,2,4
). This is an improvement, but the outlier in Wind_Speed
is now causing problems. In cases where your data are of greatly differing scales, it's better to plot multiple curves on separate plots. This can be achieved using the -N NCOLS
option, which tells cassava to plot a grid of NCOLS-wide plots:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -N 2 plot qc data.csv
This plots a 2x2 grid of plots, with each variable in its own plot, with a suitably-scaled y-axis.
Another case where you may find large values are obscuring the detail, is where missing values in the data have been specified by a value that is outside of the data domain - for example -999
. In such a case, we can tell cassava to treat these values as missing data and replace them with NaN. This will then mean that they are not shown when plotting the data:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -m -999 plot qc data.csv
If we have isolated points that are separated by NaN values (either direct NaN values or converted missing values), then the default plot type (a line plot) may not show these points, as matplotlib cannot draw a line segment between them. In such a case, we can ask cassava to produce a scatter plot (-S
) instead of a line plot:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -m -999 -S plot qc data.csv
This scatter-plot option is actually a convenience option for the more general purpose plot-options option (-P
), which takes a valid set of matplotlib plot options as a simple JSON object, and then converts it into a Python dict
. Note that as it's a JSON object, you must use double-quotes for the keys and any string values. For example, the scatter-plot option above can equivalently be specified as:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -m -999 -P '{"marker": ".", "ls": ""}' plot qc data.csv
For any of the working command lines above, we could replace the qc
subcommand with the stats
subcommand, to get summary statistics plots for the specified y-columns. Let's do that for the 2x2 grid command line. The first thing to note, is that the -N 2
option is not required for stats plots. However, as noted above, it doesn't hurt to leave it there and thus allows for rapid tweak/repeat cycles:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -N 2 plot stats data.csv
This produces three plots for each specified y-column: a density plot of the distribution of the data, a line plot of the data including bounds to highlight potential outliers, and a boxplot of the data. In this example, the outlier in the Wind_Speed
is clearly identified. If an outlier is so large that it dominates the remaining data, then we can instruct cassava to not show outliers (-O
), thereby revealing the detail:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -N 2 -O plot stats data.csv
Similarly, we can replace the plot
command with the print
command. Let's do that (again, no need to remove extraneous options):
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -N 2 -O print qc data.csv
Column counts:
first row 1: ncols = 8
row 14: ncols = 6
row 15: ncols = 12
Row counts:
total rows = 16, data rows = 15
Empty columns:
column 5 is empty
column 6 is empty
column 7 is empty
Empty rows:
row 11 is empty
row 12 is empty
row 13 is empty
row 14 is empty
row 15 is empty
which produces the above QC report. The output is colour-coded using a traffic light system, thereby highlighting quality issues. For the Column counts
section, only those rows which have differing column counts to the first data row are listed, so ideally in good data, you would only see the column count of the first data row. Running the above in verbose mode (-v
) would list all column counts, irrespective of whether they agree with the first data row or not.
If the encoding of the file is UTF-8 and it begins with a Byte Order Mark (BOM), then a warning is emitted at the top of the QC report:
The effective encoding is utf-8 and the input begins with an unneccessary Byte Order Mark (BOM). This BOM is present in cell (0,0) of the stored input. If this is undesirable, either remove the BOM or specify the encoding as utf-8-sig (see the --encoding option)
We can also print summary statistics for the specified columns, and list any cells that contain suspected outlier values:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -N 2 print stats data.csv
Column stats:
column min mean max q1 median q3 std
1 -12 -11 -10 -11 -11 -10 0.7
2 40 59 78 49 56 68 12
3 9.5e+02 9.7e+02 1e+03 9.7e+02 9.7e+02 9.9e+02 15
4 17 74 2.3e+02 21 24 76 90
Column outliers (1.5 * IQR):
column,row value
4,9 2.3e+02
Similarly to when plotting, we can turn off the outliers table (-O
). This is useful if there are so many outliers that the column stats table scrolls off the top of the screen:
$ python -m cassava -H 0 -i 1 -y 1,2,3,4 -F -N 2 -O print stats data.csv
Column stats:
column min mean max q1 median q3 std
1 -12 -11 -10 -11 -11 -10 0.7
2 40 59 78 49 56 68 12
3 9.5e+02 9.7e+02 1e+03 9.7e+02 9.7e+02 9.9e+02 15
4 17 74 2.3e+02 21 24 76 90
As noted above, being able to separately specify the header row and the first data row gives us flexibilty when given a CSV file that may have a complex structured header section. A fairly common use case though, is where the CSV file has an extended file header section (often not comma-separated) that is introduced by some form of comment character. As a convenience, we can tell cassava to skip over this file header section and then automatically set the column header row to be the first row following this file header section, and the first data row to be the next row. We do this by specifying a comment character (-c
).
XCSV is a file format that contains an extended file header section, introduced by the #
character and containing key/value pairs, and followed by a CSV table with a column header row and data rows.
Given the following XCSV file:
# id: 1
# title: The title
# summary: This dataset...
# authors: A B, C D
id,count
0,70
1,33
2,3
3,22
4,1
5,34
6,80
7,38
8,47
9,89
we can set the comment character (-c
) and cassava will automatically set conf['header_row'] = 4
and conf['first_data_row'] = 5
for us. Note that here, we need to quote the comment character as it is also the shell's comment character.
$ python -m cassava -c '#' print qc xcsv.csv
Column counts:
first row 5: ncols = 2
Row counts:
total rows = 15, data rows = 10
Empty columns:
Empty rows:
As noted earlier, if the file begins with an unnecessary BOM, a warning is emitted at the top of the QC report. However, when skipping a file header section, cassava silently ignores the presence of any BOM, otherwise it would fail to match the comment character on the first line of the file and so fail to process the file header section correctly.
A note on encodings
The default character set encoding used to read an input file is UTF-8. For input files that contain only numeric data, with the possible addition of datetime strings, this will normally suffice. However, if the data include text labels with characters outside of the ASCII range, then it's possible that the file was encoded using a different encoding. In such cases, the file can either be converted to UTF-8 (by using iconv
, for example), or by specifying the file's encoding on the command line with the --encoding
option. The input to this option is the encoding name (e.g., UTF-8, ISO-8859-15 etc.).
If the encoding of a file is UTF-8 and it begins with a BOM, then a warning is emitted when printing a QC report.
As an example, say a CSV file was exported from MS Excel, and the data contained some common Scandinavian characters. When exporting this, it's possible that Excel will use the Windows-1252 encoding for the data. This will cause cassava
to fail to read the data in, as it's expecting UTF-8 encoded input. Running cassava
without specifying the encoding will result in an exception similar to the following:
$ python -m cassava -C print qc data.csv
'utf-8' codec can't decode byte 0xe5 in position 62: invalid continuation byte. Specify the encoding of the file (see the --encoding option). Error occurred in the block following line number 0. Failed input data context: b'ude (degree_north)\nV\xe5gsbreen,19.7338,80.4'
Note that the exception message includes the (origin zero) line number after which the block containing the invalid byte is included. The character is present on this line or on a line following this line number. The exception message also contains some context (as a raw byte string) around the invalid byte, to aid finding it in the input file.
In such a case, the first task is to find out what the file's encoding is, and then tell cassava
to use that encoding. We can use a character set detection program, such as uchardet
, to identify the encoding:
$ uchardet data.csv
WINDOWS-1252
$ python -m cassava --encoding WINDOWS-1252 -C print qc data.csv
Using the package
To use cassava in your own code, setup a configuration dict
, and then call the required methods from within the Cassava
context manager:
from cassava import Cassava
filename = 'data.csv'
conf = Cassava.DEFAULTS.copy()
opts = {
'header_row': 0,
'first_data_row': 1,
'ycol': [1,2,3,4],
'forgive': True
}
conf.update(opts)
with Cassava(path=filename, conf=conf) as f:
f.read()
f.plot()
Configuration dict
As can be seen above, cassava requires a fully-specified configuration dict
, so the easiest way to ensure this is to take a copy of the Cassava
class DEFAULTS dict
, and then override any specific configuration items. The default configuration dict
is:
DEFAULTS = {
'header_row': None,
'first_data_row': 0,
'comment': None,
'xcol': None,
'ycol': [0],
'x_as_datetime': False,
'datetime_format': '%Y-%m-%dT%H:%M:%S',
'missing_value': None,
'delimiter': ',',
'skip_initial_space': False,
'forgive': False,
'verbose': False
}
- header_row: Integer row index of the input file's header
- first_data_row: Integer row index of the input file's first data row
- comment: Character that introduces a file header section to be skipped
- xcol: Integer column index from the input file for the plot x-axis
- ycol: Integer column index
list
from the input file for the plot y-axis - x_as_datetime: Consider the x-axis data as datetime strings
- datetime_format:
datetime.datetime.strptime()
format specification - missing_value: Value in the data that indicates a missing datum (e.g. -999)
- delimiter: Column delimiter character
- skip_initial_space: Skip any spaces following the delimiter character
- forgive: Forgive mode. Replace invalid numeric values with placeholder (NaN)
- verbose: Print extra messages in
print
mode methods
Note that all cassava column/row coordinates have origin zero.
Persist custom default configuration for a session
A feature of the DEFAULTS dict
is that, because it's a class variable (rather than an instance variable), you can set per import
session defaults for all instances by directly updating DEFAULTS
:
from cassava import Cassava
filenames = ['data1.csv','data2.csv']
opts = {
'header_row': 0,
'first_data_row': 1
}
Cassava.DEFAULTS.update(opts)
for filename in filenames:
with Cassava(path=filename) as f:
f.read()
f.plot()
Note that if no configuration dict
is passed via the conf
kwarg, then cassava uses DEFAULTS
instead.
The cassava object
The Cassava
object has the following instance variables:
self.path = path
self.mode = mode
self.encoding = encoding
self.conf = conf or self.DEFAULTS
self.fp = None
self.header_row = []
self.rows = []
- path: The input file path (
str
) - mode: The input file open mode (
str
) - encoding: The input file character set encoding (
str
) - conf: The configuration for the input file (
dict
) - fp: The file pointer for the input file (
file
object) - header_row: The (optional) header row, parsed from the input data (
list
) - rows: All rows parsed from the input data (
list
oflist
s)
Reading input data
The easiest way to get data into cassava, is by using it as a context manager:
with Cassava(path=filename, conf=conf) as f:
f.read()
f.plot_stats()
however, you can use the open/close
methods directly, should you wish:
c = Cassava(conf=conf)
c.open(path=filename)
c.read()
c.print_qc()
c.close()
As noted in the previous section, the header_row
and rows
attributes hold the header row and all rows parsed from the input data. Hence you could also directly set rows
(and optionally header_row
), rather than reading them from a file using the read
method, and still be able to make use of the methods that cassava provides.
Reading input data with different delimiters
Although by default, cassava is setup to read CSV data, it can actually read any similarly-delimited tabular data. This is controlled by the delimiter
configuration item. For instance a space (conf['delimiter'] = ' '
) or a tab (conf['delimiter'] = '\t'
). Note that if the columns are separated by multiple spaces (e.g. a fixed width format), then setting conf['skip_initial_space'] = True
will consume all spaces between the columns.
Analysing the data
Once we have the data in cassava, we can produce quick-look plots, generate QC reports and compute summary statistics.
Plot the data
To give an initial exploratory look at the data and assess any QC issues, we can simply plot our dependent variables (our ycol
columns) by calling plot()
:
f.conf['ycol'] = [1,2,3,4]
f.plot()
Our input file may contain an independent variable (e.g. datetime) which we can use for our xcol
column:
f.conf['xcol'] = 0
f.conf['ycol'] = [1,2,3,4]
f.plot()
If our dependent variables contain data with differing scales, we can choose to plot them as separate subplots, rather than all on a single plot with a shared y-axis. We do this by providing a layout
tuple specifying an m
rows by n
columns plot grid (where m * n > 1
):
f.conf['ycol'] = [1,2,3,4]
f.plot(layout=(2,2))
Often we don't care how many rows the plot grid requires, rather just how many columns wide the grid should be, so we can use the compute_multi_plot_layout
method to compute our layout
for us, given our desired number of columns:
f.conf['ycol'] = [1,2,3,4]
f.plot(layout=f.compute_multi_plot_layout(2))
which will produce a 2x2 plot grid.
To allow some customisation of the plot, we can use the opts
kwarg. This takes a dict
, and can contain any plot kwargs accepted by matplotlib
:
f.plot(opts={'lw': 4, 'c': 'green', 'ls': '--'})
In addition, we can defer the showing of the plot via the show
kwarg. This allows us to do further customisation of the plot before showing it. Note that to do this, we have to import matplotlib
and call show()
ourselves:
import matplotlib.pyplot as plt
fig, axs = f.plot(show=False)
fig.suptitle(f'Meteorological Data - {filename}')
plt.show()
Plot summary statistics for the data
Plotting summary statistics is straightforward:
f.conf['ycol'] = [1,2,3,4]
f.plot_stats()
This produces three plots for each specified y-column: a density plot of the distribution of the data, a line plot of the data including bounds to highlight potential outliers, and a boxplot of the data.
Note that the bounds in the line plot, and the whiskers in the boxplot, are scaled according to Tukey's rule and are 1.5 * IQR above/below the IQR. We can change this scale factor via the k
kwarg:
f.plot_stats(k=2.5)
If we have outliers that are so large that they dominate the remaining data, then we can choose to omit them, thereby revealing the detail:
f.plot_stats(showfliers=False)
By default, the number of bins to use for the density plot is automatically calculated by matplotlib
. We can specify these explicitly via the bins
kwarg, which takes any form supported by matplotlib.pyplot.hist
:
f.plot_stats(bins=10)
Print a QC report of the data
To print a QC report of the data to stdout
, we can do:
filename = 'data.csv'
with Cassava(path=filename) as f:
f.read()
f.print_qc()
which outputs the following (using our example data from above):
Column counts:
first row 0: ncols = 8
row 14: ncols = 6
row 15: ncols = 12
Row counts:
total rows = 16, data rows = 16
Empty columns:
column 5 is empty
column 6 is empty
column 7 is empty
Empty rows:
row 11 is empty
row 12 is empty
row 13 is empty
row 14 is empty
row 15 is empty
Note that this really requires no configuration, but assumes that our input file has no header row (and so total rows
== data rows
), or more generally, that the data begin on the first row (row 0). If the file does contain a header row, or an extended header section, then we should specify the first data row index for completeness, and so generate a more accurate report:
f.conf['first_data_row'] = 1
f.print_qc()
The print_qc
method is a wrapper around the methods that produce the different output sections in the above output. Of course, we can call these directly to produce a custom report. For example:
f.print_column_counts()
f.print_empty_columns()
Print a summary statistics report of the data
Here we require a little more configuration than for a QC report. We need to specify those columns that we wish to compute summary statistics for, and, taking our example data above, we'd need to run cassava in forgive mode to skip over our trailing empty rows:
f.conf['first_data_row'] = 1
f.conf['ycol'] = [1,4]
f.conf['forgive'] = True
f.print_stats()
which outputs the following (again using our example data from above):
Column stats:
column min mean max q1 median q3 std
1 -12 -11 -10 -11 -11 -10 0.7
4 17 74 2.3e+02 21 24 76 90
Column outliers (1.5 * IQR):
column,row value
4,9 2.3e+02
Here we can see that cassava has identified the value in column 4 (Wind_Speed
) and row 9 as an outlier, according to Tukey's rule.
Access the underlying QC and summary statistics data
Producing the plots and printing the reports is fine, but for tighter integration, we can access the underlying message dict
that encapsulates the QC and statistics information.
Message dict
The general form of the message dict
is:
msg = {
'x': x_coordinate,
'y': y_coordinate,
'data': data_dict,
'status': status_code
}
- x: Integer column index of the input data (
None
indicates all columns) - y: Integer row index of the input data (
None
indicates all rows) - data: A
dict
of the output, specific to each method - status: One of the possible status constants from
class CassavaStatus
Note that all cassava column/row coordinates have origin zero.
For example, the check_column_counts
method calculates the column count of each row in the input data. In particular, it calculates and holds the column count of the configured first data row, and then compares the column counts of all subsequent rows to that of the first, and flags up if they differ. As this is a per-row check, the x-coordinate is None
because it refers to all columns for that row. The column count for the first data row will always have status = CassavaStatus.ok
:
msg = {
'x': None,
'y': y,
'data': {'is_first_row': True, 'ncols': len(row)},
'status': CassavaStatus.ok
}
Subsequent rows will have is_first_row = False
, and status = CassavaStatus.ok
if their column counts agree with that of the first data row, otherwise they will have status = CassavaStatus.error
. (It is the status
attribute that is used for colour-coding the output in the print functions.)
The following describe the message['data'] dict
for each method:
- check_bom: {'has_bom': Boolean}
- check_column_counts: {'is_first_row': Boolean, 'ncols': Integer column count}
- check_empty_columns: {'is_empty': Boolean}
- check_empty_rows: {'is_empty': Boolean}
- compute_column_stats: {'min': Minimum, 'mean': Mean, 'max': Maximum, 'q1': Quartile1, 'median': Quartile2, 'q3': Quartile3, 'std': Standard deviation}
- check_column_outliers_iqr: {'value': Cell value}
Examples
As an example, say we wanted to compute the percentage of empty columns and rows in our example file. We could do:
from cassava import Cassava
filename = 'data.csv'
with Cassava(path=filename) as f:
f.read()
ncols = len(f.rows[f.conf['first_data_row']])
# Uncomment to only consider data rows, rather than total rows
# f.conf['first_data_row'] = 1
# nrows = len(f.rows) - f.conf['first_data_row']
nrows = len(f.rows)
empty_ncols = 0
empty_nrows = 0
for msg in f.check_empty_columns():
if msg['data']['is_empty']:
empty_ncols += 1
for msg in f.check_empty_rows():
if msg['data']['is_empty']:
empty_nrows += 1
print(f'Empty cols = {empty_ncols / ncols * 100:0.2f}%')
print(f'Empty rows = {empty_nrows / nrows * 100:0.2f}%')
To print the mean of each configured column, we could do:
filename = 'data.csv'
conf = Cassava.DEFAULTS.copy()
opts = {
'header_row': 0,
'first_data_row': 1,
'ycol': [1,2,3,4],
'forgive': True
}
conf.update(opts)
with Cassava(path=filename, conf=conf) as f:
f.read()
labels = f.get_column_labels_from_header(f.conf['ycol'])
for msg, label in zip(f.compute_column_stats(), labels):
print(f'{label} mean = {msg["data"]["mean"]:0.2f}')
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for cassava_csv-0.4.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0f072efbb192e9b818b391c35e714598efde86b24e2d1a7b188d8e90a43d3bff |
|
MD5 | e082726ff198482a033b1c8b0dec885d |
|
BLAKE2b-256 | f374b70139ed286d936311fded3360ef5d285d775813806b7f82ff25026b5c95 |