Data Story Pattern Analysis for LOSD
Project description
DataStoryPatternLibrabry
Data Story Patterns Library is a repository with pattern analysis designated for Linked Open Statistical Data. Story Patterns were retrieved from literture reserach udenr general subject of "data journalism".
Installation
pip install datastories
Requirements will be automatically installed with package
###Import/Usage
import datastories.analytical as patterns
patterns.DataStoryPattern(sparqlendpointurl, jsonmetadata)
Object created allow to query SPARQL endpoint based on JSON meatadat provided.
JSON Template
{
"cube_key" : {
"title":"title of cube",
"dataset_structure":"URI for cube structure",
"dimensions":{
"dimension_key":{
"dimension_title":"Title of diemnsion",
"dimension_url":"URI for dimension",
"dimension_prefix":"URI for dimension's values"
},
"dimension_key":{
"dimension_title":"Title of diemnsion",
"dimension_url":"URI for dimension",
"dimension_prefix":"URI for dimension's values"
}
},
"hierarchical_dimensions":{
"dimension_key":{
"dimension_title":"Title of diemnsion",
"dimension_url":"URI for dimension",
"dimension_prefix":"URI for dimension's values",
"dimension_levels":
{
"level_key":"integer(granularity level)",
"level_key":"integer(granularity level)"
}
}
},
"measures":{
"measure_key":{
"measure_title":"Title of measure",
"measure_url":"URI for measure"
}
}
}
}
Patterns Description
- Measurement and Counting
- League Table
- Internal Comprison
- Profile Outliers
- Dissect Factors
- Highlight Contrast
- Start Big Drill Down
- Start Small Zoom Out
- Analysis By Category
- Explore Intersection
- Narrating Change Over Time
MCounting
Measurement and Counting Arithemtical operators applied to whole dataset - basic information regarding data
Attributes
def MCounting(self,cube="",dims=[],meas=[],hierdims=[],count_type="raw",df=pd.DataFrame() )
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
count_type | String |
Type of Count to perform |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
Output
Based on count_type value
Count_type | Description |
---|---|
raw | data without any analysis performed |
sum | sum across all numeric columns |
mean | mean across all numeric columns |
min | minimum values from all numeric columns |
max | maximum values from all numeric columns |
count | amount of records |
LTable
LeagueTable - sorting and extraction specific amount of records
Attributes
def LTable(self,cube=[],dims=[],meas=[],hierdims=[], columns_to_order="", order_type="asc", number_of_records=20,df=pd.DataFrame())
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
columns_to_order | list[String] |
Set of columns to order by |
order_type | String |
Type of order (asc/desc) |
number_of_records | Integer |
Amount of records to retrieve |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
Output
Based on sort_type value
Sort_type | Description |
---|---|
asc | ascending order based on columns provided in columns_to_order |
desc | descending order based on columns provided in columns_to_order |
InternalComparison
InternalComparison - comparison of numeric values related to textual values within one column
Attributes
def InternalComparison(self,cube="",dims=[],meas=[],hierdims=[],df=pd.DataFrame(), dim_to_compare="",meas_to_compare="",comp_type="")
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
dim_to_compare | String |
Dimension, which values will be investigated |
meas_to_compare | String |
Measure, which numeric values related to dim_to_compare will be processed |
comp_type | String |
Type of comparison to perform |
Output
Independent from comp_type
selected, output data will have additional column with numerical column meas_to_compare
processed in specific way.
Available types of comparison comp_type
Comp_type | Description |
---|---|
diffmax | difference with max value related to specific textual value |
diffmean | difference with arithmetic mean related to specific textual values |
diffmin | difference with minimum value related to specific textual value |
ProfileOutliers
ProfileOutliers - detection of unusual values within data (anomalies)
Attributes
def ProfileOutliers(self,cube=[],dims=[],meas=[],hierdims=[],df=pd.DataFrame(), displayType="outliers_only")
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
displayType | String |
What information display are bound to display (with/without anomalies) |
Output
Pattern analysis using python scipy
library will perform quick exploration in serach of unusual values within data.
Based on displayType
parameter data will be displayed with/without ddetected unusual values.
Available types of displaying displayType
DisplayType | Description |
---|---|
outliers_only | returns rows from dataset where unusual values were detected |
without_outliers | returns dataset with excluded rows where unusual values were detected |
DissectFactors
DissectFactors - decomposition of data based on values in dim_to_dissect
Attributes
def DissectFactors(self,cube="",dims=[],meas=[],hierdims=[],df=pd.DataFrame(),dim_to_dissect="")
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
dim_to_dissect | String |
Based on which dimension data should be decomposed |
Output
As an output, data will be decomposed in a form of a dictionary, where each subset have values only related to specific value.
Dictionary of subdataset will be constructed as a series of paiers where key per each susbet will values from dim_to_dissect
and this key value will be data, where yhis key value was occurring.
HighlightContrast
HighlightContrast - partial difference within values related to one textual column
Attributes
def HighlightContrast(self,cube="",dims=[],meas=[],hierdims=[],df=pd.DataFrame(),dim_to_contrast="",contrast_type="",meas_to_contrast="")
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
dim_to_contrast | String |
Textual column, from which values will be contrasted |
meas_to_contrast | String |
Numerical column, which values are contrasted |
contrast_type | String |
Type of contrast to present |
Output
Independent from contrast_type
selected, output data will have additional column with numerical column meas_to_contrast
processed in specific way.
Available types of comparison contrast_type
Contrast_type | Description |
---|---|
partofwhole | difference with max value related to specific textual value |
partofmax | difference with arithmetic mean related to specific textual values |
partofmin | difference with minimum value related to specific textual value |
StartBigDrillDown
StartBigDrillDown - data retrieval from multiple hierachical levels.
This pattern can be only applied to data not stored already in DataFrame
Attributes
def StartBigDrillDown(self,cube="",dims=[],meas=[],hierdim_drill_down=[])
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdim_drill_down | dict{hierdim:list[str]} |
Hierarchical dimension with list of hierarchy levels to inspect |
Output
As an output, data will be retrieved in a form of a dictionary, where each dataset will be retrieved from different hierachy level. List will be provided inhierdim_drill_down
. Hierachy levels provided by in parameter will automatically sorted in order from most general to most detailed level based on metadata provided.
StartSmallZoomOut
StartSmallZoomOut - data retrieval from multiple hierachical levels.
This pattern can be only applied to data not stored already in DataFrame
Attributes
def StartSmallZoomOut(self,cube="",dims=[],meas=[],hierdim_zoom_out=[])
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdim_zoom_out | dict{hierdim:list[str]} |
Hierarchical dimension with list of hierarchy levels to inspect |
Output
As an output, data will be retrieved in a form of a dictionary, where each dataset will be retrieved from different hierachy level. List will be provided inhierdim_zoom_out
. Hierachy levels provided by in parameter will automatically sorted in order from most detaile to most general level based on metadata provided.
AnalysisByCategory
AnalysisByCategory - ecomposition of data based on values in dim_for_category with analysis performed on each susbet
Attributes
def AnalysisByCategory(self,cube="",dims=[],meas=[],hierdims=[],df=pd.DataFrame(),dim_for_category="",meas_to_analyse="",analysis_type="min"):
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
dim_for_category | String |
Dimension, based on which input data will be categorised |
meas_to_analyse | String |
Measure, which will be analysed |
analysis_type | String |
Type of analysis to perform |
Output
As an output, data will be decomposed in a form of a dictionary, where each subset have values only related to specific value. Such subset will get analysed based on analysis_type
parameter
Available types of analysis analysis_type
Analysis_type | Description |
---|---|
min | Minimum per each category |
max | Maximum per each category |
mean | Arithmetical mean per each category |
sum | Total value from each category |
ExploreIntersection
Attributes
def ExploreIntersection(self, dim_to_explore=""):
Parameter | Type | Description |
---|---|---|
dim_to_explore | String |
Dimension, which existence within enpoint is going to be investigated |
Output
Pattern will return series of datasets, where each will represent occurence of dim_to_explore
in one cube
NarratingChangeOverTime
Attributes
def NarrChangeOT(self,cube="",dims=[],meas=[],hierdims=[],df=pd.DataFrame(),meas_to_narrate="",narr_type="")
Parameter | Type | Description |
---|---|---|
cube | String |
Cube, which dimensions and measures will be investigated |
dims | list[String] |
List of dimensions (from cube) to take into investigation |
meas | list[String] |
List of measures (from cube) to take into investigation |
hierdims | dict{hierdim:{"selected_level":[value]}} |
Hierarchical Dimesion with selected hierarchy level to take into investigation |
df | DataFrame |
DataFrame object, if data is already retrieved from endpoint |
meas_to_narrate | String |
Set of 2 measures, which change will be narrated |
narr_type | String |
Type of narration to perform |
Output
Independent from narr_type
selected, output data will have additional column with numerical values processed in specific way.
Available types of analysis narr_type
Narr_type | Description |
---|---|
percchange | Percentage change between first nad second property |
diffchange | Quantitive change between first and second property |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file datastories-0.3.tar.gz
.
File metadata
- Download URL: datastories-0.3.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0caec39626844d01db960560ddca4081f6f4e5f073c8630b6ca4cbc8640d5bff |
|
MD5 | f46f2dd881cd8e74c9dd370f8d76e9e7 |
|
BLAKE2b-256 | 44b4bb9c92c6c9e65b125e0ce61cb9e117c089f9e21d64c9c99911287425f4ed |