package for detecting change in time-series data
Change detection in prescribing data
Detects changes in time series with a Python wrapper around the R package gets (https://cran.r-project.org/web/packages/gets/index.html). Uses a combination of Google BigQuery and Python to query data, which is then fed to the R change detection code. Outputs a table containing the results.
Installation
pip install change_detection
Anaconda users may have to conda install rpy2 and conda install geopandas if not already installed.
Usage
See https://github.com/ebmdatalab/change_detection/blob/master/examples/examples.ipynb for examples of use.
Data flow
- Get data, by:
  - using a csv in data/<name>, which must have only the fields code, month, numerator and denominator
  - creating a BigQuery SQL query in the same folder as the notebook that you're using; the query must produce a table with only the fields code, month, numerator and denominator
  - querying any number of the OpenPrescribing measures in BigQuery
- Reshapes data with Pandas
- Splits data into chunks and passes each chunk to the R change detection code
- The resulting output is then extracted with further R code
- The R outputs are then concatenated
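The Python side of this flow can be sketched roughly as follows. This is a minimal illustration, not the package's own code: the helper names load_and_reshape and split_columns, the numerator/denominator ratio, the wide one-column-per-code layout and the chunk size are all assumptions, and the R step is replaced by a plain concatenation of the chunks.

```python
import io

import pandas as pd

REQUIRED_FIELDS = ["code", "month", "numerator", "denominator"]


def load_and_reshape(csv_text):
    """Read the CSV, check its fields, and pivot to one column per code."""
    df = pd.read_csv(io.StringIO(csv_text))
    if sorted(df.columns) != sorted(REQUIRED_FIELDS):
        raise ValueError(f"expected only {REQUIRED_FIELDS}, got {list(df.columns)}")
    df["month"] = pd.to_datetime(df["month"])
    # Assumption: the series of interest is numerator / denominator.
    df["ratio"] = df["numerator"] / df["denominator"]
    # Wide layout: one row per month, one column per code.
    return df.pivot(index="month", columns="code", values="ratio")


def split_columns(wide, chunk_size):
    """Split the wide table into chunks of at most chunk_size codes."""
    cols = list(wide.columns)
    return [wide[cols[i:i + chunk_size]] for i in range(0, len(cols), chunk_size)]


# Toy input, made up for illustration only.
csv_text = """code,month,numerator,denominator
A,2020-01-01,1,10
A,2020-02-01,2,10
B,2020-01-01,3,10
B,2020-02-01,4,10
C,2020-01-01,5,10
C,2020-02-01,6,10
"""

wide = load_and_reshape(csv_text)
chunks = split_columns(wide, chunk_size=2)

# Stand-in for the R step: each chunk would be processed separately and
# the per-chunk results concatenated afterwards.
recombined = pd.concat(chunks, axis=1)
print(recombined)
```

Checking the field names before reshaping gives an early, readable error when a csv or query returns extra columns.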
Options
name
specifies either the name of the custom SQL file, or the name of the BigQuery measure to be queried
verbose
makes the R output more verbose to help with bug fixing, default = False
sample
for testing purposes, takes a random sample of 100 entities to reduce processing time, default = False
measure
specifies that the name specified refers to a measure, rather than custom SQL, default = False
direction
specifies which direction to look for changes, may be 'up', 'down', or 'both', default = 'both'
use_cache
passes the use_cache option to bq.cached_read, default = True
csv_name
specifies a .csv file to be used in the change detection, rather than getting the data from BigQuery
overwrite
forces reprocessing of the change detection; the default behaviour is to not re-run if the output files exist, default = False
draw_figures
draws an R plot for each of the time series, along with regression lines/breaks. These are stored in the figures folder. Options are 'no' or 'yes', default = 'no'
Output table
Timing Measures
is.tfirst
First negative break
is.tfirst.pknown
First negative break after a known intervention date
is.tfirst.pknown.offs
First negative break after a known intervention date not offset by a XX% increase
is.tfirst.offs
First negative break not offset by a XX% increase
is.tfirst.big
Steepest break as identified by is.slope.ma
Slope Measures
is.slope.ma
Average slope over steepest segment contributing at least XX% of total drop
is.slope.ma.prop
Average slope as proportion to prior level
is.slope.ma.prop.lev
Percentage of the total drop the segment used to evaluate the slope makes up
Level Measures
is.intlev.initlev
Pre-drop level
is.intlev.finallev
End level
is.intlev.levd
Difference between pre and end level
is.intlev.levdprop
Proportion of drop
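As a rough illustration of how the level measures relate to one another, the sketch below derives the difference and the proportion from a pre-drop and an end level. The function name is hypothetical, and both the sign convention (pre-drop minus end level) and the use of the pre-drop level as the denominator are assumptions; the R code may define them differently.

```python
def level_summary(initial_level, final_level):
    """Derive level measures from a pre-drop and an end level (assumed formulas)."""
    # Assumption: the difference is pre-drop level minus end level.
    level_diff = initial_level - final_level
    # Assumption: the proportion is expressed relative to the pre-drop level.
    if initial_level == 0:
        proportion = float("nan")
    else:
        proportion = level_diff / initial_level
    return {
        "is.intlev.initlev": initial_level,
        "is.intlev.finallev": final_level,
        "is.intlev.levd": level_diff,
        "is.intlev.levdprop": proportion,
    }


# Made-up levels, for illustration only.
summary = level_summary(8.0, 6.0)
print(summary)
```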
Requirements
Python with an associated install of R. Python dependencies should be dealt with on installation (though for my install, I had to install rpy2 separately). R packages should be installed when the package is first loaded.
Python installation requires:
- ebmdatalab library https://github.com/ebmdatalab/datalab-pandas
- rpy2 (to install R and the below libraries)
- pandas
- pandas-gbq
- numpy
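A quick way to see whether any of these are missing before running a notebook is to check that each module can be found. This is a small sketch using only the standard library; the import names pandas_gbq and ebmdatalab are assumed to correspond to the pandas-gbq and ebmdatalab distributions, and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the Python dependencies listed above.
PYTHON_MODULES = ["ebmdatalab", "rpy2", "pandas", "pandas_gbq", "numpy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(PYTHON_MODULES)
if missing:
    print("Missing Python modules:", ", ".join(missing))
else:
    print("All listed Python modules were found.")
```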
R installation requires:
- zoo
- caTools
- gets
Download files
Built Distribution
Hashes for change_detection-0.3.5-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e98002e2ea809993f607b338c73097d8495aa9ac03de4ff0a6c8b6d24936937d
MD5 | fdc8ad48e7cc8cb775dcfe2d7399de50
BLAKE2b-256 | 775896cfcc6f22266be5fd70d3116f5eb455bc49eed0900b28bc5fe3a2b76abf