Learning Point Processes Using Deep Granger Nets
- Info:
See <https://arxiv.org/abs/1406.6651> for theoretical background
- Author:
ZeD@UChicago <zed.uchicago.edu>
- Description:
Implementation of the Deep Granger net inference algorithm, described in https://arxiv.org/abs/1406.6651, for learning spatio-temporal stochastic processes (point processes). cynet learns a network of generative local models, without assuming any specific model structure.
sudo apt-get install python3-tk
Usage:
from cynet import cynet
from cynet.cynet import uNetworkModels as models
from viscynet import viscynet as vcn
- cynet module includes:
cynet
viscynet
bokeh_pipe
cynet library classes:
spatioTemporal
uNetworkModels
simulateModels
- class spatioTemporal
Utilities for spatial-temporal analysis
- Attributes:
log_store (Pickle): Pickle storage of class data & dataframes
log_file (string): path to CSV of legacy dataframe
ts_store (string): path to CSV containing most recent ts export
DATE (string): column label for the event date
EVENT (string): column label for category filter
coord1 (string): first coordinate level type; is column name
coord2 (string): second coordinate level type; is column name
coord3 (string): third coordinate level type; (z coordinate)
end_date (datetime.date): upper bound of daterange
freq (string): timeseries increments; e.g. D for day
columns (list): list of column names to use; requires at least 2 coordinates and event type
types (list of strings): event type list of filters
value_limits (tuple): boundaries (magnitude of event above threshold)
grid (dictionary or list of lists): coordinate dictionary with respective ranges and EPS value OR custom list of lists of custom grid tiles as [coord1_start, coord1_stop, coord2_start, coord2_stop]
grid_type (string): parameter to determine if grid should be built up from a coordinate start/stop range (‘auto’) or be built from custom tile coordinates (‘custom’)
threshold (float): significance threshold
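To make the custom tile format concrete, here is a minimal sketch of how a grid of [coord1_start, coord1_stop, coord2_start, coord2_stop] tiles could be laid out from lists of start coordinates and an EPS increment. The helper name build_tiles is hypothetical and is not part of cynet; how cynet builds its 'auto' grid internally is an assumption here.

```python
def build_tiles(lat_starts, lon_starts, eps):
    """Hypothetical helper: one square tile per (lat, lon) start pair.

    Each tile follows the documented layout
    [coord1_start, coord1_stop, coord2_start, coord2_stop].
    """
    tiles = []
    for lat in lat_starts:
        for lon in lon_starts:
            tiles.append([lat, lat + eps, lon, lon + eps])
    return tiles


# Two latitude starts and one longitude start give two tiles.
tiles = build_tiles([41.0, 41.5], [-87.5], 0.5)
print(tiles)
```

A list shaped like this is what the grid argument accepts when custom tiles are supplied instead of a coordinate dictionary.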
Methods:
__init__(self, log_store='log.p', log_file=None, ts_store=None, DATE='Date', year=None, month=None, day=None, EVENT='Primary Type', coord1='Latitude', coord2='Longitude', coord3=None, init_date=None, end_date=None, freq=None, columns=None, types=None, value_limits=None, grid=None, threshold=None)
fit(self, grid=None, INIT=None, END=None, THRESHOLD=None, csvPREF='TS', auto_adjust_time=False, incr=6, max_incr=24, poly_tile=False):
Fit dataproc with the specified grid parameters and create timeseries for the date boundaries specified by INIT, THRESHOLD, and END, or for an input list of custom coordinate boundaries, which do NOT have to match the arguments first passed to the dataproc
Inputs -
grid (dictionary or list of lists): coordinate dictionary with respective ranges and EPS value OR custom list of lists of custom grid tiles as [coord1_start, coord1_stop, coord2_start, coord2_stop]
INIT (datetime.date): starting timeseries date
END (datetime.date): ending timeseries date
THRESHOLD (float): significance threshold
auto_adjust_time (boolean): if True, within increments specified (6H default), determine optimal temporal frequency for timeseries data
incr (int): frequency increment
max_incr (int): user-specified maximum increment
poly_tile (boolean): whether or not tiles define polygons
Outputs -
(No output) grid pd.DataFrame written out as CSV file to path specified
getTS(self, _types=None, tile=None, freq=None):
Given location tile boundaries and a type category filter, creates the corresponding timeseries as a pandas DataFrame
(Note: the type filter can be reassigned; it does not have to be the one the dataproc was initialized with)
Inputs -
_types (list of strings): list of category filters
tile (list of floats): location boundaries for tile
freq (string): intervals of time between timeseries columns
poly_tile (boolean): whether or not input for tiles defines a polygon filter
Outputs -
pd.DataFrame of timeseries data for the corresponding grid tile
pd.DataFrame index is the stringified LAT/LON boundaries with the type filter included
get_rand_tile(tiles=None, LAT=None, LON=None, EPS=None, _types=None):
Picks a random tile, from the options fed into the timeseries method, which maps to a non-empty subset within the larger dataset
Inputs -
LAT (float or list of floats): singular coordinate float or list of coordinate start floats
LON (float or list of floats): singular coordinate float or list of coordinate start floats
EPS (float): coordinate increment
_types (list): event type filter; accepted event type list
tiles (list of lists): list of tiles to build. Ex: list of [lat1 lat2 lon1 lon2], or tuples (i.e. [(x1,y1),(x2,y2)]) defining polygons
poly_tile (boolean): whether input for tile specifies a polygon
Outputs -
tile dataframe (pd.DataFrame)
get_opt_freq(df, incr=6, max_incr=24):
Returns the optimal frequency for the timeseries, based on the highest non-zero to zero timeseries event count
Input -
df (pd.DataFrame): filtered subset of dataset corresponding to random tile from get_rand_tile
incr (int): frequency increment
max_incr (int): user-specified maximum increment
Output -
(string) to pass to pd.date_range(freq=) argument
getGrid(self):
Returns the tile coordinates of the working grid as a list of lists
Input -
(No inputs)
Output -
TILE (list of lists): the grid tiles
pull(self, domain='data.cityofchicago.org', dataset_id='crimes', token=None, store=True, out_fname='pull_df.p', pull_all=False):
Pulls new entries from datasource
Input -
domain (string): Socrata database domain hosting data
dataset_id (string): dataset ID to pull
token (string): Socrata token for increased pull capacity; Note: Requires Socrata account
store (boolean): whether or not to write out new dataset
pull_all (boolean): pull complete dataset instead of just updating
Output -
None (writes out files if store is True and modifies inplace)
timeseries(self, LAT=None, LON=None, EPS=None, _types=None, CSVfile='TS.csv', THRESHOLD=None, tiles=None, incr=6, max_incr=24, poly_tile=False):
Creates a DataFrame of location tiles and their respective timeseries from the input datasource, with significance threshold THRESHOLD and latitude/longitude coordinate boundaries given by LAT, LON, and EPS or the custom boundaries given by tiles
Calls getTS for each individual tile, then concatenates the results
Input -
LAT (float or list of floats): singular coordinate float or list of coordinate start floats
LON (float or list of floats): singular coordinate float or list of coordinate start floats
EPS (float): coordinate increment
_types (list): event type filter; accepted event type list
CSVfile (string): path to output file
tiles (list of lists): list of tiles to build (list of [lat1 lat2 lon1 lon2])
auto_adjust_time (boolean): if True, within increments specified (6H default), determine optimal temporal frequency for timeseries data
incr (int): frequency increment
max_incr (int): user-specified maximum increment
poly_tile (boolean): whether or not tiles define polygons
Output -
(No output) grid pd.DataFrame written out as CSV file to path specified
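The frequency selection described for get_opt_freq can be illustrated with a small standalone sketch. It assumes the criterion is the ratio of time bins that contain at least one event to bins that contain none, evaluated at incr, 2*incr, ... up to max_incr hours. The function names, the use of plain datetime lists instead of a DataFrame, and the exact scoring are assumptions made for illustration, not cynet's implementation.

```python
import math
from datetime import datetime, timedelta


def nonzero_to_zero_ratio(timestamps, start, end, hours):
    """Ratio of time bins containing an event to bins containing none."""
    width = timedelta(hours=hours)
    n_bins = math.ceil((end - start) / width)
    # Index of the bin each in-range event falls into.
    occupied = {(t - start) // width for t in timestamps if start <= t < end}
    empty = n_bins - len(occupied)
    return len(occupied) / empty if empty else float("inf")


def pick_frequency(timestamps, start, end, incr=6, max_incr=24):
    """Try incr, 2*incr, ... up to max_incr hours and keep the best ratio."""
    candidates = range(incr, max_incr + 1, incr)
    best = max(candidates,
               key=lambda h: nonzero_to_zero_ratio(timestamps, start, end, h))
    # The 'H' suffix mirrors the "6H" wording above; recent pandas
    # releases may prefer a lowercase 'h' alias.
    return f"{best}H"


# One event at noon on each of four days: daily bins leave no empty bin.
start = datetime(2020, 1, 1)
end = datetime(2020, 1, 5)
events = [start + timedelta(days=d, hours=12) for d in range(4)]
freq = pick_frequency(events, start, end)
print(freq)
```

The returned string has the form expected by the freq argument of pd.date_range.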
- Utility functions for spatioTemporal:
splitTS(TSfile, csvNAME='TS1', dirname='./', prefix='@', BEG=None, END=None, VARNAME=''):
Utilities for spatio temporal analysis
Writes out each row of the pd.DataFrame as a separate CSV file for the XgenESeSS binary
Inputs -
TSfile (pd.DataFrame): DataFrame to write out
csvNAME (string): output filename
dirname (string): directory for output file
prefix (string): prefix for files
VARNAME (string): string to append to file names
BEG (datetime): start date
END (datetime): end date
Outputs -
(No output)
stringify(List):
Utility function
Converts a list into a string separated by dashes, or returns an empty string if the input is not a list or is empty
Input -
List (list): input list to be converted
Output -
(string)
to_json(pydict, outFile):
Writes dictionary json to file
Input -
pydict (dict): dictionary to store
outFile (string): name of outfile to write json to
Output -
(No output but writes out files)
readTS(TSfile, csvNAME='TS1', BEG=None, END=None):
Utilities for spatio temporal analysis
Reads in output TS logfile into pd.DataFrame and outputs necessary CSV files in XgenESeSS-friendly format
Input -
TSfile (string or list of strings): filename of input TS to read, or list of filenames to read in and concatenate into one TS
csvNAME (string)
BEG (string): start datetime
END (string): end datetime
Output -
dfts (pandas.DataFrame)
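The behaviour described for stringify is small enough to sketch directly. The version below follows the description above (dash-separated string, empty string for a non-list or empty input); the name stringify_sketch is hypothetical, and the library's own code may differ in details such as how elements are converted to text.

```python
def stringify_sketch(values):
    """Join list items with dashes; return '' for non-lists and empty lists."""
    if not isinstance(values, list) or not values:
        return ""
    return "-".join(str(v) for v in values)


print(stringify_sketch(["BURGLARY", "THEFT"]))
print(repr(stringify_sketch(None)))
```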
- class uNetworkModels:
Utilities for storing and manipulating XPFSA models inferred by XGenESeSS
- Attributes:
jsonFile (string): path to json file containing models
Methods defined here:
__init__(self, jsonFILE):
append(self,pydict):
Utilities for storing and manipulating XPFSA models
inferred by XGenESeSS
append models to internal dictionary
augmentDistance(self):
Utilities for storing and manipulating XPFSA models
inferred by XGenESeSS
Calculates the distance between all models and stores it
under the distance key of each model;
No I/O
select(self,var="gamma",n=None,
reverse=False, store=None,
high=None,low=None,equal=None,inplace=False):
Utilities for storing and manipulating XPFSA models
inferred by XGenESeSS
Selects the top n models as ranked by the value of var
(in reverse order if reverse is True)
Inputs -
var (string): model parameter to rank by
n (int): number of models to return
reverse (boolean): return in ascending order (True)
or descending (False) order
store (string): name of file to store selection json
high (float): higher cutoff
equal (float): choose models with selection values
equal to the given value
low (float): lower cutoff
inplace (bool): update models if true
Output -
(dictionary): top n models as ranked by var
in ascending/descending order
setVarname(self):
Utilities for storing and manipulating XPFSA models
inferred by XGenESeSS
Extracts the varname for src and tgt of
each model and stores under src_var and tgt_var
keys of each model;
No I/O
to_json(outFile):
Utilities for storing and manipulating XPFSA models
inferred by XGenESeSS
Writes out updated models json to file
Input -
outFile (string): name of outfile to write json to
Output -
(No output but writes out files)
setDataFrame(self,scatter=None):
Generate dataframe representation of models
Input -
scatter (string): prefix of filename to plot 3X3 regression
matrix between delay, distance and coefficient of causality
Output -
Dataframe with columns
['latsrc','lonsrc','lattgt','lontgt','gamma','delay','distance']
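The ranking and cutoff behaviour described for select can be sketched on a plain dictionary. This is an illustrative reimplementation under stated assumptions: models are taken to be a dict mapping a model id to a dict of parameters, the high and low cutoffs are treated as inclusive, and the function name select_models and the sample data are hypothetical. It follows the documented convention that reverse=True returns ascending order.

```python
def select_models(models, var="gamma", n=None, reverse=False,
                  high=None, low=None, equal=None):
    """Filter models by cutoffs on `var`, then keep the top n by `var`."""
    chosen = dict(models)
    if equal is not None:
        chosen = {k: m for k, m in chosen.items() if m[var] == equal}
    if high is not None:
        # Assumption: the higher cutoff is inclusive.
        chosen = {k: m for k, m in chosen.items() if m[var] <= high}
    if low is not None:
        # Assumption: the lower cutoff is inclusive.
        chosen = {k: m for k, m in chosen.items() if m[var] >= low}
    # Documented convention: reverse=True is ascending, False is descending.
    ranked = sorted(chosen, key=lambda k: chosen[k][var], reverse=not reverse)
    if n is not None:
        ranked = ranked[:n]
    return {k: chosen[k] for k in ranked}


# Made-up models for illustration only.
models = {
    "m1": {"gamma": 0.9, "delay": 3},
    "m2": {"gamma": 0.4, "delay": 1},
    "m3": {"gamma": 0.7, "delay": 12},
}
top_two = select_models(models, var="gamma", n=2)
short_delay = select_models(models, var="delay", high=5, reverse=True)
print(list(top_two), list(short_delay))
```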
- class simulateModel
Utilities for generating statistical analysis after processing models
- Attributes:
MODEL_PATH(string)- The path to the model being processed.
DATA_PATH(string)- Path to the split file.
RUNLEN(integer)- Length of the run.
READLEN(integer)- Length of split data to read from beginning
CYNET_PATH - path to cynet binary.
FLEXROC_PATH - path to flexroc binary.
- Methods:
run(self, LOG_PATH=None, PARTITION=0.5, DATA_TYPE='continuous', FLEXWIDTH=1, FLEX_TAIL_LEN=100, POSITIVE_CLASS_COLUMN=5, EVENTCOL=3, tpr_thrshold=0.85, fpr_threshold=0.15):
This function is intended to replace the cynrun.sh shell script. It uses the subprocess library to call cynet on a model to process it, and then runs flexroc on the result to obtain statistics: auc, tpr, fpr.
Inputs:
LOG_PATH(string)- Logfile from cynet run
PARTITION(string)- Partition to use on split data
FLEXWIDTH(int)- Parameter to specify flex in flexroc
FLEX_TAIL_LEN(int)- tail length of input file to consider [0: all]
POSITIVE_CLASS_COLUMN(int)- positive class column
EVENTCOL(int)- event column
tpr_thrshold(float)- tpr threshold
fpr_threshold(float)- fpr threshold
Returns:
auc, tpr, and fpr statistics from flexroc.
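For orientation, the auc, tpr, and fpr statistics can be illustrated with a plain ROC computation over binary labels and prediction scores. This sketch is not flexroc and does not model FLEXWIDTH or FLEX_TAIL_LEN; it is a textbook threshold sweep with trapezoidal integration, and the function names and sample data are made up for illustration.

```python
def roc_points(labels, scores):
    """Sweep a threshold over the scores and collect (fpr, tpr) pairs.

    Assumes labels are 0/1 and that both classes are present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Perfectly separated toy example: both positives score above both negatives.
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.1]
points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points, auc)
```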
- Utility functions for simulateModel:
def parallel_process(arguments):
This function takes a model and produces statistics on it. The output is saved to a result file with the suffix defined by RESSUFIX. Note that arguments needs to be a list of various arguments (detailed below) due to the nature of joblib. We expect this function to be called by a parallel processing library such as joblib.
Inputs:
arguments(list)- a list of arguments necessary for the function:
arguments[0]-FILE(str): path to the model being processed.
arguments[1]-model_nums(int): Number of models to use in prediction
arguments[2]-Horizon(int): prediction horizon.
arguments[3]-DATA_PATH: path to split file. Ex: './split/1995-01-01_1999-12-31'
arguments[4]-RUNLEN(int): the runlength
arguments[5]-VARNAME(list)- Variable names to consider.
arguments[6]-RESSUFIX- suffix to add to the end of results.
arguments[7]-CYNET_PATH- path to cynet binary.
arguments[8]-FLEXROC_PATH- path to flexroc binary.
def run_pipeline(glob_path, model_nums, horizon, DATA_PATH, RUNLEN, VARNAME, RES_PATH, RESSUFIX = '.res', cores = 4):
This function is intended to take the output models from midway, process them, and produce graphs. It calls the parallel_process function in parallel using joblib, and eventually stores the result as 'res_all.csv'. Cynet and flexroc are binaries written in C++.
Inputs:
Glob_path(str)- The glob string to be used to find all models. Ex: 'models/*model.json'
model_nums(list of ints)- The model numbers to use. Ex: [10,15,20,25]
Horizon(int)- prediction horizons to test in unit of temporal quantization (using cynet binary)
DATA_PATH(str)- Path to the split files. Ex: './split/1995-01-01_1999-12-31'
RUNLEN(int)- Length of run. Ex: 2291.
VARNAME(list of str)- List of variables to consider.
RES_PATH(str)- glob string for glob to locate all result files. Ex: './models/*model*res'
RESSUFIX(str)- suffix to add to the end of results. Ex: '.res'
cores(int)- cores to use for parallel processing.
Outputs:
Produces graphs of statistics.
def get_var(res_csv, coords, varname='auc', VARNAMES=None):
This function outputs graphs of the results produced by run_pipeline. The graphs concern auc, fpr, and tpr statistics.
Inputs:
res_csv(str)- path to 'res_all.csv' file produced by run_pipeline.
coords(list of str)- the coords to consider. Ex: ['lattgt1','lattgt2','lontgt1','lontgt2']
varname(str)- the variable name to consider. Ex: 'auc'.
VARNAMES(list of str)- List of the variable names from the dataset to consider. Ex: VARNAMES=['Personnel','Infrastructure','Casualties']
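Because parallel_process takes a single positional list, the caller has to pack the nine documented fields in order before handing the jobs to a parallel runner. The sketch below shows one way to do that packing, assuming one job per (model file, model count) pair. The helper build_arguments, the file names, and the binary paths are hypothetical; how run_pipeline actually assembles its jobs is not specified here.

```python
from itertools import product


def build_arguments(model_files, model_nums, horizon, data_path, runlen,
                    varname, ressuffix, cynet_path, flexroc_path):
    """Return one argument list per (model file, model count) pair.

    The position of each field follows the arguments[0]..arguments[8]
    layout documented for parallel_process.
    """
    return [
        [path, num, horizon, data_path, runlen, varname,
         ressuffix, cynet_path, flexroc_path]
        for path, num in product(model_files, model_nums)
    ]


# Placeholder inputs; the model files and binary paths are invented.
jobs = build_arguments(
    model_files=["models/a_model.json", "models/b_model.json"],
    model_nums=[10, 15],
    horizon=7,
    data_path="./split/1995-01-01_1999-12-31",
    runlen=2291,
    varname=["Personnel"],
    ressuffix=".res",
    cynet_path="./bin/cynet",
    flexroc_path="./bin/flexroc",
)
print(len(jobs), jobs[0][:3])
```

Each inner list could then be passed as the single argument of parallel_process by whichever parallel runner is in use.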
- viscynet library classes:
visualization library for Network Models produced by uNetworkModels based on matplotlib
- Functions:
draw_screen_poly(lats, lons, m, ax, val, cmap, ALPHA=0.6):
utility function to draw polygons on basemap
Inputs -
lats (list of floats): mpl_toolkits.basemap lat parameters
lons (list of floats): mpl_toolkits.basemap lon parameters
m (mpl.mpl_toolkits.Basemap): mpl instance for plotting
ax (axis parent handle)
cax (colorbar parent handle)
val (Matplotlib color)
cmap (string): colormap cmap parameter
ALPHA (float): alpha value to use for plot
Outputs -
(No outputs - modifies objects in place)
getalpha(arr, index, F=0.9):
function to normalize transparency of quiver
Inputs -
arr (iterable): list of input values
index (int): index position from which the alpha value should be taken
F (float): multiplier
M (float): minimum alpha value
Outputs -
v (float): alpha value
showGlobalPlot(coords, ts=None, fsize=[14, 14], cmap='jet', m=None, figname='fig', F=2):
plot global distribution of events within time period specified
Inputs -
coords (string): filename with coord list as lat1.lat2.lon1.lon2
ts (string): time series filename with data in rows, space separated
fsize (list):
cmap (string):
m (mpl.mpl_toolkits.Basemap): mpl instance for plotting
figname (string): Name of the Plot
F (int)
Output -
num (np.array): data values
fig (mpl.figure): heatmap of events from fitted data
ax (axis handler): output axis handler
cax (colorbar axis handler): output colorbar axis handler
viz(unet, jsonfile=False, colormap='autumn', res='c', drawpoly=False, figname='fig', BGIMAGE=None, BGIMGNAME='BM', IMGRES='high', WIDTH=0.007):
Utility function to visualize spatio temporal interaction networks
Inputs -
unet (string): json filename
unet (python dict):
jsonfile (bool): True if unet is string specifying json filename
colormap (string): colormap
res (string): 'c' or 'f'
drawpoly (bool): if True draws transparent patch showing srcs
figname (string): prefix of pdf image file
Outputs -
m (Basemap handle)
fig (figure handle)
ax (axis handle)
cax (colorbar handle)
_scaleforsize(a):
normalize array for plotting
Inputs -
a (ndarray): input array
Output -
a (ndarray): output array
- bokeh_pipe library:
visualization library for Network Models produced by uNetworkModels based on bokeh
- Process overview:
This code starts from the point when the json data files have been obtained.
- To get the neighborhood plot:
1. Run json_to_csv on the batch of json files to get the batch of csv files.
2. Run combine_merc to combine the batch of csv files into one csv file in mercator coordinates.
3. Run neighbor_plot on the combined csv file to get the neighborhood plot.
- To get the streamline plot:
1. Same as step 1 of the neighborhood plot (can be skipped if already done).
2. Run streamheat_combine to combine the batch of csv files into one csv file. THIS IS IN A FORMAT DIFFERENT FROM THAT OF THE NEIGHBORHOOD PLOT.
3. Run crime_stream.py on the combined file.
- To get the heatplot:
1. Same as step 1 of the streamline plot.
2. Same as step 2 of the streamline plot.
3. Run heat_map on the combined file.
We have provided two sample datasets for use. ‘crime_filtered_data.csv’ can be considered the combined file for the neighborhood plot. ‘contourmerc.csv’ can be considered the combined file for the streamline plot and the heatplot.
- Functions:
json_to_csv(FILEPATH, DEST):
This function takes a group of json data files and transforms them into csv files for use. Edit the selection variables as you see fit. It is very important that you initialize DEST to a folder, as it generates many csv files. WARNING: Run this function in python2. The rest of the code should use python3. THIS TAKES QUITE A BIT OF TIME.
Inputs -
FILEPATH (string): the filepath to the json files. Example: 'jsons/'
DEST (string): the place for the csv files to be stored. Example: 'csvs/'
combine_merc(DIR, filename, N = 20):
This function combines the csv's into a single file. At the same time, it converts the coordinates from longitude and latitude to mercator, which is necessary to make our neighborhood plot, because our tileset accepts mercator coordinates. This generates one combined csv in the current directory. USE PYTHON 3.
Inputs -
DIR (string): The location (filepath) of the csvs to be combined. Example: 'csvs/'
filename (string): the desired name for the combined csv file. Example: 'combined.csv'
N (int): the max number of sources selected for in json_to_csv: M.select(var='delay',high=20,reverse=False,inplace=True). The high argument is N.
neighbor_plot(filepath= 'crime_filtered_data.csv'):
This is the first implementation of our Bokeh plot. The function takes the filepath of the data and opens the bokeh plot in a browser. Chrome seems to be the best browser for bokeh plots. The datafile must be a csv file in the correct format. See the file 'crime_filtered_data.csv' for an example. Each row represents a point, all the lines (sources) connected to it, and the gammas and delays associated with the lines. The current implementation results in the bokeh plot and a linked table of the data. IMPORTANT: Points are in MERCATOR coordinates. This is because the current tileset for the map is in mercator coordinates. Example file is 'crime_filtered_data.csv'
Inputs -
filepath (string): input data file
streamheat_combine(DIR, filename):
We need to once again combine the csvs, this time into a format appropriate for the streamplots. This function produces two files. File 1 will be in longitude and latitude. File 2 will be in mercator coordinates. We will be primarily working with file 2.
Inputs -
DIR (string): The filepath to the csvs. Ex: 'csvs/'
filename (string): The filename for the combined csv file. Ex: 'contourmerc.csv'
crime_stream(datafile='contourmerc.csv', density=4, npoints=10, output_name='streamplot.html', method = 'cubic'):
This function takes a csv datafile of crime vectors, reads it into a pandas dataframe, and plots the streamplot using Delaunay interpolation. The function will open the plot in a new browser. Use Chrome.
Inputs -
datafile: name of the csv file. Example file is 'contourmerc.csv'
density: desired line density of the plot. Ex: 4.
npoints: The dimensions used for the streamplot. The grid will have npoints**2 number of grids. It is not advised to have npoints > 200. Recommended: npoints = 10.
output_name: name to save plot to.
method: method for interpolation. 'cubic', 'linear', or 'nearest'
heat_map(datafile='contourmerc.csv', npoints=300, output_name='heatmap.html', method = 'linear'):
Makes a heatmap from the same datafile that crime_stream uses.
Inputs -
datafile: name of the datafile. Example file is 'contourmerc.csv'.
npoints: dimension for plot. number of squares = npoints**2. Recommended: 100-300
output_name (string): output file name for the plot.
method (string): method for interpolation. 'cubic', 'linear', or 'nearest'
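Since the neighborhood plot requires mercator coordinates, it may help to see what that conversion involves. The sketch below uses the standard spherical Web Mercator projection, which is the convention used by common web map tilesets. Whether combine_merc uses exactly this formula is an assumption, and the function name lonlat_to_mercator is hypothetical.

```python
import math

# WGS84 semi-major axis in metres, as used by spherical Web Mercator.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Approximate coordinates of downtown Chicago.
x, y = lonlat_to_mercator(-87.6298, 41.8781)
print(round(x), round(y))
```

Points west of the prime meridian get a negative x and points north of the equator get a positive y, both in metres.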
VERSION 1.1.03