# hybridfactory

A utility for generating hybrid ephys data: hybrid ground-truth generation for spike sorting.
## Installation

The best way to get started is to install Anaconda or Miniconda. Once you've done that, fire up your favorite terminal emulator (PowerShell or CMD on Windows, though we recommend CMD; iTerm2 or Terminal on macOS; take your pick on Linux) and navigate to the directory containing this README file, which also contains `environment.yml`.
On UNIX variants, type:

```shell
$ conda env create -n hybridfactory
Solving environment: done
Downloading and Extracting Packages
...
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate hybridfactory
#
# To deactivate an active environment, use:
# > source deactivate
#
$ source activate hybridfactory
```
On Windows:

```shell
$ conda env create -n hybridfactory
Solving environment: done
Downloading and Extracting Packages
...
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > activate hybridfactory
#
# To deactivate an active environment, use:
# > deactivate
#
# * for power-users using bash, you must source
#
$ activate hybridfactory
```
and you should be good to go. Remember to run `[source] activate hybridfactory` every time you open up a new shell.
## Usage

This tool is primarily a command-line utility. Provided you have a parameter file, you can invoke it like so:

```shell
(hybridfactory) $ hybridfactory generate /path/to/params.py
```
Right now, `generate` is the only command available, allowing you to generate hybrid data from a pre-existing raw data set and output from a spike-sorting tool, e.g., KiloSort or JRCLUST. This is probably what you want to do.
After your hybrid data has been generated, we have some validation tools you can use to look at your hybrid output, but these are not yet as convenient as the command-line tool.
## A word about bugs
This software is under active development. Although we strive for accuracy and consistency, there's a good chance you'll run into some bugs. If you run into an exception which crashes the program, you should see a helpful message with my email address and a traceback. If you find something a little more subtle, please post an issue on the issue page.
## Parameter file
Rather than pass a bunch of flags and arguments to `hybridfactory`, we have collected all the parameters in a parameter file, `params.py`. We briefly explain each option below. See `params_example.py` for an example.
### Required parameters

- `data_directory`: Directory containing output from your spike sorter, e.g., `rez.mat` or `*.npy` for KiloSort; or `*_jrc.mat` and `*_spk(raw|wav|fet).jrc` for JRCLUST.
- `raw_source_file`: Path to the file containing raw source data (currently only SpikeGL[X]-formatted data is supported). This can also be a glob if you have multiple data files.
- `data_type`: Type of raw data, as a NumPy data type. (I have only seen `int16`.)
- `sample_rate`: Sample rate of the source data, in Hz.
- `ground_truth_units`: Cluster labels (1-based indexing) of ground-truth units from your spike sorter's output.
- `start_time`: Start time (0-based) of the recording in the data file, in sample units. A nonnegative integer if `raw_source_file` is a single file, or an iterable of nonnegative integers if you have a globbed `raw_source_file`. If you have SpikeGL meta files, you can use `hybridfactory.io.spikegl.get_start_times` to get these automagically.
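As a sketch, the required section of a `params.py` might look like the following; all paths and cluster labels here are illustrative, not values from a real data set:

```python
import numpy as np

# illustrative values; point these at your own sorter output and raw data
data_directory = "/path/to/sorter/output"
raw_source_file = "/path/to/raw/data.bin"  # may also be a glob, e.g. "/path/to/raw/data*.bin"
data_type = np.int16                       # as a NumPy data type
sample_rate = 30000                        # in Hz
ground_truth_units = [5, 12, 27]           # 1-based cluster labels from your sorter
start_time = 0                             # 0-based, in sample units
```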
### Probe configuration

- `probe_type`: Probe layout. This is pretty open-ended, so it is up to you to construct. If you have a Neuropixels Phase 3A probe with the standard reference channels, you have it easy: just put `neuropixels3a()` for this value. Otherwise, you'll need to construct the following NumPy arrays to describe your probe:
  - `channel_map`: a 1-d array of `n` ints describing which row in the data to look at for which channel (0-based).
  - `connected`: a 1-d array of `n` bools, with entry `k` being `True` if and only if channel `k` was used in the sorting.
  - `channel_positions`: an $n \times 2$ array of floats, with row `k` holding the x and y coordinates of channel `k`.
  - `name` (optional): a string giving the model name of your probe. This is just decorative for now.
With these parameters, you can pass them to `hybridfactory.probes.custom_probe` like so:

```python
# if your probe has a name
probe = hybridfactory.probes.custom_probe(channel_map, connected, channel_positions, name)

# alternatively, if you don't want to specify a name
probe = hybridfactory.probes.custom_probe(channel_map, connected, channel_positions)
```

Be sure to `import hybridfactory.probes` in your `params.py` (see the example `params.py` to get a feel for this).
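For a concrete feel for the three arrays, here is a sketch for a hypothetical 4-channel, single-column probe where channel 2 was excluded from the sorting; the geometry and channel count are made up for illustration:

```python
import numpy as np

# hypothetical 4-channel, single-column probe; channel 2 was not sorted
channel_map = np.array([0, 1, 2, 3])             # data row for each channel, 0-based
connected = np.array([True, True, False, True])  # True iff the channel was used in sorting
channel_positions = np.array([[0.0, 0.0],
                              [0.0, 20.0],
                              [0.0, 40.0],
                              [0.0, 60.0]])      # (x, y) coordinates, one row per channel
# these three arrays are what custom_probe expects, per the example above
```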
### Optional parameters

- `session_name`: String giving an identifying name to your hybrid run. Default is an MD5 hash computed from the current timestamp.
- `random_seed`: Nonnegative integer in the range $[0, 2^{31})$. Because this algorithm is randomized, setting a random seed allows for reproducible output. The default is itself randomly generated, but will be output in an `hfparams_[session_name].py` on successful completion.
- `output_directory`: Path to the directory where you want to output the hybrid data. (This includes raw data files and annotations.) Defaults to `data_directory/hybrid_output`.
- `output_type`: Type of output from your spike sorter. One of "phy" (for `*.npy`), "kilosort" (for `rez.mat`), or "jrc" (for `*_jrc.mat` and `*_spk(raw|wav|fet).jrc`). `hybridfactory` will try to infer it from files in `data_directory` if not specified.
- `num_singular_values`: Number of singular values to use in the construction of artificial events. Default is 6.
- `channel_shift`: Number of channels to shift artificial events up or down from their source. Default depends on the probe used.
- `synthetic_rate`: Firing rate, in Hz, for hybrid units. This should be either an empty list (if you want to use the implicit firing rate of your ground-truth units) or an iterable of artificial rates. In the latter case, you must specify a firing rate for each ground-truth unit. Default is the implicit firing rate of each unit.
- `time_jitter`: Scale factor for (normally-distributed) random time shift, in sample units. Default is 100.
- `amplitude_scale_min`: Minimum factor for (uniformly-distributed) random amplitude scaling, in percentage units. Default is 1.
- `amplitude_scale_max`: Maximum factor for (uniformly-distributed) random amplitude scaling, in percentage units. Default is 1.
- `samples_before`: Number of samples to take before an event timestep for artificial event construction. Default is 40.
- `samples_after`: Number of samples to take after an event timestep for artificial event construction. Default is 40.
- `copy`: Whether or not to copy the source file to the target. You usually want to do this, but if the file is large and you know where your data has been perturbed, you could use `HybridDataSet.reset` instead. Default is False.
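A sketch of what optional overrides might look like in `params.py`; every value below is illustrative rather than a recommended setting, and the amplitude-scale comments assume a factor of 1 means 100%:

```python
# illustrative optional overrides for params.py
session_name = "hybrid-demo-01"
random_seed = 10191            # any nonnegative int < 2**31, for reproducible runs
num_singular_values = 6
time_jitter = 100              # scale of the normally-distributed time shift, in samples
amplitude_scale_min = 0.75     # uniform amplitude scaling from 75%... (assuming 1 == 100%)
amplitude_scale_max = 1.25     # ...up to 125% of the source amplitude
samples_before = 40
samples_after = 40
copy = True                    # write a full copy of the raw source file
```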
## Validation tools
For KiloSort output, we compare (shifted) templates associated with the artificial events to templates from the sorting of the hybrid data. This will probably be meaningless unless you use the same master file to sort the hybrid data that you used to sort the data from which we derived our artificial events. We compare in one of two ways: by computing Pearson correlation coefficients of the flattened templates (in which case, higher is better), or by computing the Frobenius norm of the difference of the two templates (lower is better here). When we find the best matches in a 2 ms interval around each true firing, we can generate a confusion matrix to see how we did.
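As a sketch of the two comparisons (not hybridfactory's internal code), given two templates as 2-d NumPy arrays:

```python
import numpy as np

# toy stand-ins for a true template and its match in the hybrid sorting
rng = np.random.default_rng(0)
true_template = rng.normal(size=(61, 32))  # (samples, channels), shape is illustrative
hybrid_template = true_template + 0.05 * rng.normal(size=true_template.shape)

# Pearson correlation of the flattened templates: higher is better
r = np.corrcoef(true_template.ravel(), hybrid_template.ravel())[0, 1]

# Frobenius norm of the difference: lower is better
d = np.linalg.norm(true_template - hybrid_template)
```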
This functionality is not in `generate.py`, but should be used in a Jupyter notebook (for now). Adding a demo notebook is a TODO, and so is adding more validation tools; suggestions for tools you'd want to see are always welcome.
## Output

If successful, `generate.py` will output several files in `output_directory`:

- Raw data files. The filenames of your source data files will be reused, inserting `.GT` before the file extension. For example, if your source file is called `data.bin`, the target file will be named `data.GT.bin` and will live in `output_directory`.
- Dataset save files. These include:
  - `metadata-[session_name].csv`: a table of filenames, start times, and sample rates of the files in your hybrid dataset (start times and sample rates should match those of your source files).
  - `annotations-[session_name].csv`: a table of (real and synthetic) cluster IDs, timesteps, and templates (KiloSort only) or assigned channels (JRCLUST only).
  - `artificial_units-[session_name].csv`: a table of new cluster IDs, true units, timesteps, and templates (KiloSort only) or assigned channels (JRCLUST only) for your artificial units.
  - `probe-[session_name].npz`: a NumPy-formatted archive of data describing your probe. (See Probe configuration for a description of these data.)
  - `dtype-[session_name].npy`: a NumPy-formatted archive containing the sample rate of your dataset in the same format as your raw dataset.
- `firings_true.npy`: a $3 \times K$ array of `uint64`, where $K$ is the number of events generated.
  - Row 0 is the channel on which the event is centered, zero-based.
  - Row 1 is the timestamp of the event in sample units, zero-based.
  - Row 2 is the unit/cluster ID from the original data set for the event.
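Loading and unpacking that array is straightforward; the sketch below builds a small stand-in array (three made-up events) in place of reading a real `firings_true.npy`:

```python
import numpy as np

# stand-in for np.load("firings_true.npy"): a 3 x K uint64 array, K = 3 here
firings_true = np.array([[ 10,  52,   7],     # row 0: center channel, 0-based
                         [100, 250, 900],     # row 1: timestamp in samples, 0-based
                         [  3,   3,   5]],    # row 2: original cluster ID
                        dtype=np.uint64)

# one 1-d array per row, each of length K
channels, timestamps, cluster_ids = firings_true
```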