A tool for translating between scientific data structures
dastr
Motivation
Open-source datasets often aren't organized according to a standard directory structure. This means that, if your analysis code expects a particular format, either you have to create a new version of it that loops through files differently or you have to manually move and rename the files to match your code's expected format. dastr automates the latter approach.
How to use
dastr can be installed through pip (pip install dastr).
There are 3 steps to convert between directory structures: (1) reading, (2) translating, and (3) writing.
1. Reading
There will generally be metadata encoded in a dataset's folder and file names. For example, maybe each participant's data are stored in a separate folder and these are called "p01", "p02", etc. Maybe it's a longitudinal EEG study and within each participant-specific folder are files called "pre.edf" and "post.edf".
dastr uses regular expression captures (with Python's re library) to read this data into a table (a list of dicts) where each row (element) is a file and each column (key) is an attribute of that file (for example, participant ID and session).
To read in the data from the above example, you would do the following:
import dastr
data_path = '/path/to/data/'
files = dastr.read(
    path=data_path,
    params=[
        (r"p(.+)", "participant ID"),
        (r"(.+)\.edf", "session")
    ],
    disp=True)
The function starts from the folder specified by the path argument and works down from there. The action to take at each level of the directory structure is specified by the corresponding element of the params argument (a list of tuples whose elements are strings). The first element of each tuple is a regular expression. If the regular expression contains any captures, they are recorded as the attributes named by the remaining elements of the tuple. For example, (r"(.+)\.edf", "session") in the above means that, once the program gets to the file /path/to/data/p01/pre.edf, it should run something equivalent in this case to:
curr_file["session"] = re.findall(r"(.+)\.edf", "pre.edf")[0]
If the current file or folder doesn't match the regular expression, it is skipped. If the tuple is empty, the regular expression defaults to the wildcard character (.).
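To make the capture step concrete, here is a minimal sketch that uses only Python's re library. It is not dastr's actual implementation: the helper name read_attrs is hypothetical, and it assumes that each element of params is applied to one path component in turn.

```python
import re

def read_attrs(relative_path, params):
    """Apply one (pattern, attribute names...) tuple per path component.

    Hypothetical helper for illustration; not part of dastr's API.
    Returns a dict of attributes, or None if any component fails to match.
    """
    attrs = {}
    components = relative_path.strip("/").split("/")
    for component, (pattern, *names) in zip(components, params):
        captures = re.findall(pattern, component)
        if not captures:
            return None  # no match: this file or folder is skipped
        # With one capture group, findall returns a list of strings;
        # with several groups, it returns a list of tuples.
        first = captures[0]
        values = (first,) if isinstance(first, str) else first
        attrs.update(zip(names, values))
    return attrs

params = [
    (r"p(.+)", "participant ID"),
    (r"(.+)\.edf", "session"),
]
print(read_attrs("p01/pre.edf", params))  # attributes for a matching file
print(read_attrs("q01/pre.edf", params))  # None, because "q01" does not match
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslash in \. as a string escape.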
dastr.read() returns a list of dicts with keys attrs and path. path specifies the path to the actual file while attrs contains another dictionary that stores the attributes. That is, in the above example, files[0]["path"] would be "/path/to/data/p01/pre.edf" while files[0]["attrs"] would be {"participant ID": "01", "session": "pre"}. To print this information as it is being read, set the optional argument disp=True as in the above (or omit this argument to print nothing).
In some cases, this might be all you need. You could use dastr.flatten() to get a new list of dicts with key-value pairs copied from files[:]["attrs"], plus an additional key-value pair specifying the path of the file. This makes perfect input for DictWriter from Python's csv library. Your analysis code could then read the resulting CSV table, use it to locate each data file, and add its outputs as new columns. Or maybe you don't want to change your analysis code.
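As a rough sketch of the CSV route, the following writes a flattened table with csv.DictWriter. The rows are written by hand rather than produced by dastr.flatten(), and the key name "path" for the file location is an assumption; check the actual output before relying on it.

```python
import csv
import io

# Hand-written stand-in for the output of dastr.flatten(files).
# The key name "path" is an assumption, not confirmed dastr behaviour.
flat = [
    {"participant ID": "01", "session": "pre", "path": "/path/to/data/p01/pre.edf"},
    {"participant ID": "01", "session": "post", "path": "/path/to/data/p01/post.edf"},
]

def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, one column per key."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(flat)
print(csv_text)
```

To write to disk instead, pass a file opened with open("files.csv", "w", newline="") to csv.DictWriter in place of the in-memory buffer.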
2. Translating
Suppose instead of "pre" and "post", you want to call the sessions "01" and "02". The function to use then is dastr.translate():
translation = {
    "session": {
        "pre": "01",
        "post": "02"
    }
}
translated = dastr.translate(
    files=files,
    translation=translation)
original = dastr.translate(
    files=translated,
    translation=translation,
    direction="reverse")  # Or equivalently "backward", or actually anything other than "forward"
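The idea behind the forward and reverse directions can be sketched in plain Python. The function translate_attrs below is a hypothetical stand-in, not dastr's implementation; it assumes the table layout described under Reading (a list of dicts with attrs and path keys) and leaves values without a translation unchanged.

```python
import copy

def translate_attrs(files, translation, direction="forward"):
    """Return a copy of files with attribute values renamed.

    Hypothetical stand-in for illustration; not dastr's implementation.
    """
    translated = copy.deepcopy(files)
    for attr, mapping in translation.items():
        if direction != "forward":
            # Invert the mapping, e.g. {"pre": "01"} becomes {"01": "pre"}.
            mapping = {new: old for old, new in mapping.items()}
        for file in translated:
            if attr in file["attrs"]:
                value = file["attrs"][attr]
                # Values without a translation are left unchanged.
                file["attrs"][attr] = mapping.get(value, value)
    return translated

files = [
    {"path": "/path/to/data/p01/pre.edf",
     "attrs": {"participant ID": "01", "session": "pre"}},
    {"path": "/path/to/data/p01/post.edf",
     "attrs": {"participant ID": "01", "session": "post"}},
]
translation = {"session": {"pre": "01", "post": "02"}}

translated = translate_attrs(files, translation)
original = translate_attrs(translated, translation, direction="reverse")
print([f["attrs"]["session"] for f in translated])
print(original == files)
```

Reversing only round-trips cleanly when the mapping is one-to-one; if two original values map to the same new value, the inverse is ambiguous.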
3. Writing
Actually moving the files is very similar to dastr.read()ing them. The difference is that, instead of reading attributes in using re captures, you're writing them out using string formatting. For example,
new_path = "/new/data/path/"
destinations = dastr.write(
    files=translated,  # the translated table, so sessions are written as "01" and "02"
    path=new_path,
    params=[
        "alldata",
        ("sub-%s", "participant ID"),
        ("sub-%s_ses-%s.edf", "participant ID", "session")])
creates the variable destinations, a list of strings specifying the new locations of the files (though they haven't been deleted from their old locations):
/new/data/path/alldata/sub-01/sub-01_ses-01.edf
/new/data/path/alldata/sub-01/sub-01_ses-02.edf
/new/data/path/alldata/sub-02/sub-02_ses-01.edf
/new/data/path/alldata/sub-02/sub-02_ses-02.edf
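The way each destination is assembled from the params list can be illustrated with ordinary %-style string formatting. This is only a sketch under the assumption that plain strings are used as-is and tuples are formatted with the named attributes; build_destination is a hypothetical helper, not part of dastr.

```python
import posixpath

def build_destination(root, params, attrs):
    """Join one formatted path component per entry of params.

    Hypothetical helper for illustration; not part of dastr's API.
    """
    components = []
    for param in params:
        if isinstance(param, str):
            components.append(param)  # plain string: used as-is
        else:
            template, *names = param
            values = tuple(attrs[name] for name in names)
            components.append(template % values)
    return posixpath.join(root, *components)

params = [
    "alldata",
    ("sub-%s", "participant ID"),
    ("sub-%s_ses-%s.edf", "participant ID", "session"),
]
attrs = {"participant ID": "01", "session": "01"}
destination = build_destination("/new/data/path/", params, attrs)
print(destination)
```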
If you did want to delete the old files, you could add the optional argument key="x" to use Python's shutil.move() (by default, key is "c" which uses Python's shutil.copy()). You can also set key to any function that takes old_path, new_path as its arguments, or you can set it to "n" which doesn't touch the files at all (in which case you'd probably also want to set disp=True).
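As an example of a custom function with the old_path, new_path signature, the sketch below copies a file while preserving its timestamps. The name copy_with_metadata is hypothetical. Whether dastr creates destination folders before calling the function is not stated here, so the function creates them itself.

```python
import os
import shutil
import tempfile

def copy_with_metadata(old_path, new_path):
    """Copy a file, keeping timestamps and creating folders as needed."""
    os.makedirs(os.path.dirname(new_path), exist_ok=True)
    shutil.copy2(old_path, new_path)

# Direct call on a temporary file, to show the function on its own.
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "p01", "pre.edf")
    new_path = os.path.join(tmp, "alldata", "sub-01", "sub-01_ses-01.edf")
    os.makedirs(os.path.dirname(old_path))
    with open(old_path, "w") as f:
        f.write("placeholder")
    copy_with_metadata(old_path, new_path)
    copied = os.path.exists(new_path) and os.path.exists(old_path)

print(copied)
```

Following the description above, such a function would be passed as key=copy_with_metadata in the call to dastr.write().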
How to use with JSON
To avoid hard-coding the parameters and translations, you can instead specify them in .json files (thanks to Gabi Herman for the suggestion):
Reading and writing with JSON
Instead of running
dastr.[read/write](
    ...,
    params=params,
    ...)
you would run
dastr.[read/write](
    ...,
    params=dastr.json_to_params("path/to/file.json"),
    ...)
where file.json is formatted in one of 3 ways:
- a list of dicts
[
    {
        "pattern": "p(.+)",
        "attrs": "participant ID"
    },
    {
        "pattern": "(.+)\\.edf",
        "attrs": "session"
    }
]
- a dict of lists
{
    "patterns": [
        "p(.+)",
        "(.+)\\.edf"
    ],
    "attrs": [
        "participant ID",
        "session"
    ]
}
- a list of lists and strings
[
    "alldata",
    ["sub-%s", "participant ID"],
    ["sub-%s_ses-%s.edf", "participant ID", "session"]
]
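The three layouts carry the same information, which the following sketch demonstrates by normalizing each into the list-of-tuples form used for params. The helper normalize_params is hypothetical and is not how dastr.json_to_params() is implemented. Note that JSON requires a literal backslash to be written as \\, so the pattern (.+)\.edf must appear in the file as "(.+)\\.edf".

```python
import json

def normalize_params(parsed):
    """Convert any of the three JSON layouts to a list of tuples and strings.

    Hypothetical helper for illustration; not dastr.json_to_params().
    """
    def as_list(attrs):
        return [attrs] if isinstance(attrs, str) else list(attrs)

    if isinstance(parsed, dict):  # dict of lists
        return [(p, *as_list(a))
                for p, a in zip(parsed["patterns"], parsed["attrs"])]
    params = []
    for item in parsed:
        if isinstance(item, dict):  # list of dicts
            params.append((item["pattern"], *as_list(item.get("attrs", []))))
        elif isinstance(item, str):  # bare string such as "alldata"
            params.append(item)
        else:  # list such as ["sub-%s", "participant ID"]
            params.append(tuple(item))
    return params

list_of_dicts = json.loads(r'''
[
    {"pattern": "p(.+)", "attrs": "participant ID"},
    {"pattern": "(.+)\\.edf", "attrs": "session"}
]
''')
dict_of_lists = json.loads(r'''
{
    "patterns": ["p(.+)", "(.+)\\.edf"],
    "attrs": ["participant ID", "session"]
}
''')
mixed_list = json.loads(
    '["alldata", ["sub-%s", "participant ID"],'
    ' ["sub-%s_ses-%s.edf", "participant ID", "session"]]')

print(normalize_params(list_of_dicts) == normalize_params(dict_of_lists))
print(normalize_params(mixed_list))
```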
Translating with JSON
Here all you need to do is copy and paste your hard-coded translation variable into a .json file (replacing single quotes with doubles), and pass the location of this file to dastr.translate:
translated = dastr.translate(
    ...,
    translation="path/to/translation.json")
Running dastr.json_to_params wouldn't work here.
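As an alternative to copying and pasting by hand, Python's json library can write the translation dictionary to a file, which takes care of the quoting. This sketch writes to a temporary folder purely for demonstration; the file location is a placeholder.

```python
import json
import os
import tempfile

translation = {"session": {"pre": "01", "post": "02"}}

# Write to a temporary folder here; use your own location in practice.
json_path = os.path.join(tempfile.mkdtemp(), "translation.json")
with open(json_path, "w") as f:
    json.dump(translation, f, indent=4)

# Read the file back to confirm that it round-trips unchanged.
with open(json_path) as f:
    reloaded = json.load(f)
print(reloaded == translation)
```

The resulting json_path can then be passed as the translation argument of dastr.translate(), as described above.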