fsspec backend for the DNAnexus platform
fsspec_dnanexus
fsspec backend for DNAnexus
https://filesystem-spec.readthedocs.io/en/latest/
Installation
pip install fsspec-dnanexus
Usage
A URL for fsspec_dnanexus is constructed as follows. Paths must be absolute, with a leading forward slash:
f"dnanexus://{PROJECT_NAME_OR_ID}:/{PATH_TO_FILE}"
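The URL pattern above can be wrapped in a small helper. This is a minimal sketch and not part of fsspec_dnanexus: the function name build_dx_url is hypothetical, and it only enforces the leading-slash rule described above.

```python
def build_dx_url(project: str, path: str) -> str:
    """Build a dnanexus:// URL from a project name or ID and an absolute path.

    Hypothetical helper for illustration; fsspec_dnanexus does not ship it.
    """
    if not path.startswith("/"):
        raise ValueError(f"path must be absolute (start with '/'): {path!r}")
    return f"dnanexus://{project}:{path}"


print(build_dx_url("my-dx-project", "/folder/filename1.csv"))
```

Rejecting relative paths early gives a clearer error than a failed lookup on the platform would.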
Supported fsspec commands are listed below:
import fsspec
dxfs = fsspec.filesystem("dnanexus")
# Creating a directory
dxfs.mkdir("dnanexus://my-dx-project:/my/new/folder", create_parents=True)
# create_parents is True by default, but you can override it. In this case, if the immediate parent does not exist,
# an exception will be thrown
dxfs.mkdir("dnanexus://my-dx-project:/my/other/new/folder", create_parents=False)
# Directory listing
dxfs.ls("dnanexus://my-dx-project:/my/folder")
# Directory listing with entity type, size, etc
dxfs.ls("dnanexus://my-dx-project:/my/folder", detail=True)
DNAnexus entities are represented when listing a directory, but currently only files and folders can be manipulated.
Libraries such as pandas and modin have fsspec support built in, so you can use dnanexus:// URLs with them once fsspec_dnanexus is installed.
Examples
Reading a file in pandas using project name
import pandas as pd
df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv")
Writing a pandas dataframe using project ID
df.to_csv("dnanexus://project-XXXX:/folder/filename2.csv")
Reading using fsspec.open
with fsspec.open("dnanexus://project-XXXX:/folder/filename2.csv", 'r') as fp:
    fp.read()
Reading the first 10 rows of all CSV files in a folder
results = dxfs.ls("dnanexus://my-dx-project:/my/folder", detail=True)
# N.B. the 'type' attribute can be any DNAnexus entity type, such as 'file', 'directory', 'applet', 'record'
files = [x for x in results if x['type'] == 'file' and x['name'].endswith('csv')]
for f in files:
    url = f"dnanexus://{f['project']}:{f['name']}"
    df = pd.read_csv(url)
    print(df.iloc[:10])
When writing files, any intermediate directories that do not exist are created by default.
Handling of duplicate file paths
The DNAnexus platform allows duplicate filenames (but not duplicate folders).
When reading files, a file URL resolves only to the file with the latest creation date and ignores the rest.
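The read rule can be pictured as picking the newest entry among files that share a path. The sketch below only illustrates that rule and is not the library's code: the pick_latest helper, the created field, and the sample entries are all hypothetical.

```python
def pick_latest(entries):
    """Return the entry with the most recent creation timestamp.

    'created' is a hypothetical field name used for illustration only.
    """
    return max(entries, key=lambda entry: entry["created"])


# Two made-up entries that share the same path.
duplicates = [
    {"id": "file-AAAA", "name": "/folder/data.csv", "created": 1685300000},
    {"id": "file-BBBB", "name": "/folder/data.csv", "created": 1685400000},
]

chosen = pick_latest(duplicates)
print(chosen["id"])  # file-BBBB
```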
When writing files, the default behaviour mimics a POSIX file system so that there are no surprises for users coming from other fsspec backends, but this can easily be overridden in storage_options:
- allow_duplicate_filenames: False - [default] removes existing file(s) with the same path and writes the new file (just like file://, s3:// and other backends would)
- allow_duplicate_filenames: True - writes the file at specified path disregarding existing files
If the user's token allows for file writing but does not allow for the removal of the file (e.g. protected projects), the behaviour falls back to allow_duplicate_filenames: True to ensure no data loss.
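As a sketch of how the option above might be passed from pandas, the helper below forwards it through the storage_options argument of to_csv. The function names duplicate_options and write_csv are hypothetical; only the allow_duplicate_filenames key comes from this README, and the final call is left commented out because it needs a real project and token.

```python
def duplicate_options(allow_duplicates: bool) -> dict:
    """Build the storage_options dict for the duplicate-handling switch."""
    return {"allow_duplicate_filenames": allow_duplicates}


def write_csv(df, url: str, allow_duplicates: bool = False) -> None:
    """Write a DataFrame-like object to url with the chosen duplicate policy.

    Hypothetical wrapper: df only needs a pandas-style to_csv method.
    """
    df.to_csv(url, storage_options=duplicate_options(allow_duplicates))


print(duplicate_options(True))  # {'allow_duplicate_filenames': True}

# Example call (requires pandas, fsspec_dnanexus and valid credentials):
# write_csv(df, "dnanexus://my-dx-project:/folder/filename2.csv", allow_duplicates=True)
```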
Credentials
The credentials used by fsspec_dnanexus to access DNAnexus are resolved in the following order:
- The token parameter passed into DXFileSystem's constructor or when making pandas calls
  dxfs = DXFileSystem(storage_options={"token": "YOUR_DNANEXUS_TOKEN"})
  df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv", storage_options={"token": "YOUR_DNANEXUS_TOKEN"})
- The FSSPEC_DNANEXUS_TOKEN environment variable
  os.environ['FSSPEC_DNANEXUS_TOKEN'] = "YOUR_DNANEXUS_TOKEN"
  df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv")
- The credentials currently used by dxpy. If you're using a DNAnexus workstation, this is a good place to start.
  df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv")
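The resolution order above can be sketched as a small function. This is an illustration of the documented precedence only, not the library's actual implementation; the name resolve_token is hypothetical, and returning None stands in for "fall back to whatever dxpy is configured with".

```python
import os


def resolve_token(explicit_token=None, environ=None):
    """Illustrate the documented precedence for finding a DNAnexus token.

    Hypothetical helper: returns the token to use, or None to signal that
    the credentials already held by dxpy should be used instead.
    """
    env = os.environ if environ is None else environ
    if explicit_token:                               # 1. token parameter
        return explicit_token
    env_token = env.get("FSSPEC_DNANEXUS_TOKEN")     # 2. environment variable
    if env_token:
        return env_token
    return None                                      # 3. defer to dxpy


print(resolve_token("abc", {"FSSPEC_DNANEXUS_TOKEN": "xyz"}))  # abc
print(resolve_token(None, {"FSSPEC_DNANEXUS_TOKEN": "xyz"}))   # xyz
print(resolve_token(None, {}))                                 # None
```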
Limitations
- The following commands are currently unsupported:
  - dxfs.touch
  - dxfs.cat
  - dxfs.copy
  - dxfs.rm
- Parquet support has not been tested.
- No caching, which means repeated reads of a file will be inefficient.
- fsspec transactions (e.g. with fs.transaction:) are not supported.
- Files not in the 'closed' state on DNAnexus are currently not listed by ls().
Logging
You can override the logging level by setting the following environment variable:
os.environ["FSSPEC_DNANEXUS_LOGGING_LEVEL"] = "DEBUG"
Valid logging levels are listed here: https://docs.python.org/3/library/logging.html#levels
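A minimal sketch of setting the variable with validation against the standard Python level names. The helper set_dx_logging_level is hypothetical. When fsspec_dnanexus reads the variable, and whether it accepts lowercase names, are not stated here, so setting it before the first import and upper-casing the name are cautious assumptions.

```python
import os

# Level names from the Python logging documentation linked above.
VALID_LEVELS = ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET")


def set_dx_logging_level(level: str) -> str:
    """Validate a level name and export it for fsspec_dnanexus.

    Hypothetical helper; only the environment variable name comes from
    this README.
    """
    name = level.upper()
    if name not in VALID_LEVELS:
        raise ValueError(f"unknown logging level: {level!r}")
    os.environ["FSSPEC_DNANEXUS_LOGGING_LEVEL"] = name
    return name


print(set_dx_logging_level("debug"))  # DEBUG
```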
Changelog
[0.0.2] - 2023-05-29
Fixed
- Project description on PyPI, corrected import statement in README
[0.0.1] - 2023-05-29
Added
- Initial release