
fsspec backend for the DNAnexus platform


fsspec_dnanexus

fsspec backend for DNAnexus

Installation

pip install fsspec-dnanexus

Usage

A URL for fsspec_dnanexus is constructed as follows. Paths must be absolute, with a leading forward slash:

f"dnanexus://{PROJECT_NAME_OR_ID}:/{PATH_TO_FILE}"
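The URL format above can be wrapped in a small helper. This is a minimal sketch: dnanexus_url is a hypothetical function, not part of fsspec_dnanexus, and it only enforces the absolute-path rule stated above.

```python
def dnanexus_url(project: str, path: str) -> str:
    """Build a dnanexus:// URL from a project name or ID and an absolute path.

    Hypothetical helper for illustration; fsspec_dnanexus does not ship it.
    """
    if not path.startswith("/"):
        raise ValueError(f"path must be absolute (start with '/'): {path!r}")
    return f"dnanexus://{project}:{path}"


print(dnanexus_url("my-dx-project", "/folder/data.csv"))
# dnanexus://my-dx-project:/folder/data.csv
```

Failing early on a relative path keeps the mistake close to where the URL is built instead of surfacing later as a lookup error.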

Supported fsspec commands are listed below:

import fsspec
dxfs = fsspec.filesystem("dnanexus")

# Creating a directory
dxfs.mkdir("dnanexus://my-dx-project:/my/new/folder", create_parents=True)

# create_parents is True by default, but you can override it. In this case, if the immediate parent does not exist,
# an exception is raised
dxfs.mkdir("dnanexus://my-dx-project:/my/other/new/folder", create_parents=False)

# Directory listing
dxfs.ls("dnanexus://my-dx-project:/my/folder")

# Directory listing with entity type, size, etc
dxfs.ls("dnanexus://my-dx-project:/my/folder", detail=True)

All DNAnexus entity types are shown when listing a directory, but currently only files and folders can be manipulated.

Libraries such as pandas and modin have fsspec support built in, so you can pass dnanexus:// URLs to them once fsspec_dnanexus is installed.

Examples

Reading a CSV or parquet in pandas using project name

import pandas as pd

df = pd.read_csv("dnanexus://my-dx-project:/folder/data.csv")
df = pd.read_parquet("dnanexus://my-dx-project:/folder/data.parquet")

Writing a pandas dataframe using project ID

df.to_csv("dnanexus://project-XXXX:/folder/filename2.csv")

Reading using fsspec.open

with fsspec.open("dnanexus://project-XXXX:/folder/filename2.csv", 'r') as fp:
    fp.read()

Reading the first 10 rows of all CSV files in a folder

results = dxfs.ls("dnanexus://my-dx-project:/my/folder", detail=True)
# N.B. the 'type' attribute can be any DNAnexus entity type, such as 'file', 'directory', 'applet', 'record'
files = [x for x in results if x['type'] == 'file' and x['name'].endswith('.csv')]
for f in files:
    url = f"dnanexus://{f['project']}:{f['name']}"
    df = pd.read_csv(url)
    print(df.iloc[:10])

When writing files, any intermediate directories that do not exist are created by default.

Handling of duplicate file paths

The DNAnexus platform allows duplicate filenames (but not duplicate folders).

When reading files, a file URL resolves only to the file with the latest creation date; the rest are ignored.

When writing files, the default behaviour mimics a POSIX file system so that users coming from other fsspec backends get no surprises, but this can easily be overridden in storage_options:

  • allow_duplicate_filenames: False - [default] removes existing file(s) with the same path and writes the new file (just like file://, s3:// and other backends would)
  • allow_duplicate_filenames: True - writes the file at specified path disregarding existing files

If the user's token allows for file writing but does not allow for the removal of the file (e.g. protected projects), the behaviour falls back to allow_duplicate_filenames: True to ensure no data loss.
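The read and write rules above can be illustrated with a toy in-memory model. This is a sketch of the described semantics only: ToyProject, its fields, and the fake creation counter are all hypothetical and share no code with fsspec_dnanexus or the DNAnexus API.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyProject:
    """In-memory stand-in for a project; NOT part of fsspec_dnanexus."""
    files: list = field(default_factory=list)      # (path, created, data) tuples
    _clock: count = field(default_factory=count)   # fake, ever-increasing creation time

    def write(self, path, data, allow_duplicate_filenames=False):
        if not allow_duplicate_filenames:
            # Default: remove every existing file at this path before writing
            self.files = [f for f in self.files if f[0] != path]
        self.files.append((path, next(self._clock), data))

    def read(self, path):
        matches = [f for f in self.files if f[0] == path]
        if not matches:
            raise FileNotFoundError(path)
        # Resolve to the file with the latest creation time, ignoring the rest
        return max(matches, key=lambda f: f[1])[2]


project = ToyProject()
project.write("/folder/a.csv", "v1")
project.write("/folder/a.csv", "v2", allow_duplicate_filenames=True)
print(len(project.files), project.read("/folder/a.csv"))  # 2 v2
project.write("/folder/a.csv", "v3")
print(len(project.files), project.read("/folder/a.csv"))  # 1 v3
```

With the real backend the option travels in storage_options, for example storage_options={"allow_duplicate_filenames": True} passed to df.to_csv.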

Credentials

The credentials used by fsspec_dnanexus to access DNAnexus are resolved in the following order:

  1. The token parameter passed in using storage_options

     # Option 1a
     dxfs = fsspec.filesystem("dnanexus", storage_options = {"token": "YOUR_DNANEXUS_TOKEN"})
    
     # Option 1b
     df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv", storage_options={
         "token": "YOUR_DNANEXUS_TOKEN",
     })
    
  2. The FSSPEC_DNANEXUS_TOKEN environment variable

     import os
     os.environ['FSSPEC_DNANEXUS_TOKEN'] = "YOUR_DNANEXUS_TOKEN"
     df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv")
    
  3. The credentials currently used by dxpy. If you're using a DNAnexus workstation, this is a good place to start.

     df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv")
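The lookup order above can be sketched as a plain function. This is an illustration of the documented precedence, not the backend's actual code: resolve_token and its return labels are hypothetical, and the dxpy fallback is only represented by a marker value.

```python
import os


def resolve_token(storage_options=None, environ=None):
    """Return (token, source) following the documented lookup order.

    Hypothetical helper; it mirrors the README's ordering only.
    """
    storage_options = storage_options or {}
    environ = os.environ if environ is None else environ

    # 1. An explicit token in storage_options wins
    if storage_options.get("token"):
        return storage_options["token"], "storage_options"

    # 2. Otherwise the FSSPEC_DNANEXUS_TOKEN environment variable
    if environ.get("FSSPEC_DNANEXUS_TOKEN"):
        return environ["FSSPEC_DNANEXUS_TOKEN"], "environment"

    # 3. Otherwise defer to whatever credentials dxpy is currently using
    return None, "dxpy"


print(resolve_token({"token": "A"}, {"FSSPEC_DNANEXUS_TOKEN": "B"}))
# ('A', 'storage_options')
```

Passing environ explicitly makes it easy to check the precedence without touching the real process environment.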
    

Limitations

  1. The following commands are currently unsupported:

    • dxfs.touch
    • dxfs.cat
    • dxfs.copy
    • dxfs.rm

  2. No local caching, which means repeated reads of a file incur repeated downloads.

  3. fsspec transactions (e.g. with fs.transaction:) are not supported.

  4. Files that are not in the 'closed' state on DNAnexus are currently not listed by ls().
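One way to work around the missing local cache is to copy a file to local disk once and reuse that copy. The sketch below assumes a hypothetical helper, cached_local_copy, that takes any opener returning a binary file-like context manager; it never checks whether the remote file has changed, so stale copies are possible.

```python
import hashlib
import os
import shutil


def cached_local_copy(url, open_remote, cache_dir):
    """Download url into cache_dir on first use and return the local path.

    Hypothetical helper, not part of fsspec_dnanexus. open_remote(url) must
    return a context manager yielding a binary file-like object.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the full URL so files with the same name in different folders do not collide
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    local_path = os.path.join(cache_dir, f"{digest}_{os.path.basename(url)}")

    if not os.path.exists(local_path):
        tmp_path = local_path + ".part"
        with open_remote(url) as src, open(tmp_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Rename only after a complete download so partial files are never reused
        os.replace(tmp_path, local_path)
    return local_path
```

With the real backend the opener could be lambda u: fsspec.open(u, "rb"), and the returned path can then be passed to pd.read_csv as often as needed without further downloads.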

Logging

You can override the logging level by setting the following environment variable:

os.environ["FSSPEC_DNANEXUS_LOGGING_LEVEL"] = "DEBUG"

Valid logging levels are listed here: https://docs.python.org/3/library/logging.html#levels
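A small guard can catch a mistyped level name before it reaches the environment. This is a sketch: set_dnanexus_log_level is a hypothetical helper, and calling it before importing or creating the filesystem is an assumption about when the variable is read.

```python
import os

# Standard level names from the Python logging documentation
VALID_LEVELS = ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET")


def set_dnanexus_log_level(level, environ=None):
    """Validate a level name and store it in FSSPEC_DNANEXUS_LOGGING_LEVEL.

    Hypothetical helper; only the variable name comes from the README.
    """
    environ = os.environ if environ is None else environ
    name = level.upper()
    if name not in VALID_LEVELS:
        raise ValueError(f"unknown logging level: {level!r}")
    environ["FSSPEC_DNANEXUS_LOGGING_LEVEL"] = name
    return name


# Assumption: set the variable before the backend is first used
set_dnanexus_log_level("debug")
```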

Changelog

0.0.3 (2023-06-05)

  • Fixed: When using PyArrow to read parquet, pd.read_parquet can now be invoked directly; passing in a file handle from fsspec.open() is no longer required.

0.0.2 (2023-05-29)

  • Fixed: Project description on PyPI, corrected import statement in README

0.0.1 (2023-05-29)

  • Initial release
