fsspec backend for the DNAnexus platform
Project description
fsspec_dnanexus
Installation
pip install fsspec-dnanexus
Usage
A URL for fsspec_dnanexus is constructed as follows. Paths have to be absolute paths with the leading forward slash
f"dnanexus://{PROJECT_NAME_OR_ID}:/{PATH_TO_FILE}"
Supported fsspec commands are listed below:
import fsspec
dxfs = fsspec.filesystem("dnanexus")
# Creating a directory
dxfs.mkdir("dnanexus://my-dx-project:/my/new/folder", create_parents=True)
# create_parents is True by default, but you can override it. In this case if the immedate parent does not exist
# an exception will be thrown
dxfs.mkdir("dnanexus://my-dx-project:/my/other/new/folder", create_parents=False)
# Directory listing
dxfs.ls("dnanexus://my-dx-project:/my/folder")
# Directory listing with entity type, size, etc
dxfs.ls("dnanexus://my-dx-project:/my/folder", detail=True)
DNAnexus entities are represented when listing a directory, but currently manipulation can only be done on files and folders
Libraries such as pandas and modin have fsspec support built-in, and you can use the dnanexus URL once fsspec_dnanexus is installed
Examples
Reading a CSV or parquet in pandas using project name
df = pd.read_csv("dnanexus://my-dx-project:/folder/data.csv")
df = pd.read_parquet("dnanexus://my-dx-project:/folder/data.parquet")
Writing a pandas dataframe using project ID
df.to_csv("dnanexus://project-XXXX:/folder/filename2.csv")
Reading using fsspec.open
with fsspec.open("dnanexus://project-XXXX:/folder/filename2.csv", 'r') as fp:
fp.read()
Reading the first 10 rows all CSV files in a folder
results = dxfs.ls("dnanexus://my-dx-project:/my/folder", detail=True)
# N.B. the 'type' attribute can be any DNAnexus entity type, such as 'file', 'directory', 'applet', 'record'
files = [x for x in results if x['type'] == 'file' and x['name'].endswith('csv')]
for f in files:
url = f"dnanexus://{f['project']}:{f['name']}"
df = pd.read_csv(url)
print(df.iloc[:10])
When writing files by default it will create any intermediate directories if they do not exist.
Handling of duplicate file paths
DNAnexus platform allows for duplicate filenames (but not folders)
When reading files, a file URL resolves only to the file with the latest creation date and ignore the rest.
When writing files, the default behaviour is to mimic POSIX file system such that there are no surprises from users coming from other fsspec backends, but this can easily be overrided in storage_options:
- allow_duplicate_filenames: False - [default] removes existing file(s) with the same path and writes the new file (just like file:// s3:// and other backends would)
- allow_duplicate_filenames: True - writes the file at specified path disregarding existing files
If the user's token allows for file writing but does not allow for the removal of the file (e.g. protected projects), the behaviour falls back to allow_duplicate_filenames: True to ensure no data loss.
Modin and ray
Required Initialization
When using fsspec_dnanexus under modin and ray, you'll need to invoke an initialization function.
For most users, the following function would perform the necessary initialization for fsspec_dnanexus to work under modin and ray
import fsspec_dnanexus
fsspec_dnanexus.init_for_modin_on_ray()
For advanced users, manually set the number of partitions. This impacts the number of chunks modin uses to process the dataframe in parallel and affects the number of files
import fsspec_dnanexus
fsspec_dnanexus.init_for_modin_on_ray(n_partitions=20)
Parquet folder
Note that when using df.to_parquet, each partition will simultaneously write a part of the dataframe and the resultant output is a folder containing many parts rather than a single file.
Other modin engines
Other modin engines such as dask and unidist are currently not supported.
Credentials
The credentials used by fsspec_dnanexus to access DNAnexus is determined by the following methods:
-
By default fsspec_dnanexus inherits the credentials currently used by dxpy. If you're using a DNAnexus workstation, this is a good place to start as you don't have to do anything.
df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv")
-
If the FSSPEC_DNANEXUS_TOKEN environment variable is set, it will be used over dx-toolkit's credentials
os.environ['FSSPEC_DNANEXUS_TOKEN'] = "YOUR_DNANEXUS_TOKEN" df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv")
-
If a DNAnexus token parameter passed in using storage_options, this takes priority over the two methods above
# Option 1a dxfs = fsspec.filesystem("dnanexus", storage_options = {"token": "YOUR_DNANEXUS_TOKEN"}) # Option 1b df = pd.read_csv("dnanexus://my-dx-project:/folder/filename1.csv", storage_options={ "token": "YOUR_DNANEXUS_TOKEN", })
Limitations
-
The following commands are currently unsupported:
- dxfs.touch
- dxfs.cat
- dxfs.copy
- dxfs.rm
-
No local caching, which means repeated reads of a file will incur repeated downloads
-
fsspec transactions (e.g.
with fs.transaction:
) are not supported. -
Files not in 'closed' state on DNAnexus are not listed by ls() currently.
Logging
You can override the logging level by setting the following environment variable os.environ["FSSPEC_DNANEXUS_LOGGING_LEVEL"] = "DEBUG"
Valid logging levels are listed here: https://docs.python.org/3/library/logging.html#levels
Changelog
0.2.5 (2024-09-24)
- Fixed: Error when using dxfs.upload()
0.2.4 (2024-05-10)
- Fixed: Dependencies conflicted with s3fs
0.2.3 (2023-12-07)
- Fixed: reading and writing hidden files
0.2.2 (2023-11-15)
- Fixed: using modin 0.24 no longer resulted in single node reads
- Fixed: resolved package conflicts
0.2.1 (2023-10-27)
- Feature: Compatiblility with PyArrow FileSystem wrapper.
- Added "size", "project", "full_url" properties to results returned by
dxfs.ls
anddxfs.find
when the object is a directory
- Added "size", "project", "full_url" properties to results returned by
- Bumped dependencies versions:
- dxpy==0.358.0, fsspec==2023.9.2, botocore
0.2.0 (2023-09-13)
- modin + ray improvements
- Use
fsspec_dnanexus.init_for_modin_on_ray()
to initialize the library for modin on ray - Significant increase in read performance + other optimizations
- Fixed writing a small csv file using modin and ray
- Use
0.1.0 (2023-08-04)
- Feature: Support for modin + ray
- Known issues:
- Under modin and ray, when writing a small csv file would result in an exception. The workaround is to set n_partition mentioned above to a low number
0.0.3 (2023-06-05)
- Fixed: When using PyArrow to read parquet,
pd.read_parquet
can be invoked directly, no longer requiring passing in file handler fromfsspec.open()
.
0.0.2 (2023-05-29)
- Fixed: Project description in PyPi, corrected import statement in README
0.0.1 (2023-05-29)
- Initial release
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file fsspec_dnanexus-0.2.5.tar.gz
.
File metadata
- Download URL: fsspec_dnanexus-0.2.5.tar.gz
- Upload date:
- Size: 24.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1916926ad2698f59d29bc87d92529bda17c170afb7d83c354df32b89702aa910 |
|
MD5 | b585542e2a78ff50b09eb5e4354124ef |
|
BLAKE2b-256 | 3ea1591b3409d1e01d9314ebfbae0a74ea02477100d0adf3255180df9f711eac |
File details
Details for the file fsspec_dnanexus-0.2.5-py3-none-any.whl
.
File metadata
- Download URL: fsspec_dnanexus-0.2.5-py3-none-any.whl
- Upload date:
- Size: 25.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0a1830a9de67a5710898714d97c4428670c288263d06d54ea5eb671b57468088 |
|
MD5 | fafcb780c6f3f0850c0c5834557b70ec |
|
BLAKE2b-256 | b30965c8c76a9b2a8bc739e4ce47e7cc890a4d0c3cac1984080b7dd733b1e796 |