django-nextflow
django-nextflow is a Django app for running Nextflow pipelines and storing their results in a database within a Django web app.
Installation
django-nextflow is available through PyPI:
pip install django-nextflow
You must install the Nextflow executable itself separately: see the Nextflow Documentation for help with this.
Setup
To use the app within Django, add django_nextflow to your list of INSTALLED_APPS.

You must define four values in your settings.py:
- NEXTFLOW_PIPELINE_ROOT - the location on disk where the Nextflow pipelines are stored. All references to pipeline files will use this as the root.
- NEXTFLOW_DATA_ROOT - the location on disk to store execution records.
- NEXTFLOW_UPLOADS_ROOT - the location on disk to store uploaded data.
- NEXTFLOW_PUBLISH_DIR - the name of the folder published files will be saved to. Within an execution directory, django-nextflow will look in NEXTFLOW_PUBLISH_DIR/process_name for output files for that process. These files must be published as symlinks, not copies, otherwise django-nextflow will not recognise them.
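For example, a minimal settings.py might look like the sketch below. The paths are placeholders rather than defaults, and "results" is just an example publish directory name:

# settings.py - illustrative values only
INSTALLED_APPS = [
    # ...
    "django_nextflow",
]

NEXTFLOW_PIPELINE_ROOT = "/data/pipelines"   # root for all .nf pipeline paths
NEXTFLOW_DATA_ROOT = "/data/executions"      # execution directories created here
NEXTFLOW_UPLOADS_ROOT = "/data/uploads"      # uploaded files copied here
NEXTFLOW_PUBLISH_DIR = "results"             # folder name processes publish to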
Usage
Begin by defining one or more Pipelines. These are .nf files somewhere within the NEXTFLOW_PIPELINE_ROOT you defined:
from django_nextflow.models import Pipeline
pipeline = Pipeline.objects.create(path="workflows/main.nf")
You can also provide paths to a JSON input schema file (structured using the nf-core style) and a config file to use when running it:
pipeline = Pipeline.objects.create(
    path="workflows/main.nf",
    description="Some useful pipeline.",
    schema_path="main.json",
    config_path="nextflow.config"
)
print(pipeline.input_schema) # Returns inputs as dict
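Since the schema is returned as an ordinary dict, you can introspect it like any mapping; the exact keys depend on the schema file you supplied:

# Iterate the parsed schema - the key names come from your schema file
for name, details in pipeline.input_schema.items():
    print(name, details)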
To run the pipeline:
execution = pipeline.run(params={"param1": "xxx"})
This will run the pipeline using Nextflow, and save database entries for three different models:

- The Execution that is returned represents the running of this pipeline on this occasion. It stores the stdout and stderr of the command, and has a get_log_text() method for reading the full log file from disk. A directory will be created in NEXTFLOW_DATA_ROOT for the execution to take place in.
- ProcessExecution records for each process that executed within the running of the pipeline. These also have their own stdout and stderr, as well as status information etc.
- Data records for each file published by the processes in the pipeline. Note that this is not every file produced - but specifically those output by the process via its output channel. For this to work the processes must be configured to publish these files to a particular directory name (the one that NEXTFLOW_PUBLISH_DIR is set to), and to a subdirectory within that directory with the process's name.
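Continuing from the run above, a rough sketch of inspecting these records: the stdout, stderr and get_log_text() attributes are described above, while the processexecution_set reverse relation name is an assumption based on Django's default naming rather than confirmed API:

print(execution.stdout)           # Nextflow's console output for this run
print(execution.stderr)
print(execution.get_log_text())   # full log file, read from disk

# Reverse relation name below is an assumed Django default:
for proc in execution.processexecution_set.all():
    print(proc.stdout)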
If you want to supply a file for which there is a Data object as the input to a pipeline, you can do so as follows:
execution = pipeline.run(
    params={"param1": "xxx"},
    data_params={"param2": 23, "param3": [24, 25]}
)
...where 23, 24 and 25 are the IDs of Data objects.
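Since Data is a regular Django model, the IDs can come from any ordinary queryset lookup, for example:

from django_nextflow.models import Data

data = Data.objects.get(id=23)  # fetch a previously created record
execution = pipeline.run(
    params={"param1": "xxx"},
    data_params={"param2": data.id}
)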
You can also supply entire executions as inputs, in which case they will be provided to the pipeline as a directory of symlinked files:
execution = pipeline.run(
    params={"param1": "xxx"},
    execution_params={"genome1": 23, "genome2": 24}
)
The Data objects above were created by running some pipeline, but you might want to create one from scratch without running a pipeline. You can do so either from a path string, or from a Django UploadedFile object:
data1 = Data.create_from_path("/path/to/file.txt")
data2 = Data.create_from_upload(django_upload_object)
The file will be copied to NEXTFLOW_UPLOADS_ROOT in this case.
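For instance, create_from_upload slots naturally into a view. A minimal sketch, assuming a form field named "file" (the view itself is illustrative, not part of the library):

from django.http import JsonResponse
from django_nextflow.models import Data

def upload_view(request):
    # request.FILES values are Django UploadedFile objects
    data = Data.create_from_upload(request.FILES["file"])
    return JsonResponse({"id": data.id})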
Changelog
0.5
3rd February, 2022
- Pipelines can now take execution inputs.
- Fixed method for detecting downstream data products.
0.4
12th January, 2022
- Better support for multiple data objects.
- Data objects can now be directories, which will be automatically zipped.
- When creating upstream data connections, data objects will be created if needed.
0.3.2
26th December, 2021
- Allow IDs to be big ints.
0.3.1
24th December, 2021
- Data file sizes can now be more than 2^32.
- Data file names can now be 1000 characters long.
0.3
21st December, 2021
- Pipelines can now take multiple data inputs per param.
- Profiles can now be specified when running a pipeline.
- Compression extension .gz now ignored when detecting filetype.
- Process executions start and end times are now recorded.
- Improved system for identifying upstream data inputs.
- Improved publish_dir identification.
- Improved log file reading.
0.2
14th November, 2021
- Pipelines now have description fields.
- Data objects now have creation time fields.
- Added upstream data objects as well as downstream to process executions.
0.1.1
3rd November, 2021
- Fixed duration string parsing.
0.1
29th October, 2021
- Initial models for pipelines, execution, process executions and data.