scbl-utils
A set of command-line utilities that facilitates data processing in the Single Cell Biology Lab at the Jackson Laboratory.
Top-level Usage
scbl-utils [OPTIONS] COMMAND [ARGS]...
Example Usage
scbl-utils samplesheet-from-gdrive /path/to/fastq_dir /path/to/another/fastq_dir /path/to/a_dir/of/fastq_dirs/*
Top-level Options
--config-dir, -c: Configuration directory containing the files necessary for the script to run. (default: /sc/service/etc/.config/scbl-utils)
Commands
- samplesheet-from-gdrive: Pull data from Google Drive and generate a yml samplesheet to be used as input for the nf-tenx pipeline.
Usage
scbl-utils samplesheet-from-gdrive [OPTIONS] FASTQDIRS
Options
--outsheet, -o: File path to save the resulting samplesheet. [default: samplesheet.yml]
--reference-path-as-str, -s: If possible, write the reference_path field of the output samplesheet as a string rather than a list of strings. This enables compatibility with the current nf-tenx pipeline and will be deprecated in the future as the pipeline is updated.
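The effect of --reference-path-as-str can be pictured with a short sketch. It assumes that "if possible" means the reference_path list holds exactly one path; that reading, the function name, and the example paths are hypothetical and are not taken from the scbl-utils source.

```python
def reference_path_as_str(sample: dict) -> dict:
    """Return a copy of one samplesheet entry, collapsing reference_path
    to a plain string when it is a list with exactly one element.

    Assumption: lists with several paths cannot be collapsed and are
    left unchanged.
    """
    ref = sample.get("reference_path")
    if isinstance(ref, list) and len(ref) == 1:
        return {**sample, "reference_path": ref[0]}
    return dict(sample)


# Hypothetical entries, for illustration only
single = {"libraries": ["SC0001"], "reference_path": ["/refs/genome_a"]}
multi = {"libraries": ["SC0002"], "reference_path": ["/refs/genome_a", "/refs/genome_b"]}

print(reference_path_as_str(single)["reference_path"])
print(reference_path_as_str(multi)["reference_path"])
```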
Requirements
For the script to work, the config-dir (see Top-level Options) must contain a directory called google-drive. In {config-dir}/google-drive, 3 files must exist:
- trackingsheet-spec.yml: a specification that instructs the script on how to read data from the Google Sheet being used as the "sample tracking sheet". The specification must contain the following keys:
  - id: the Google Spreadsheet ID, found in https://docs.google.com/spreadsheets/d/`spreadsheet_id`/
  - sheets: a mapping of the ID of each worksheet (docs.google.com/spreadsheets/d/spreadsheet_id/edit#gid=worksheet_id) within the spreadsheet to another mapping of information about that sheet. The keys of this dict must contain:
    - columns: yet another mapping, this time from the column names in the sheet to the names they should have in the script. The union of all of the values of these mappings should minimally be {10x_platform, sample_name, is_nuclei, libraries, project, species, n_cells, slide, area, tag_id}, and libraries must exist in the sheets where join == true so that the sheets can be joined.
    - header_row: the index of the header row (0-based), which contains the column names
    - join: a bool indicating whether to join this sheet (along the columns) to the other sheets in the spreadsheet. Useful for spreadsheets in which not every sheet shares the same index, so that not all sheets are joinable
  - platform_to_lib_type: another mapping, this time from the name of a 10X platform to the library type

  Example trackingsheet-spec.yml:

      id: <spreadsheet_id>
      sheets:
        0:
          columns:
            10X Platform: 10x_platform
            Customer ID: sample_name
            Is Nuclei: is_nuclei
            Sample Name (SCBID): libraries
            SCBL Project: project
            Species: species
            Targeted Cell Recovery: n_cells
          header_row: 2
          join: true
        2:
          columns:
            Sample Name (SCBID): libraries
            Serial Number GEX Slide: slide
          header_row: 2
          join: true
        4:
          columns:
            Sample Name (SCBID): libraries
            Position on Slide: area
          header_row: 2
          join: true
        5:
          columns:
            Customer ID: sub_sample_name
            Pool ID: sample_name
            SCID: libraries
            Tag ID: tag_id
            Tissue/Cell Type: description
          header_row: 0
          join: false
      platform_to_lib_type:
        3' RNA: Gene Expression
        3' RNA-HT: Gene Expression
        5' RNA: Gene Expression
        5' RNA-HT: Gene Expression
        5' VDJ: Immune Profiling
        ATAC: Chromatin Accessibility
        ATAC v2: Chromatin Accessibility
        Automated RNA: Gene Expression
        Cell Surface: Antibody Capture
        CellPlex: Multiplexing Capture
        Flex: Gene Expression
        HTO: Multiplexing Capture
        LMO: Multiplexing Capture
        Multiome ATAC: Chromatin Accessibility
        Multiome RNA: Gene Expression
        RNA: Gene Expression
        RNA-HT: Gene Expression
        Visium CytAssist FFPE: CytAssist Gene Expression
        Visium FF: Spatial Gene Expression
        Visium FFPE: Spatial Gene Expression
- metricssheet-spec.yml: a specification that instructs the script on how to read metrics sheets from Google Drive. This is useful for automated assignment of processing tool, tool version, and reference genome, as the script looks for old metrics spreadsheets within the same "SCBL Project" to assign these values when possible. The necessary keys are similar to those of trackingsheet-spec.yml:
  - dir_id: the ID of the Google Drive folder where delivered metrics are stored, found in https://drive.google.com/drive/folders/`folderID`
  - header_row: the header row of the metrics sheets. This assumes that all sheets within all spreadsheets in the metrics delivery folder have the same header row
  - columns: just like in trackingsheet-spec.yml, this is a mapping of the column names as they appear in the spreadsheets to how they should be named in the script. The union of these columns should minimally be {project, tool, tool_version, reference, libraries}. Because different metrics sheets use different names for the libraries column, it may be necessary to add key-value pairs to this mapping if the script throws a pandas error along the lines of KeyError: '{key}'. The script uses this mapping to determine which columns in the spreadsheet go into a pandas.DataFrame, so it will throw an error if it hasn't been informed that a certain column in the spreadsheet is really the project column, for example.

  Example metricssheet-spec.yml:

      dir_id: <dir_id>
      header_row: 0
      columns:
        SCBL Project: project
        Processing Tool: tool
        Processing Tool Version: tool_version
        Processing Reference: reference
        Library ID(s): libraries
        Sample ID: libraries
- service-account.json: a json file that stores credentials for a service account associated with a Google Cloud Project. This shouldn't be an issue, but if the script is throwing Google Drive login errors, this file might need to be regenerated and put in {config-dir}/google-drive. See the gspread documentation for instructions.
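The requirements on trackingsheet-spec.yml and metricssheet-spec.yml listed above can be checked before running the script. The following is a minimal sketch, assuming each spec has already been loaded into a plain dict (for example with a YAML parser). The function and constant names are hypothetical and are not part of scbl-utils; only the required column sets are copied from the requirements above.

```python
# Required renamed columns, as stated in the requirements above.
TRACKING_REQUIRED = {
    "10x_platform", "sample_name", "is_nuclei", "libraries", "project",
    "species", "n_cells", "slide", "area", "tag_id",
}
METRICS_REQUIRED = {"project", "tool", "tool_version", "reference", "libraries"}


def check_tracking_spec(spec: dict) -> list[str]:
    """Return a list of problems found in a trackingsheet spec (empty if none)."""
    problems = [
        f"missing top-level key: {key}"
        for key in ("id", "sheets", "platform_to_lib_type")
        if key not in spec
    ]
    renamed = set()
    for sheet_id, sheet in spec.get("sheets", {}).items():
        for key in ("columns", "header_row", "join"):
            if key not in sheet:
                problems.append(f"sheet {sheet_id}: missing key: {key}")
        targets = set(sheet.get("columns", {}).values())
        renamed |= targets
        # Joined sheets need a libraries column to join on.
        if sheet.get("join") and "libraries" not in targets:
            problems.append(
                f"sheet {sheet_id}: join is true but no column maps to libraries"
            )
    missing = TRACKING_REQUIRED - renamed
    if missing:
        problems.append(f"no column maps to: {sorted(missing)}")
    return problems


def check_metrics_spec(spec: dict) -> list[str]:
    """Return a list of problems found in a metricssheet spec (empty if none)."""
    problems = [
        f"missing top-level key: {key}"
        for key in ("dir_id", "header_row", "columns")
        if key not in spec
    ]
    missing = METRICS_REQUIRED - set(spec.get("columns", {}).values())
    if missing:
        problems.append(f"no column maps to: {sorted(missing)}")
    return problems


# The metricssheet-spec.yml example from above, written as a dict.
metrics_spec = {
    "dir_id": "<dir_id>",
    "header_row": 0,
    "columns": {
        "SCBL Project": "project",
        "Processing Tool": "tool",
        "Processing Tool Version": "tool_version",
        "Processing Reference": "reference",
        "Library ID(s)": "libraries",
        "Sample ID": "libraries",
    },
}
print(check_metrics_spec(metrics_spec))  # prints []
```

A check like this only catches structural gaps in the spec. A column that is present in a spreadsheet but absent from the mapping would still surface at run time, as described for the KeyError case above.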
Adding New Tool Versions
Because scbl-utils samplesheet-from-gdrive is designed to be used in conjunction with the nf-tenx pipeline, it queries the container collection the pipeline pulls from to know what versions of tools are available. If you have the required permissions, you can simply add a new definition file to the registry, ensuring that the file name you add follows the convention of the other definition files. Note that you won't be able to access the registry if you're not connected to JAX WiFi.
Hashes for scbl_utils-1.15-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 432e9c771a73ed5fa9139f363be1e337d5e8ba0a538f2ba45719caa3ef67a769
MD5 | 50353c2685364889c5a2d13d5de7a6fe
BLAKE2b-256 | c5f5c857289b7c82e5d3a371f320298f268c6a3b8baf8022178fc6e849c25def