Skip to main content

Format-agnostic parser for Illumina SampleSheet.csv files — supports IEM V1 and BCLConvert V2

Project description

samplesheet-parser

Format-agnostic Python parser for Illumina SampleSheet.csv files.

Supports IEM V1 (bcl2fastq / NovaSeq 6000 era) and BCLConvert V2 (NovaSeq X series) with automatic format detection, index validation, and OverrideCycles / UMI decoding — no format hints required from the caller.

PyPI version Python 3.10+ License: Apache 2.0 Tests codecov


The problem

Labs running mixed instrument fleets — NovaSeq 6000 alongside NovaSeq X series — produce two structurally incompatible SampleSheet.csv formats:

IEM V1 BCLConvert V2
Discriminator IEMFileVersion in [Header] FileFormatVersion in [Header]
Data section [Data] [BCLConvert_Data]
Settings section [Settings] [BCLConvert_Settings]
Index columns index, index2 (lowercase) Index, Index2 (uppercase)
Read cycles Bare integers Key-value (Read1Cycles,151)
UMI encoding Not supported OverrideCycles string
Used with bcl2fastq BCLConvert ≥ 3.x

Without a single parser, every pipeline component that reads a SampleSheet needs an if v1 else v2 branch — or worse, the format is hardcoded and the wrong sheet is silently processed.


Installation

pip install samplesheet-parser

Requires Python 3.10+. The only mandatory dependency is loguru.


Quickstart

Auto-detect format (recommended)

from samplesheet_parser import SampleSheetFactory

factory = SampleSheetFactory()
sheet = factory.create_parser("SampleSheet.csv", parse=True)

print(factory.version)           # SampleSheetVersion.V1 or .V2
print(sheet.index_type())        # "dual", "single", or "none"
print(factory.get_umi_length())  # 0 if no UMI

for sample in sheet.samples():
    print(sample["sample_id"], sample["index"])

Validate before demultiplexing

from samplesheet_parser import SampleSheetFactory, SampleSheetValidator

sheet = SampleSheetFactory().create_parser("SampleSheet.csv", parse=True)
result = SampleSheetValidator().validate(sheet)

print(result.summary())
# PASS — 0 error(s), 1 warning(s)

for err in result.errors:
    print(err)
# [ERROR] DUPLICATE_INDEX: Index 'ATTACTCG+TATAGCCT' appears more than once in lane 1.

UMI extraction (V2 only)

from samplesheet_parser import SampleSheetV2

sheet = SampleSheetV2("SampleSheet.csv", parse=True)

# OverrideCycles: Y151;I10U9;I10;Y151 → 9 bp UMI in Index1
print(sheet.get_umi_length())        # 9
rs = sheet.get_read_structure()
print(rs.umi_location)               # "index2"
print(rs.read_structure)             # {"read1_template": 151, "index2_length": 10, "index2_umi": 9, ...}

Use parsers directly

from samplesheet_parser import SampleSheetV1, SampleSheetV2

# V1
v1 = SampleSheetV1("SampleSheet_v1.csv", parse=True)
print(v1.experiment_name)    # "240115_A01234_0042_AHJLG7DRXX"
print(v1.instrument_type)    # "NovaSeq 6000"
print(v1.adapter_read1)      # "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
print(v1.adapter_read2)      # "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
print(v1.reverse_complement) # 0
print(v1.read_lengths)       # [151, 151]

# V2
v2 = SampleSheetV2("SampleSheet_v2.csv", parse=True)
print(v2.instrument_platform)  # "NovaSeqXSeries"
print(v2.software_version)     # "3.9.3"

Format detection logic

The factory uses a three-step strategy — stopping as early as possible:

1. Scan [Header] for FileFormatVersion  → V2
                  or IEMFileVersion     → V1

2. If undetermined: scan full file for
   [BCLConvert_Settings] or
   [BCLConvert_Data]                    → V2

3. Default                              → V1

The file is read only once. No second open(), no seek().


OverrideCycles decoding

The V2 OverrideCycles string encodes the full read structure using single-letter type codes:

Code Meaning
Y Template (sequenced) bases
I Index bases
U UMI bases
N Masked / skipped bases

Segment order: Read1 ; Index1 ; Index2 ; Read2

OverrideCycles UMI length UMI location
Y151;I10;I10;Y151 0
Y151;I10U9;I10;Y151 9 bp Index1
U5Y146;I8;I8;U5Y146 5 bp Read1 + Read2
Y76;I8;Y76 0 — (single-index)

Validation checks

Code Level Condition
EMPTY_SAMPLES error No samples found in data section
INVALID_INDEX_CHARS error Index contains non-ACGTN characters
INDEX_TOO_LONG error Index longer than 24 bp
DUPLICATE_INDEX error Two samples share an index in the same lane
DUPLICATE_SAMPLE_ID error Same Sample_ID appears twice in one lane
INDEX_TOO_SHORT warning Index shorter than 6 bp
NO_ADAPTERS warning No adapter sequences configured
ADAPTER_MISMATCH warning Adapter is not a standard Illumina sequence

API reference

SampleSheetFactory

factory = SampleSheetFactory()
sheet = factory.create_parser(path, *, clean=True, experiment_id=None, parse=None)
Attribute / Method Returns Description
.create_parser(path, ...) SampleSheetV1 | SampleSheetV2 Auto-detect format and return parser
.get_umi_length() int UMI length from current parser
.version SampleSheetVersion Detected version after create_parser()

Shared interface — SampleSheetV1 and SampleSheetV2

Method / Attribute Returns Description
.parse(do_clean=True) None Parse all sections
.samples() list[dict] One record per unique sample
.index_type() str "dual", "single", or "none"
.adapters list[str] All configured adapter sequences
.experiment_name str | None Run or experiment name
.read_lengths / .reads list[int] / dict Read cycle lengths

V1-specific

Attribute Type Description
.iem_version str | None e.g. "5"
.instrument_type str | None e.g. "NovaSeq 6000", "MiSeq"
.application str | None e.g. "FASTQ Only"
.assay str | None Library prep kit name
.index_adapters str | None Illumina index set name
.chemistry str | None "Amplicon" = dual index, "Default" = single/no index
.adapter_read1 str Read 1 adapter (Adapter or AdapterRead1 key)
.adapter_read2 str Read 2 adapter (AdapterRead2 key)
.reverse_complement int 0 = default, 1 = reverse-complement R2 (Nextera MP only)
.flowcell_id str | None Parsed from experiment ID run folder name

V2-specific

Method / Attribute Returns Description
.get_umi_length() int UMI length from OverrideCycles
.get_read_structure() ReadStructure Full decoded read structure
.instrument_platform str | None e.g. "NovaSeqXSeries"
.software_version str | None BCLConvert version string
.custom_fields dict[str, set[str]] Non-standard fields by section

Example sample sheets

The examples/sample_sheets/ directory contains ready-to-use reference sheets for every supported configuration:

File Format Instrument UMI Use case
v1_dual_index.csv V1 NovaSeq 6000 No Standard WGS, multi-lane
v1_single_index.csv V1 NextSeq 500 No Small RNA
v1_multi_lane.csv V1 NovaSeq 6000 No 4 lanes, mixed projects
v2_novaseq_x_dual_index.csv V2 NovaSeq X No Standard PE150
v2_with_index_umi.csv V2 NovaSeq X Index1 UMI (9 bp) cfDNA / liquid biopsy
v2_with_read_umi.csv V2 NovaSeq X Read UMI (5 bp) Duplex sequencing
v2_nextseq_single_index.csv V2 NextSeq 1000/2000 No Amplicon panel

Run the demo to parse all of them:

python examples/parse_examples.py

V1 adapter key reference

From the Illumina IEM specification, the correct V1 [Settings] adapter keys are:

[Settings]
ReverseComplement,0
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
  • Adapter = Read 1 adapter (primary key per IEM spec)
  • AdapterRead2 = Read 2 adapter (explicit separate key)
  • AdapterRead1 = BCLConvert V1-mode alias for Adapter (also accepted)
  • ReverseComplement,1 = only for Nextera Mate Pair libraries; 0 for everything else

Project structure

samplesheet-parser/
├── samplesheet_parser/
│   ├── __init__.py          # Public API
│   ├── factory.py           # SampleSheetFactory — auto-detection
│   ├── enums.py             # SampleSheetVersion, IndexType, ...
│   ├── validators.py        # SampleSheetValidator, ValidationResult
│   └── parsers/
│       ├── v1.py            # IEM V1 parser (bcl2fastq)
│       └── v2.py            # BCLConvert V2 parser (NovaSeq X)
├── tests/
│   ├── conftest.py          # Shared fixtures
│   ├── test_factory.py
│   ├── test_parsers/
│   │   ├── test_v1.py
│   │   └── test_v2.py
│   └── test_validators/
│       └── test_validators.py
├── examples/
│   ├── parse_examples.py    # Demo script
│   └── sample_sheets/       # Reference SampleSheet.csv files
├── pyproject.toml
├── LICENSE
└── README.md

Development

git clone https://github.com/chaitanyakasaraneni/samplesheet-parser
cd samplesheet-parser
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run example demo
python examples/parse_examples.py

Citation

If you use this library in a published pipeline or analysis, please cite:

@software{kasaraneni2026samplsheetparser,
  author  = {Kasaraneni, Chaitanya},
  title   = {samplesheet-parser: Format-agnostic parser for Illumina SampleSheet.csv},
  year    = {2026},
  url     = {https://github.com/chaitanyakasaraneni/samplesheet-parser},
  version = {0.1.4}
}

License

Apache 2.0 — see LICENSE.


Related resources

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

samplesheet_parser-0.1.5.tar.gz (30.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

samplesheet_parser-0.1.5-py3-none-any.whl (26.1 kB view details)

Uploaded Python 3

File details

Details for the file samplesheet_parser-0.1.5.tar.gz.

File metadata

  • Download URL: samplesheet_parser-0.1.5.tar.gz
  • Upload date:
  • Size: 30.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for samplesheet_parser-0.1.5.tar.gz
Algorithm Hash digest
SHA256 b2079e6d26a8810896b468268f54edf34322dd649f4dcbc0c0883f52d35c12d9
MD5 70151d57a726275fe5f8921fc0a81e59
BLAKE2b-256 6f0072edb26d674b7cb8d2f3b97aa941424d280d4b82158cd2b731ab16c2c9ef

See more details on using hashes here.

File details

Details for the file samplesheet_parser-0.1.5-py3-none-any.whl.

File metadata

File hashes

Hashes for samplesheet_parser-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 fa4292832467578ea7f053cded554fec52f8cb72793fbe876284fb2be5d00689
MD5 e8c88e29e080ef729eaf8faf33b340c4
BLAKE2b-256 b60c01f55a9311a4d5ca7c23be149db0e801fb7e8a0334be3c4e1496ba2b1aad

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page