Format-agnostic parser for Illumina SampleSheet.csv files — supports IEM V1 and BCLConvert V2

samplesheet-parser

Format-agnostic Python parser for Illumina SampleSheet.csv files.

Supports IEM V1 (bcl2fastq / NovaSeq 6000 era) and BCLConvert V2 (NovaSeq X series) with automatic format detection, index validation, and OverrideCycles / UMI decoding — no format hints required from the caller.

The problem

Labs running mixed instrument fleets — NovaSeq 6000 alongside NovaSeq X series — produce two structurally incompatible SampleSheet.csv formats:

| Feature | IEM V1 | BCLConvert V2 |
|---|---|---|
| Discriminator | `IEMFileVersion` in `[Header]` | `FileFormatVersion` in `[Header]` |
| Data section | `[Data]` | `[BCLConvert_Data]` |
| Settings section | `[Settings]` | `[BCLConvert_Settings]` |
| Index columns | `index`, `index2` (lowercase) | `Index`, `Index2` (capitalized) |
| Read cycles | Bare integers | Key-value (`Read1Cycles,151`) |
| UMI encoding | Not supported | `OverrideCycles` string |
| Used with | bcl2fastq | BCLConvert ≥ 3.x |

Without a single parser, every pipeline component that reads a SampleSheet needs an `if v1 else v2` branch. Worse, a component that hardcodes one format can silently misread a sheet written in the other.
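To make that branching cost concrete, here is a minimal sketch of the normalization a pipeline would otherwise write by hand. It is not taken from the library: the helper name `normalize_row` and the output keys are hypothetical, and only the column names (`index`/`index2` versus `Index`/`Index2`) come from the table above.

```python
def normalize_row(row: dict[str, str]) -> dict[str, str]:
    """Map V1 or V2 index column names onto one lowercase schema.

    Hypothetical helper for illustration; not part of samplesheet-parser.
    """
    # Lowercasing keys makes `Index`/`Index2` (V2) match `index`/`index2` (V1).
    lowered = {key.lower(): value.strip() for key, value in row.items()}
    return {
        "sample_id": lowered.get("sample_id", ""),
        "index": lowered.get("index", ""),
        "index2": lowered.get("index2", ""),
    }


v1_row = {"Sample_ID": "S1", "index": "ATTACTCG", "index2": "TATAGCCT"}
v2_row = {"Sample_ID": "S1", "Index": "ATTACTCG", "Index2": "TATAGCCT"}

# Both formats collapse to the same record.
print(normalize_row(v1_row) == normalize_row(v2_row))  # True
```

This only covers column naming; section names and read-cycle encoding differ too, which is why a shared parser is more practical than per-component helpers.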

Installation

pip install samplesheet-parser

Requires Python 3.10+. The only mandatory dependency is loguru.

Quickstart

Auto-detect format (recommended)

from samplesheet_parser import SampleSheetFactory

factory = SampleSheetFactory()
sheet = factory.create_parser("SampleSheet.csv", parse=True)

print(factory.version)           # SampleSheetVersion.V1 or .V2
print(sheet.index_type())        # "dual", "single", or "none"
print(factory.get_umi_length())  # 0 if no UMI

for sample in sheet.samples():
    print(sample["sample_id"], sample["index"])

Validate before demultiplexing

from samplesheet_parser import SampleSheetFactory, SampleSheetValidator

sheet = SampleSheetFactory().create_parser("SampleSheet.csv", parse=True)
result = SampleSheetValidator().validate(sheet)

print(result.summary())
# PASS — 0 error(s), 1 warning(s)

for err in result.errors:
    print(err)
# Example of an error line (printed only when errors exist):
# [ERROR] DUPLICATE_INDEX: Index 'ATTACTCG+TATAGCCT' appears more than once in lane 1.

UMI extraction (V2 only)

from samplesheet_parser import SampleSheetV2

sheet = SampleSheetV2("SampleSheet.csv", parse=True)

# OverrideCycles: Y151;I10U9;I10;Y151 → 9 bp UMI in Index1
print(sheet.get_umi_length())        # 9
rs = sheet.get_read_structure()
print(rs.umi_location)               # "index2"
print(rs.read_structure)             # {"read1_template": 151, "index2_length": 10, "index2_umi": 9, ...}

Use parsers directly

from samplesheet_parser import SampleSheetV1, SampleSheetV2

# V1
v1 = SampleSheetV1("SampleSheet_v1.csv", parse=True)
print(v1.experiment_name)    # "240115_A01234_0042_AHJLG7DRXX"
print(v1.instrument_type)    # "NovaSeq 6000"
print(v1.adapter_read1)      # "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
print(v1.adapter_read2)      # "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
print(v1.reverse_complement) # 0
print(v1.read_lengths)       # [151, 151]

# V2
v2 = SampleSheetV2("SampleSheet_v2.csv", parse=True)
print(v2.instrument_platform)  # "NovaSeqXSeries"
print(v2.software_version)     # "3.9.3"

Format detection logic

The factory uses a three-step strategy and stops at the first step that yields an answer:

1. Scan [Header] for FileFormatVersion  → V2
                  or IEMFileVersion     → V1

2. If undetermined: scan full file for
   [BCLConvert_Settings] or
   [BCLConvert_Data]                    → V2

3. Default                              → V1

The file is read only once: there is no second `open()` and no `seek()`.
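The three steps can be sketched in a few lines of standalone Python. This is an illustration under stated assumptions, not the factory's actual code: `Version` is a hypothetical stand-in for `SampleSheetVersion`, and `detect_version` works on text that has already been read into memory, which mirrors the single-read behavior described above.

```python
from enum import Enum


class Version(Enum):  # hypothetical stand-in for SampleSheetVersion
    V1 = "V1"
    V2 = "V2"


V2_SECTIONS = ("[BCLConvert_Settings]", "[BCLConvert_Data]")


def detect_version(text: str) -> Version:
    """Apply the three-step strategy to sheet contents held in memory."""
    lines = [line.strip() for line in text.splitlines()]

    # Step 1: look for a discriminator key inside [Header] only.
    in_header = False
    for line in lines:
        if line.startswith("["):
            in_header = line.lower().startswith("[header]")
            continue
        if in_header:
            key = line.split(",", 1)[0].strip()
            if key == "FileFormatVersion":
                return Version.V2
            if key == "IEMFileVersion":
                return Version.V1

    # Step 2: fall back to V2-only section names anywhere in the file.
    if any(line.startswith(V2_SECTIONS) for line in lines):
        return Version.V2

    # Step 3: default to V1.
    return Version.V1


print(detect_version("[Header]\nIEMFileVersion,5\n[Data]\nSample_ID,index\n"))
print(detect_version("[Reads]\n151\n[BCLConvert_Data]\nSample_ID,Index\n"))
```

The second call has no `[Header]` discriminator, so it falls through to step 2 and is classified as V2 from its section name.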

OverrideCycles decoding

The V2 OverrideCycles string encodes the full read structure using single-letter type codes:

| Code | Meaning |
|---|---|
| `Y` | Template (sequenced) bases |
| `I` | Index bases |
| `U` | UMI bases |
| `N` | Masked / skipped bases |

Segment order: Read1 ; Index1 ; Index2 ; Read2

| OverrideCycles | UMI length | UMI location |
|---|---|---|
| `Y151;I10;I10;Y151` | 0 bp | none |
| `Y151;I10U9;I10;Y151` | 9 bp | Index1 |
| `U5Y146;I8;I8;U5Y146` | 5 bp | Read1 + Read2 |
| `Y76;I8;Y76` | 0 bp | none (single-index) |
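As a rough illustration of how such a string can be decoded, the sketch below splits on `;`, reads each `<code><count>` pair, and reports which segments carry `U` bases. It is not the library's implementation. The function name and return shape are hypothetical, the three-segment layout is assumed to be Read1 ; Index1 ; Read2, and the UMI length is taken as the longest per-segment UMI so that the `U5Y146;I8;I8;U5Y146` row yields 5 rather than 10.

```python
import re

PAIR_RE = re.compile(r"([YIUN])(\d+)")
SEGMENT_RE = re.compile(r"(?:[YIUN]\d+)+")


def decode_override_cycles(value: str) -> tuple[int, list[str]]:
    """Return (UMI length, names of the segments that contain UMI bases)."""
    segments = value.split(";")
    if len(segments) == 4:
        names = ["Read1", "Index1", "Index2", "Read2"]
    elif len(segments) == 3:
        # Assumption: a single-index run drops the Index2 segment.
        names = ["Read1", "Index1", "Read2"]
    else:
        raise ValueError(f"unexpected segment count in {value!r}")

    umi_by_segment: dict[str, int] = {}
    for name, segment in zip(names, segments):
        if not SEGMENT_RE.fullmatch(segment):
            raise ValueError(f"malformed segment {segment!r}")
        umi = sum(int(n) for code, n in PAIR_RE.findall(segment) if code == "U")
        if umi:
            umi_by_segment[name] = umi

    # Assumption: report the longest per-segment UMI, not the sum.
    umi_length = max(umi_by_segment.values(), default=0)
    return umi_length, list(umi_by_segment)


for cycles in ("Y151;I10U9;I10;Y151", "U5Y146;I8;I8;U5Y146", "Y76;I8;Y76"):
    print(cycles, decode_override_cycles(cycles))
```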

Validation checks

| Code | Level | Condition |
|---|---|---|
| `EMPTY_SAMPLES` | error | No samples found in data section |
| `INVALID_INDEX_CHARS` | error | Index contains non-ACGTN characters |
| `INDEX_TOO_LONG` | error | Index longer than 24 bp |
| `DUPLICATE_INDEX` | error | Two samples share an index in the same lane |
| `DUPLICATE_SAMPLE_ID` | error | Same `Sample_ID` appears twice in one lane |
| `INDEX_TOO_SHORT` | warning | Index shorter than 6 bp |
| `NO_ADAPTERS` | warning | No adapter sequences configured |
| `ADAPTER_MISMATCH` | warning | Adapter is not a standard Illumina sequence |
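The index-related rows of this table translate into fairly small checks. The sketch below is an illustration only and does not reproduce `SampleSheetValidator`: the function name, the record keys (`sample_id`, `lane`, `index`, `index2`), and the tuple-shaped findings are all assumptions, while the codes and the 6 bp / 24 bp thresholds come from the table.

```python
from collections import Counter

VALID_BASES = set("ACGTN")
MIN_INDEX_LEN = 6   # shorter than this is a warning
MAX_INDEX_LEN = 24  # longer than this is an error


def check_indexes(samples: list[dict[str, str]]) -> list[tuple[str, str, str]]:
    """Return (level, code, sample_id) findings for the index checks."""
    if not samples:
        return [("error", "EMPTY_SAMPLES", "")]

    findings = []
    for sample in samples:
        sid = sample["sample_id"]
        for index in (sample.get("index", ""), sample.get("index2", "")):
            if not index:
                continue
            if set(index) - VALID_BASES:
                findings.append(("error", "INVALID_INDEX_CHARS", sid))
            if len(index) > MAX_INDEX_LEN:
                findings.append(("error", "INDEX_TOO_LONG", sid))
            elif len(index) < MIN_INDEX_LEN:
                findings.append(("warning", "INDEX_TOO_SHORT", sid))

    # Assumption: duplicates are judged per lane on the index/index2 pair.
    def key(s: dict[str, str]) -> tuple[str, str, str]:
        return (s.get("lane", ""), s.get("index", ""), s.get("index2", ""))

    counts = Counter(key(s) for s in samples)
    for sample in samples:
        if counts[key(sample)] > 1:
            findings.append(("error", "DUPLICATE_INDEX", sample["sample_id"]))
    return findings


samples = [
    {"sample_id": "S1", "lane": "1", "index": "ATTACTCG", "index2": "TATAGCCT"},
    {"sample_id": "S2", "lane": "1", "index": "ATTACTCG", "index2": "TATAGCCT"},
    {"sample_id": "S3", "lane": "1", "index": "ACGX", "index2": ""},
]
for finding in check_indexes(samples):
    print(finding)
```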

API reference

`SampleSheetFactory`

factory = SampleSheetFactory()
# Keyword-only options after `path`: clean, experiment_id, parse
sheet = factory.create_parser(path, clean=True, experiment_id=None, parse=None)

| Attribute / Method | Returns | Description |
|---|---|---|
| `.create_parser(path, ...)` | `SampleSheetV1 \| SampleSheetV2` | Auto-detect format and return parser |
| `.get_umi_length()` | `int` | UMI length from current parser |
| `.version` | `SampleSheetVersion` | Detected version after `create_parser()` |

Shared interface — `SampleSheetV1` and `SampleSheetV2`

| Method / Attribute | Returns | Description |
|---|---|---|
| `.parse(do_clean=True)` | `None` | Parse all sections |
| `.samples()` | `list[dict]` | One record per unique sample |
| `.index_type()` | `str` | `"dual"`, `"single"`, or `"none"` |
| `.adapters` | `list[str]` | All configured adapter sequences |
| `.experiment_name` | `str \| None` | Run or experiment name |
| `.read_lengths` / `.reads` | `list[int]` / `dict` | Read cycle lengths |
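One plausible way to derive the three `.index_type()` values from sample records is sketched below. This is a guess at the idea, not the library's logic (which may also consult header fields such as `Chemistry`); the helper name `infer_index_type` and the `index`/`index2` record keys are assumptions.

```python
def infer_index_type(samples: list[dict[str, str]]) -> str:
    """Classify a sheet as "dual", "single", or "none" from its records."""
    has_index1 = any(s.get("index") for s in samples)
    has_index2 = any(s.get("index2") for s in samples)
    if has_index1 and has_index2:
        return "dual"
    if has_index1 or has_index2:
        return "single"
    return "none"


print(infer_index_type([{"sample_id": "S1", "index": "ATTACTCG", "index2": "TATAGCCT"}]))
print(infer_index_type([{"sample_id": "S1", "index": "ATTACTCG"}]))
print(infer_index_type([{"sample_id": "S1"}]))
```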

V1-specific

| Attribute | Type | Description |
|---|---|---|
| `.iem_version` | `str \| None` | e.g. `"5"` |
| `.instrument_type` | `str \| None` | e.g. `"NovaSeq 6000"`, `"MiSeq"` |
| `.application` | `str \| None` | e.g. `"FASTQ Only"` |
| `.assay` | `str \| None` | Library prep kit name |
| `.index_adapters` | `str \| None` | Illumina index set name |
| `.chemistry` | `str \| None` | `"Amplicon"` = dual index, `"Default"` = single/no index |
| `.adapter_read1` | `str` | Read 1 adapter (`Adapter` or `AdapterRead1` key) |
| `.adapter_read2` | `str` | Read 2 adapter (`AdapterRead2` key) |
| `.reverse_complement` | `int` | `0` = default, `1` = reverse-complement R2 (Nextera MP only) |
| `.flowcell_id` | `str \| None` | Parsed from experiment ID run folder name |
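For `.flowcell_id`, the sketch below shows one way a flowcell ID could be pulled out of a run folder name such as the `240115_A01234_0042_AHJLG7DRXX` value used earlier. The layout `<date>_<instrument>_<run number>_<side><flowcell>` is an assumption about the naming convention, the helper is hypothetical, and whether the library strips the leading `A`/`B` side letter is not stated in this README.

```python
import re
from typing import Optional

# Assumed layout: <date>_<instrument>_<run number>_<side A|B><flowcell>
RUN_FOLDER_RE = re.compile(
    r"^(\d{6,8})_([A-Za-z0-9]+)_(\d+)_([AB])([A-Za-z0-9]+)$"
)


def flowcell_from_run_folder(name: str) -> Optional[str]:
    """Return the flowcell portion of a run folder name, or None if no match."""
    match = RUN_FOLDER_RE.match(name)
    return match.group(5) if match else None


print(flowcell_from_run_folder("240115_A01234_0042_AHJLG7DRXX"))  # HJLG7DRXX
print(flowcell_from_run_folder("not-a-run-folder"))               # None
```

Names that do not follow this layout return `None`, matching the `str | None` type in the table.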

V2-specific

| Method / Attribute | Returns | Description |
|---|---|---|
| `.get_umi_length()` | `int` | UMI length from `OverrideCycles` |
| `.get_read_structure()` | `ReadStructure` | Full decoded read structure |
| `.instrument_platform` | `str \| None` | e.g. `"NovaSeqXSeries"` |
| `.software_version` | `str \| None` | BCLConvert version string |
| `.custom_fields` | `dict[str, set[str]]` | Non-standard fields by section |

Example sample sheets

The `examples/sample_sheets/` directory contains ready-to-use reference sheets for every supported configuration:

| File | Format | Instrument | UMI | Use case |
|---|---|---|---|---|
| `v1_dual_index.csv` | V1 | NovaSeq 6000 | No | Standard WGS, multi-lane |
| `v1_single_index.csv` | V1 | NextSeq 500 | No | Small RNA |
| `v1_multi_lane.csv` | V1 | NovaSeq 6000 | No | 4 lanes, mixed projects |
| `v2_novaseq_x_dual_index.csv` | V2 | NovaSeq X | No | Standard PE150 |
| `v2_with_index_umi.csv` | V2 | NovaSeq X | Index1 UMI (9 bp) | cfDNA / liquid biopsy |
| `v2_with_read_umi.csv` | V2 | NovaSeq X | Read UMI (5 bp) | Duplex sequencing |
| `v2_nextseq_single_index.csv` | V2 | NextSeq 1000/2000 | No | Amplicon panel |

Run the demo to parse all of them:

python examples/parse_examples.py

V1 adapter key reference

Per the Illumina IEM specification, the V1 `[Settings]` adapter keys are:

[Settings]
ReverseComplement,0
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Adapter = Read 1 adapter (primary key per IEM spec)
AdapterRead2 = Read 2 adapter (explicit separate key)
AdapterRead1 = BCLConvert V1-mode alias for Adapter (also accepted)
ReverseComplement,1 = only for Nextera Mate Pair libraries; 0 for everything else
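The key rules above can be sketched as a small resolver over a parsed `[Settings]` mapping. This is an illustration, not the V1 parser's code: `resolve_adapters` and its output keys are hypothetical, and preferring `Adapter` over `AdapterRead1` when both are present is an assumption.

```python
def resolve_adapters(settings: dict[str, str]) -> dict[str, object]:
    """Pick adapters and the reverse-complement flag out of V1 [Settings]."""
    # `Adapter` is the primary key; `AdapterRead1` is accepted as an alias.
    read1 = settings.get("Adapter") or settings.get("AdapterRead1", "")
    read2 = settings.get("AdapterRead2", "")
    # A missing or empty ReverseComplement value is treated as 0.
    reverse_complement = int(settings.get("ReverseComplement") or 0)
    return {
        "adapter_read1": read1,
        "adapter_read2": read2,
        "reverse_complement": reverse_complement,
    }


settings_text = """[Settings]
ReverseComplement,0
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
"""

# Skip the section header and split each remaining line into key and value.
settings = dict(
    line.split(",", 1) for line in settings_text.splitlines()[1:] if "," in line
)
print(resolve_adapters(settings))
```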

Project structure

samplesheet-parser/
├── samplesheet_parser/
│   ├── __init__.py          # Public API
│   ├── factory.py           # SampleSheetFactory — auto-detection
│   ├── enums.py             # SampleSheetVersion, IndexType, ...
│   ├── validators.py        # SampleSheetValidator, ValidationResult
│   └── parsers/
│       ├── v1.py            # IEM V1 parser (bcl2fastq)
│       └── v2.py            # BCLConvert V2 parser (NovaSeq X)
├── tests/
│   ├── conftest.py          # Shared fixtures
│   ├── test_factory.py
│   ├── test_parsers/
│   │   ├── test_v1.py
│   │   └── test_v2.py
│   └── test_validators/
│       └── test_validators.py
├── examples/
│   ├── parse_examples.py    # Demo script
│   └── sample_sheets/       # Reference SampleSheet.csv files
├── pyproject.toml
├── LICENSE
└── README.md

Development

git clone https://github.com/chaitanyakasaraneni/samplesheet-parser
cd samplesheet-parser
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run example demo
python examples/parse_examples.py

Citation

If you use this library in a published pipeline or analysis, please cite:

@software{kasaraneni2026samplsheetparser,
  author  = {Kasaraneni, Chaitanya},
  title   = {samplesheet-parser: Format-agnostic parser for Illumina SampleSheet.csv},
  year    = {2026},
  url     = {https://github.com/chaitanyakasaraneni/samplesheet-parser},
  version = {0.1.3}
}

License

Apache 2.0 — see LICENSE.
