Designing CRISPR-Cas guide RNAs in bacteria.

🌱 crispio

Design and annotate bacterial CRISPRi guide RNA libraries from any genome.

CRISPRi uses a catalytically dead Cas9 to silence genes by blocking transcription. Designing a good library means knowing not just where a guide targets, but how far upstream of the TSS it lands, which replichore it sits on, whether it shares a seed sequence with another guide, and whether it contains a restriction site that would break your cloning. crispio computes all of this in one pass and outputs annotated GFF3 that loads directly into any genome browser.

crispio generate --pam Spy -g genome.fasta -a genome.gff3 > guides.gff

Quick start

You need two files, both available for any sequenced bacterium from NCBI:

  • FASTA — the genome sequence (.fasta / .fa)
  • GFF3 — gene annotations (.gff / .gff3)

Try crispio on the first 100 guides straight away with --limit:

crispio generate \
  --pam Spy \
  --genome EcoMG1655-NC_000913.3.fasta \
  --annotations EcoMG1655-NC_000913.3.gff3 \
  --limit 100 \
  > first100.gff

Convert to a spreadsheet-friendly table with bioino:

cat first100.gff | bioino gff2table > first100.tsv

Open first100.tsv in Excel. Each row is one guide. The most useful columns at a glance:

| Column | Example | What it means |
|---|---|---|
| Name | thrL-21-modest_saddle | gene-position-mnemonic |
| guide_sequence | GCTTTTCATTCTGACTGCAA | The 20 nt spacer to synthesise |
| pam_offset | -166 | Distance from PAM to gene start. Negative = upstream of the gene start, the productive targeting window for CRISPRi |
| pam_replichore | R | Left or right replichore; matters for efficiency in fast-growing bacteria |
| ann_locus_tag | b0001 | Systematic gene ID for programmatic filtering |
| guide_re_sites | BbsI | Restriction sites in the spacer that would break Golden Gate cloning |

What you get

Every guide gets a stable, human-readable mnemonic (modest_saddle, bouncy_sabine) that is a deterministic hash of the guide sequence, PAM, and position. The same guide always gets the same mnemonic regardless of when you run crispio or what else is in the library. Use it to refer to guides in lab notebooks and across collaborators without copying 20-character sequences.
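To illustrate why a hash-derived name stays stable, here is a minimal sketch. It is not crispio's actual algorithm: the word lists, the `mnemonic` function, the choice of SHA-256, and the example PAM and position are all assumptions made for illustration.

```python
import hashlib

# Hypothetical word lists; a real tool would use a much larger vocabulary.
ADJECTIVES = ["modest", "bouncy", "quiet", "brisk"]
NOUNS = ["saddle", "sabine", "harbor", "lantern"]


def mnemonic(guide_seq: str, pam_seq: str, pam_start: int) -> str:
    """Derive an adjective_noun label from a hash of the guide's identity."""
    key = f"{guide_seq}:{pam_seq}:{pam_start}".encode()
    digest = hashlib.sha256(key).digest()
    # Separate bytes of the digest pick each word.
    adjective = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    noun = NOUNS[digest[1] % len(NOUNS)]
    return f"{adjective}_{noun}"


# The PAM and position here are made-up example values.
first = mnemonic("GCTTTTCATTCTGACTGCAA", "CGG", 169)
second = mnemonic("GCTTTTCATTCTGACTGCAA", "CGG", 169)
print(first == second)  # True: identical inputs always give the same label
```

Because the label depends only on the guide's own sequence, PAM, and position, adding or removing other guides from the library cannot change it.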

The pam_offset is signed: negative means the PAM is upstream of the annotated gene start, which is the productive targeting window for bacterial CRISPRi. Positive values target inside the coding sequence. Filter on it directly:

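# Find the pam_offset column by its header, then keep guides 1-299 bp upstream
cat guides.gff | bioino gff2table \
  | awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) if ($i=="pam_offset") c=i; print; next }
                c && $c+0 < 0 && $c+0 > -300' \
  > upstream_guides.tsv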
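The same filter can be sketched in Python, selecting the column by name rather than by position. The inline table is a made-up excerpt (the second row is hypothetical), and `upstream_rows` is an illustrative helper, not part of crispio or bioino.

```python
import csv
import io

# Hypothetical excerpt of a gff2table output; real tables have more columns.
tsv_text = (
    "Name\tguide_sequence\tpam_offset\n"
    "thrL-21-modest_saddle\tGCTTTTCATTCTGACTGCAA\t-166\n"
    "thrL-250-example_name\tACGTACGTACGTACGTACGT\t63\n"
)


def upstream_rows(handle, window=300):
    """Yield rows whose pam_offset lies strictly between -window and 0."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if -window < int(row["pam_offset"]) < 0:
            yield row


kept = list(upstream_rows(io.StringIO(tsv_text)))
print([row["Name"] for row in kept])  # ['thrL-21-modest_saddle']
```

Looking the column up by its header keeps the filter working if the column order of the table changes.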

Output is standard GFF3 and loads as an annotation track in IGV and Artemis — useful for visually checking guide distribution across the chromosome before ordering.
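Since the output is plain GFF3, it can also be read with a few lines of Python. The sketch below follows the general GFF3 layout (nine tab-separated columns, with `key=value` pairs separated by `;` in the last one). The example line is invented: its feature type, source, and coordinates are assumptions, not real crispio output.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Split one GFF3 feature line into its fixed columns and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    # Attribute keys and values are percent-encoded in GFF3, so decode them.
    record["attributes"] = {
        unquote(key): unquote(value)
        for key, _, value in (pair.partition("=") for pair in fields[8].split(";") if pair)
    }
    return record


# Hypothetical feature line for illustration only.
example = (
    "NC_000913.3\tcrispio\tprotospacer\t24\t43\t.\t+\t.\t"
    "Name=thrL-21-modest_saddle;guide_sequence=GCTTTTCATTCTGACTGCAA;pam_offset=-166"
)
record = parse_gff3_line(example)
print(record["attributes"]["pam_offset"])  # -166
```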


Generating a new library

crispio generate finds every PAM site in the genome, extracts the adjacent spacer, and annotates everything in one pass.

crispio generate \
  --pam Sth1 \
  --max_length 20 \
  --genome EcoMG1655-NC_000913.3.fasta \
  --annotations EcoMG1655-NC_000913.3.gff3 \
  --output guides.gff
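The PAM search itself can be sketched in a few lines of Python. This is a simplified illustration, not crispio's implementation: `find_guides` is a hypothetical helper, the PAM is written directly as a regular expression (`[ACGT]GG` for NGG), and no annotation is attached to the hits.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_guides(genome, pam_regex="[ACGT]GG", length=20):
    """Return (strand, spacer) for every spacer lying directly 5' of a PAM."""
    # The lookahead reports overlapping sites instead of skipping them.
    pattern = re.compile(rf"(?=([ACGT]{{{length}}}){pam_regex})")
    hits = []
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        hits.extend((strand, match.group(1)) for match in pattern.finditer(seq))
    return hits


genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
print(find_guides(genome))
# [('+', 'ATACCGTTTTTTTAAAAAAA'), ('-', 'TATCCGTTTTTTTAAAAAAA')]
```

Scanning the reverse complement with the same pattern is what picks up guides on the opposite strand.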

For multi-chromosome genomes (chromosome + plasmids), pass a FASTA with multiple sequences. Each sequence is processed independently and guides are tagged with the correct chromosome identifier.
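As a rough picture of what processing each sequence independently means, the sketch below splits a multi-record FASTA into (identifier, sequence) pairs. `read_fasta` and the two example records are hypothetical and say nothing about how crispio parses files internally.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) for each record in a FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the header text up to the first whitespace.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-replicon genome: a chromosome and one plasmid.
fasta_text = ">chromosome example description\nACGTAC\nGTTT\n>plasmid_p1\nggcc\n"
records = dict(read_fasta(io.StringIO(fasta_text)))
print(records)  # {'chromosome': 'ACGTACGTTT', 'plasmid_p1': 'GGCC'}
```

Each identifier would then become the sequence ID of the guides found on that replicon.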

Use --limit N for quick exploratory runs or to generate a capped sub-library:

crispio generate --pam Spy -g genome.fasta -a genome.gff3 --limit 500

Annotating guides from the literature

This is one of the most useful things crispio does: take a published guide library and fully re-annotate it against your genome. It doesn't require matching coordinates or assemblies — it searches by sequence, so it works across strains.

If you have a TSV with a sequence column and a guide_name column:

cat published_library.tsv \
  | bioino table2fasta --sequence sequence --name guide_name \
  | crispio map \
      --pam Spy \
      --genome EcoMG1655-NC_000913.3.fasta \
      --annotations EcoMG1655-NC_000913.3.gff3 \
  > annotated_library.gff

Or from an existing FASTA of spacers:

crispio map \
  published_spacers.fasta \
  --pam Spy \
  --genome EcoMG1655-NC_000913.3.fasta \
  --annotations EcoMG1655-NC_000913.3.gff3 \
  > annotated_library.gff

Guides not found in the genome are reported to stderr and skipped — they never appear silently with wrong coordinates.
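Mapping by sequence can be pictured as follows. This is a simplified sketch under stated assumptions: `map_spacer` is a hypothetical helper, matching is exact, the PAM is given as a regular expression, and the second spacer is an invented sequence that is absent from the example genome.

```python
import re
import sys

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def map_spacer(spacer, genome, pam_regex="[ACGT]GG"):
    """Return (strand, start) wherever the spacer is directly followed by a PAM.

    Start positions are counted on the strand that was scanned.
    """
    pattern = re.compile(rf"(?={re.escape(spacer)}{pam_regex})")
    hits = []
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        hits.extend((strand, match.start()) for match in pattern.finditer(seq))
    return hits


genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
spacers = {"known": "ATGATCGATCGATCG", "absent": "ACGTACGTACGTACG"}

mapped = {}
for name, spacer in spacers.items():
    hits = map_spacer(spacer, genome)
    if not hits:
        # Unmapped guides are reported, never emitted with guessed coordinates.
        print(f"{name}: not found in genome, skipped", file=sys.stderr)
        continue
    mapped[name] = hits

print(sorted(mapped))  # ['known']
```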


Checking for off-targets

crispio offtarget flags pairs of guides that share a 4 nt PAM-proximal seed sequence and differ by ≤ 4 mismatches elsewhere. These are candidates for unintended cross-silencing.

# Check a library against itself
crispio offtarget --gff2 guides.gff < guides.gff > checked.gff

Flagged guides get a crosstalk attribute listing the IDs and distances of matches. Check two libraries against each other — for example, confirming that guides from one experiment won't interfere with another:

crispio offtarget --gff2 library_b.gff < library_a.gff > crosstalk.gff
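The seed-plus-mismatch rule can be sketched like this. It is an illustration of the idea, not crispio's code: `crosstalk_pairs` is a hypothetical helper, it assumes equal-length spacers with the PAM at their 3' end, and guides g2 to g4 are invented variants.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def crosstalk_pairs(guides, seed_len=4, max_mismatches=4):
    """Return (id_a, id_b, mismatches) for guides that share a PAM-proximal seed."""
    by_seed = {}
    for guide_id, seq in guides.items():
        # The PAM follows the 3' end of the spacer, so the seed is its last bases.
        by_seed.setdefault(seq[-seed_len:], []).append(guide_id)
    pairs = []
    for ids in by_seed.values():
        for a, b in combinations(ids, 2):
            distance = hamming(guides[a][:-seed_len], guides[b][:-seed_len])
            if distance <= max_mismatches:
                pairs.append((a, b, distance))
    return pairs


guides = {
    "g1": "GCTTTTCATTCTGACTGCAA",
    "g2": "GCTTAACATTCTGACTGCAA",    # same seed as g1, two mismatches elsewhere
    "g3": "A" * 16 + "GCAA",         # same seed, but too different elsewhere
    "g4": "GCTTTTCATTCTGACTGTTT",    # different seed
}
print(crosstalk_pairs(guides))  # [('g1', 'g2', 2)]
```

Grouping by seed first means only guides with identical seeds are ever compared, which keeps the pairwise step small.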

Adding ML features

crispio featurize appends sequence-based features for downstream activity prediction, prefixed feat_ in the output.

cat guides.gff | crispio featurize --scaffold Sth1 > guides_featurized.gff

Available features:

>>> from crispio import get_features
>>> get_features()
['on_nontemplate_strand', 'context_up2', 'context_down2', 'context_up_autocorr',
 'pam_n', 'pam_def', 'pam_gc', 'pam_autocorr', 'pam_scaff_corr',
 'guide_purine', 'guide_gc', 'seed_seq', 'guide_start3', 'guide_end3',
 'guide_autocorr', 'guide_scaff_corr']

--scaffold takes a name (Sth1, PerturbSeq) or a raw scaffold sequence. Use the scaffold for the Cas9 you are working with — the correlation-based features depend on it.
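For a feel of what the simpler sequence features might contain, here is a minimal sketch of two composition features. The definitions (fraction of G or C, fraction of A or G) and the three-decimal string formatting are assumptions based on the feature names; crispio's exact definitions may differ, and the scaffold-dependent features are not covered.

```python
def fraction(seq, letters):
    """Fraction of positions in seq whose base is one of the given letters."""
    seq = seq.upper()
    return sum(base in letters for base in seq) / len(seq)


def simple_features(guide_seq):
    """Return assumed guide_gc and guide_purine features as formatted strings."""
    return {
        "feat_guide_gc": f"{fraction(guide_seq, 'GC'):.3f}",
        "feat_guide_purine": f"{fraction(guide_seq, 'AG'):.3f}",
    }


print(simple_features("GCTTTTCATTCTGACTGCAA"))
# {'feat_guide_gc': '0.400', 'feat_guide_purine': '0.350'}
```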


Piping commands together

All subcommands read from stdin and write to stdout. Informational messages go to stderr only, so they never appear in your data stream. Full pipelines with no intermediate files:

# Generate → featurize → table
crispio generate --pam Spy -g genome.fasta -a genome.gff3 \
  | crispio featurize --scaffold Sth1 \
  | bioino gff2table \
  > full_library.tsv

# Map a published library → off-target check → table
cat published_spacers.fasta \
  | crispio map --pam Spy -g genome.fasta -a genome.gff3 \
  | crispio offtarget -2 <(crispio generate --pam Spy -g genome.fasta -a genome.gff3) \
  | bioino gff2table \
  > mapped_checked.tsv
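The stdout-for-data, stderr-for-messages convention is what makes this chaining safe. A custom filter that slots into such a pipeline could follow the same pattern, as in this sketch. `passthrough` is a hypothetical function and the GFF line is invented.

```python
import io
import sys


def passthrough(source, sink, log=sys.stderr):
    """Copy GFF3 lines from source to sink and send progress notes to log."""
    n_features = 0
    for line in source:
        sink.write(line)
        # Blank lines and '#' directives are copied but not counted as features.
        if line.strip() and not line.startswith("#"):
            n_features += 1
    print(f"processed {n_features} feature lines", file=log)
    return n_features


# In a real filter this would be: passthrough(sys.stdin, sys.stdout)
gff_text = "##gff-version 3\nchr1\tdemo\tprotospacer\t1\t20\t.\t+\t.\tID=g1\n"
data_out, messages = io.StringIO(), io.StringIO()
passthrough(io.StringIO(gff_text), data_out, log=messages)
print(data_out.getvalue() == gff_text)  # True: the data stream is untouched
```

Because the message goes to a separate stream, the next command in the pipe only ever sees valid GFF3.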

Python API

Generate guides de novo:

from crispio import GuideLibrary

genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
gl = GuideLibrary.from_generating(genome=genome, pam_search="NGG")

for match_collection in gl:
    for guide in match_collection:
        print(guide)
# ATACCGTTTTTTTAAAAAAA
# TATCCGTTTTTTTAAAAAAA

Map known sequences to a genome:

from crispio import GuideLibrary

genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
gl = GuideLibrary.from_mapping(
    guide_seq=["ATGATCGATCGATCG"],
    genome=genome,
    pam_search="NGG",
)

for collection in gl:
    for match in collection:
        print(match.guide_seq, match.pam_start, match.reverse)

Calculate features:

from crispio import featurize
from crispio.utils import sequences

# gff_line is a bioino.GffLine with guide_sequence, pam_sequence, etc.
scaffold_seq = sequences.scaffolds["Sth1"]
features = featurize(gff_line, scaffold=scaffold_seq)
# {"feat_guide_gc": "0.500", "feat_seed_seq": "GATCG", ...}

Pass the scaffold sequence, not the name, to featurize. Use sequences.scaffolds["Sth1"] to retrieve it.

Full API reference: crispio.readthedocs.io


Installation

Requires Python ≥ 3.10.

pip install crispio

Verify:

crispio --help

From source:

git clone https://github.com/scbirlab/crispio.git
cd crispio
pip install -e .

PAMs and scaffolds

Built-in PAM names for --pam:

| Name | IUPAC | Cas9 |
|---|---|---|
| Spy | NGGN | SpCas9 (S. pyogenes) |
| Sth1 | NNRGVAN | StCas9-1 (S. thermophilus) |
| Sau | NGRRT | SaCas9 (S. aureus) |
| Nme | NNNNGAT | NmeCas9 (N. meningitidis) |

Built-in scaffold names for --scaffold:

| Name | Description |
|---|---|
| Sth1 | StCas9-1 scaffold |
| PerturbSeq | Perturb-seq optimised scaffold |

Any IUPAC sequence can be passed directly to either argument.
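When writing a custom PAM, it can help to see which concrete sequences an IUPAC pattern accepts. The sketch below expands the standard IUPAC nucleotide codes into a regular expression. `iupac_to_regex` is a hypothetical helper for checking patterns by hand, not a crispio function.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC nucleotide pattern into an equivalent regex string."""
    return "".join(IUPAC[base] for base in pattern.upper())


print(iupac_to_regex("NGRRT"))  # [ACGT]G[AG][AG]T
print(bool(re.fullmatch(iupac_to_regex("NNRGVAN"), "AAAGAAG")))  # True
```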


Issues and documentation
