A package to analyze the data generated by Hi-C Capture for ssDNA

# Hi-C ssDNA: a project that analyzes fastq files generated by ssDNA Hi-C Capture, developed by the Piazza lab

## Description

This project analyzes the sequencing data generated by the ssDNA Hi-C Capture protocol. The package contains two main modules:

### 1. oligos_replacement

This module builds a new genome from the original genome and the oligos designed for the ssDNA Hi-C Capture protocol. The new genome is a copy of the original, except in the oligo regions, where the sequence is replaced by the oligo sequence. The module then appends to the new genome an artificial chromosome named chr_art, which is a concatenation of the original sequences of the oligos together with their flanking regions.
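The replacement step described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the function name `replace_oligos` and the in-memory dictionary layout are hypothetical, the modified sequence is assumed to have the same length as the region it replaces, flanks are assumed to be clipped at chromosome ends, and oligo orientation is ignored.

```python
def replace_oligos(genome, oligos, flank):
    """Return a copy of `genome` with oligo regions replaced, plus chr_art.

    genome: dict mapping chromosome name -> sequence string
    oligos: list of dicts with keys chr, start, end, sequence_modified
            (start and end are 1-based and inclusive)
    flank:  length of the flanking region kept on each side in chr_art
    """
    new_genome = dict(genome)
    art_parts = []
    for oligo in oligos:
        chrom, start, end = oligo["chr"], oligo["start"], oligo["end"]
        original = genome[chrom]

        # Original oligo sequence with its flanks (assumption: clipped at the ends).
        left = max(0, start - 1 - flank)
        right = min(len(original), end + flank)
        art_parts.append(original[left:right])

        # Assumption: the replacement keeps the chromosome length unchanged,
        # so coordinates of later oligos stay valid.
        modified = oligo["sequence_modified"]
        if len(modified) != end - start + 1:
            raise ValueError(f"length mismatch for oligo on {chrom}:{start}-{end}")
        seq = new_genome[chrom]
        new_genome[chrom] = seq[:start - 1] + modified + seq[end:]

    # The artificial chromosome is added after the regular chromosomes.
    new_genome["chr_art"] = "".join(art_parts)
    return new_genome


example_genome = {"chr1": "AAAACCCCGGGGTTTT"}
example_oligos = [{"chr": "chr1", "start": 5, "end": 8, "sequence_modified": "CTCT"}]
result = replace_oligos(example_genome, example_oligos, flank=2)
```

In this toy example, positions 5 to 8 of chr1 are overwritten with the modified sequence, while chr_art keeps the original four bases plus two flanking bases on each side.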

Also, the program creates a .bed file that contains the coordinates of the oligos in the new genome and in the artificial chromosome and indicates if the sequence is a flanking region or the oligo itself.

With the new genome, the user can run the hicstuff package, which creates a fragments_list file and a contacts file. Both are tsv files (hicstuff writes them with a .txt extension). Please check the hicstuff documentation for the structure of those files: https://github.com/koszullab/hicstuff#file-formats.

The next module, contacts_filter, requires both files in the correct format.

### 2. contacts_filter

This module filters the contacts. It removes the contacts in which none of the fragments contains an oligo.
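A minimal sketch of this kind of filter, assuming the rule is to keep a contact only when at least one of its two fragments overlaps an oligo. The function names and the in-memory data layout are hypothetical and do not reproduce the hicstuff file formats or the package's own code; coordinates are taken to be 1-based and inclusive for the purpose of the illustration.

```python
def fragments_with_oligos(fragments, oligos):
    """Return the ids of fragments that overlap at least one oligo.

    fragments: dict mapping fragment id -> (chrom, start, end)
    oligos:    list of (chrom, start, end) tuples
    """
    hits = set()
    for frag_id, (chrom, f_start, f_end) in fragments.items():
        for o_chrom, o_start, o_end in oligos:
            # Two inclusive intervals overlap when each starts before the other ends.
            if chrom == o_chrom and f_start <= o_end and o_start <= f_end:
                hits.add(frag_id)
                break
    return hits


def filter_contacts(contacts, fragments, oligos):
    """Keep contacts (frag_a, frag_b, count) where either fragment overlaps an oligo."""
    keep = fragments_with_oligos(fragments, oligos)
    return [c for c in contacts if c[0] in keep or c[1] in keep]


example_fragments = {0: ("chr1", 1, 100), 1: ("chr1", 101, 200), 2: ("chr2", 1, 150)}
example_oligos = [("chr1", 120, 180)]
example_contacts = [(0, 1, 5), (0, 2, 3), (1, 2, 7)]
filtered = filter_contacts(example_contacts, example_fragments, example_oligos)
```

Here only fragment 1 overlaps the oligo, so the contact between fragments 0 and 2 is dropped.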

## Dependencies

Python3 dependencies:

• pandas
• sys (part of the Python standard library)
• getopt (part of the Python standard library)

## Installation

The easiest way to install hic-ssdna is with pip:

pip3 install hic-ssdna


## Run the program

Once installed, you can run the first main script oligos_replacement like this:

hic-ssdna.oligos_replacement <arg1> <arg2> <arg3> <arg4> <arg5>


It takes five arguments:

• The original genome path
• The oligos file path
• An output path where the new genome will be created
• An output path where the .bed file will be created
• The length of the flanking region you want

The second main script, contacts_filter, is run like this:

hic-ssdna.contacts_filter <arg1> <arg2> <arg3> <arg4>


It takes four arguments:

• The oligos file path -o <oligos_input.csv>
• The fragments file path -f <fragments_input.txt> (produced by hicstuff)
• The contacts file path -c <contacts_input.txt> (produced by hicstuff)
• An output path where the filtered contacts file will be saved -O <output_contacts_filtered.csv>

You can call the script with the -h argument to see the help message.

## Formats and conventions

Follow the conventions below for the project to work correctly.

### File formats

• Genomes: fasta
• Oligos file: csv (with col sep = ',')

### Oligo file structure

This file must contain at least the columns below, with exactly these headers:

chr start end orientation type name sequence_original sequence_modified

• In the chr column, the entire description line of the chromosome in the fasta file, without the chevron >

• In the start column, the position of the first nucleotide (included) of the oligo (the first nucleotide of the chromosome is the number 1)

• In the end column, the position of the last nucleotide (included) of the oligo

• In the orientation column, C for Crick and W for Watson

• In the type column, one of: ss (ssDNA Hi-C oligos captured), ss_neg (ssDNA negative control, ssDNA Hi-C oligos not captured), ds (dsDNA Hi-C oligos captured), ds_neg (dsDNA negative control, dsDNA Hi-C oligos not captured)

• In the name column, the name of the oligo. All names must be different

• In the sequence_original column, the original sequence of the oligo

• In the sequence_modified column, the modified sequence of the oligo

Oligos are numbered from 0: the first oligo in the file is number 0.
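The conventions above can be checked before running the modules. The following is a small sketch using only the Python standard library; the function name `load_oligos` and the example rows are hypothetical, and the checks only cover what this section states (headers, 1-based inclusive coordinates, orientation, type, unique names).

```python
import csv
import io

REQUIRED_COLUMNS = [
    "chr", "start", "end", "orientation", "type", "name",
    "sequence_original", "sequence_modified",
]
ALLOWED_TYPES = {"ss", "ss_neg", "ds", "ds_neg"}


def load_oligos(handle):
    """Read a comma-separated oligos file and check it against the conventions."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    oligos, seen_names = [], set()
    for number, row in enumerate(reader):  # the first oligo is number 0
        start, end = int(row["start"]), int(row["end"])
        if not 1 <= start <= end:
            raise ValueError(f"oligo {number}: start/end must be 1-based and ordered")
        if row["orientation"] not in {"W", "C"}:
            raise ValueError(f"oligo {number}: orientation must be W or C")
        if row["type"] not in ALLOWED_TYPES:
            raise ValueError(f"oligo {number}: unknown type {row['type']!r}")
        if row["name"] in seen_names:
            raise ValueError(f"oligo {number}: duplicate name {row['name']!r}")
        seen_names.add(row["name"])
        row["start"], row["end"] = start, end
        oligos.append(row)
    return oligos


# Made-up rows, for illustration only.
EXAMPLE_CSV = (
    "chr,start,end,orientation,type,name,sequence_original,sequence_modified\n"
    "chr1 example,5,8,W,ss,probe_a,CCCC,CTCT\n"
    "chr1 example,20,23,C,ds_neg,probe_b,GGGG,GAGA\n"
)
oligos = load_oligos(io.StringIO(EXAMPLE_CSV))
```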

### .bed file structure

The .bed file generated is a bed4.

• The first column indicates the chromosome name (the classic genome or the artificial chromosome)
• The second column contains the position of the first nucleotide (included)
• The third column contains the position of the last nucleotide (included)
• The fourth column is oligo x, where x is the oligo number. It is followed by flank 5' or flank 3' when the sequence is the flanking region on the 5' or 3' side, and by nothing when the sequence is the oligo itself.
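As an illustration of this layout, the sketch below builds the three bed4 lines for one oligo. It rests on several assumptions that the text above does not settle: the helper name `bed4_rows` is hypothetical, the exact spelling of the label in the fourth column is a guess, flanks are clipped at chromosome ends, and the 5' flank is taken to be the lower-coordinate side for Watson (W) oligos and the higher-coordinate side for Crick (C) oligos.

```python
def bed4_rows(number, chrom, start, end, flank, chrom_length, orientation="W"):
    """Return tab-separated bed4 lines for one oligo and its two flanks.

    Positions are inclusive on both sides, as described above.
    """
    # Assumption: which side is 5' depends on the strand of the oligo.
    if orientation == "W":
        left_label, right_label = "flank 5'", "flank 3'"
    else:
        left_label, right_label = "flank 3'", "flank 5'"

    rows = []
    left_start = max(1, start - flank)
    if left_start < start:
        rows.append((chrom, left_start, start - 1, f"oligo {number} {left_label}"))
    rows.append((chrom, start, end, f"oligo {number}"))
    right_end = min(chrom_length, end + flank)
    if right_end > end:
        rows.append((chrom, end + 1, right_end, f"oligo {number} {right_label}"))
    return ["\t".join(str(field) for field in row) for row in rows]


lines = bed4_rows(0, "chr1", 5, 8, flank=2, chrom_length=16)
```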

