Skip to main content

Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data

Project description

SEQLinkage

Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data

Features

  • It can do linkage analysis on single variants and CHP markers.
  • It can deal with families from different population.
  • It can handle large-scale whole-genome linkage analysis.

Pre-requisites

Make sure you install the pre-requisited before running seqlink:

#install cstatgen
conda install -c conda-forge xeus-cling
conda install -c anaconda swig 
conda install -c conda-forge gsl
pip install egglib
git clone https://github.com/statgenetics/cstatgen.git
cd cstatgen
python setup.py install

#install paramlink2
R
install.packages("paramlink2")

Install

pip install SEQLinkage

How to use

!seqlink --help
usage: seqlink [-h] [--chp] --fam FILE --vcf FILE [--anno FILE] [--pop FILE]
               [--included-vars FILE] [-b FILE] [-c P] [-o Name]
               [--build STRING] [--freq INFO] [--chrom-prefix STRING]
               [--run-linkage] [-K FLOAT] [--moi STRING] [-W FLOAT] [-M FLOAT]
               [--theta-max FLOAT] [--theta-inc FLOAT]

SEQLinkage V2, linkage analysis using sequence data

options:
  -h, --help            show this help message and exit

Collapsed haplotype pattern method arguments:
  --chp                 Generate CHP markers.
  --fam FILE            Input pedigree and phenotype information in FAM
                        format.
  --vcf FILE            Input VCF file, bgzipped.
  --anno FILE           Input annotation file from annovar.
  --pop FILE            Input two columns file, first column is family ID,
                        second column population information.
  --included-vars FILE  Variants to be included for linkage analysis, if None,
                        the analysis won't filter any variants. But you can
                        still set AF cutoff by -c argment.
  -b FILE, --blueprint FILE
                        Blueprint file that defines regional marker (format:
                        "chr startpos endpos name avg.distance male.distance
                        female.distance").
  -c P, --maf-cutoff P  MAF cutoff to define variants to be excluded from
                        analyses. this is useful, if you need to analyse
                        multiple population together.
  -o Name, --output Name
                        Output name prefix.
  --build STRING        Reference genome version for VCF file.
  --freq INFO           Info field name for allele frequency in VCF file.
  --chrom-prefix STRING
                        Prefix to chromosome name in VCF file if applicable,
                        e.g. "chr".

LINKAGE options:
  --run-linkage         Perform Linkage analysis.
  -K FLOAT, --prevalence FLOAT
                        Disease prevalence. Default to 0.001.
  --moi STRING          Mode of inheritance, AD/AR: autosomal
                        dominant/recessive. Default to AD.
  -W FLOAT, --wt-pen FLOAT
                        Penetrance for wild type. Default to 0.01.
  -M FLOAT, --mut-pen FLOAT
                        Penetrance for mutation. Default to 0.9.
  --theta-max FLOAT     Theta upper bound. Default to 0.5.
  --theta-inc FLOAT     Theta increment. Default to 0.05.

Linkage analysis on specific regions

Normally, the regions are gene regions. you can also use self-defined regions, such as promoter regions, enhancer regions.

1.run seqlink on CHP marker

!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_chp --chp --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_chp', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...
MESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units
SNVHap MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
MESSAGE: write to pickle: data/test_chp/chr1result/chr1result0.pickle,Gene number:4,Time:0.00012452216405007575
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.3767304867506027
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============

2.run seqlink on variants

!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_var --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Namespace(chp=False, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_var', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...
MESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units
MESSAGE: write to pickle: data/test_var/chr1result/chr1result0.pickle,Gene number:4,Time:3.229235919813315e-05
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.6164871137589216
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============

No annotation

If you don't have the annotation file. there is no need to add --pop. And --freq should be setted based on the INFO column in vcf file.

!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --freq='AF' --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_chp_na --chp --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno=None, pop=None, included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_chp_na', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...
MESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units
SNVHap MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
MESSAGE: write to pickle: data/test_chp_na/chrallresult/chrallresult0.pickle,Gene number:4,Time:6.505574979301956e-05
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.input
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.lods
0.3394373469054699
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============

Whole-genome linkage analysis

if --blueprint is not provided, the genomic region will be seperated to pseudogenes with 1000 variants.

!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt -c 0.05 -o data/test_chp_wg --chp --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Generate regions by annotation
MESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint=None, maf_cutoff=0.05, output='data/test_chp_wg', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: separate chromosome to regions
MESSAGE: 6 families with a total of 18 samples will be scanned for 1 pre-defined units
MESSAGE: write to pickle: data/test_chp_wg/chr1result/chr1result0.pickle,Gene number:1,Time:0.00011511449753824207
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.23506776429712772
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============

Input format

  • --fam, Fam file (required, format: "fid iid fathid mothid sex trait[1 control, 2 case, -9 or 0 missing]")
%%writefile testdata/test_ped.fam
1033    1033_2  0       0       2       -9
1033    1033_1  0       0       1       -9
1033    1033_99 1033_1  1033_2  2       1
1033    1033_9  1033_1  1033_2  2       1
1033    1033_3  1033_1  1033_2  2       2
1036    1036_99 1036_1  1036_2  2       2
1036    1036_6  0       0       1       2
1036    1036_1  0       0       1       -9
1036    1036_3  1036_6  1036_99 2       1
1036    1036_4  1036_6  1036_99 2       1
1036    1036_2  0       0       2       -9
1036    1036_5  1036_6  1036_99 1       1
10J_103 10J_103_10      0       0       1       -9
10J_103 10J_103_4       0       0       1       -9
10J_103 10J_103_3       0       0       2       -9
10J_103 10J_103_2       10J_103_4       10J_103_3       2       2
10J_103 10J_103_1       10J_103_10      10J_103_3       1       2
10J_109 10J_109_2       10J_109_6       10J_109_5       1       2
10J_109 10J_109_3       10J_109_6       10J_109_5       1       2
10J_109 10J_109_4       10J_109_6       10J_109_5       1       2
10J_109 10J_109_6       0       0       1       -9
10J_109 10J_109_1       10J_109_6       10J_109_5       1       2
10J_109 10J_109_5       0       0       2       2
10J_109 10J_109_7       10J_109_6       10J_109_5       1       1
10J_112 10J_112_3       0       0       1       1
10J_112 10J_112_5       10J_112_3       10J_112_2       1       2
10J_112 10J_112_1       10J_112_3       10J_112_2       2       1
10J_112 10J_112_7       10J_112_3       10J_112_2       1       1
10J_112 10J_112_2       0       0       2       2
10J_119 10J_119_2       0       0       1       1
10J_119 10J_119_5       0       0       2       1
10J_119 10J_119_4       0       0       1       1
10J_119 10J_119_6       10J_119_4       10J_119_5       1       2
10J_119 10J_119_7       10J_119_4       10J_119_5       2       2
10J_119 10J_119_1       10J_119_4       10J_119_5       2       2
10J_119 10J_119_3       10J_119_2       10J_119_1       1       1
Overwriting ../testdata/test_ped.fam
  • --vcf, VCF file (required, vcf.gz with vcf.gz.tbi)
bgzip -c file.vcf > file.vcf.gz
tabix -p vcf file.vcf.gz
  • --anno, Annotation file from ANNOVAR, It must contains the allele frequency for population in the file of family population information. For example in here, The annotation file must have AF_amr, AF_afr, AF_nfe columns.
anno=pd.read_csv('testdata/test_chr1_anno.csv')
anno
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene ... CLNDISDB CLNREVSTAT CLNSIG DN ID Patient ID Phenotype Platform Study Pubmed ID Otherinfo1
0 1 10140 10147 ACCCTAAC A intergenic NONE;DDX11L1 dist=NONE;dist=1727 . . ... . . . . . . . . . chr1:10140:ACCCTAAC:A
1 1 10146 10147 AC A intergenic NONE;DDX11L1 dist=NONE;dist=1727 . . ... . . . . . . . . . chr1:10146:AC:A
2 1 10146 10148 ACC * . . . . . ... . . . . . . . . . chr1:10146:ACC:*
3 1 10150 10151 CT C intergenic NONE;DDX11L1 dist=NONE;dist=1723 . . ... . . . . . . . . . chr1:10150:CT:C
4 1 10172 10177 CCCTAA C intergenic NONE;DDX11L1 dist=NONE;dist=1697 . . ... . . . . . . . . . chr1:10172:CCCTAA:C
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 1 66479 66487 TATTTATAG * . . . . . ... . . . . . . . . . chr1:66479:TATTTATAG:*
996 1 66480 66481 AT A intergenic FAM138A;OR4F5 dist=30399;dist=2610 . . ... . . . . . . . . . chr1:66480:AT:A
997 1 66480 66483 ATTT * . . . . . ... . . . . . . . . . chr1:66480:ATTT:*
998 1 66481 66488 TTTATAGA T intergenic FAM138A;OR4F5 dist=30400;dist=2603 . . ... . . . . . . . . . chr1:66481:TTTATAGA:T
999 1 66481 66488 TTTATAGA * . . . . . ... . . . . . . . . . chr1:66481:TTTATAGA:*

1000 rows × 152 columns

anno.loc[:,['AF_amr', 'AF_afr', 'AF_nfe']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
AF_amr AF_afr AF_nfe
0 0.0006 0.0008 0.0007
1 0.6380 0.6300 0.6413
2 . . .
3 0.0357 0.0426 0.0370
4 0.0086 0.0097 0.0084
... ... ... ...
995 . . .
996 0.0039 0.0036 0.0076
997 . . .
998 0.0163 0.0410 0.0279
999 . . .

1000 rows × 3 columns

Or, create a self-defined annotation file like this:

	Chr	Start	AF_amr	AF	AF_nfe	AF_afr
chr1:10140:ACCCTAAC:A	1	10140	0.0006	0.0007	0.0007	0.0008
chr1:10146:AC:A	1	10146	0.638	0.6328	0.6413	0.63
chr1:10150:CT:C	1	10150	0.0357	0.0375	0.037	0.0426
chr1:10172:CCCTAA:C	1	10172	0.0086	0.0082	0.0084	0.0097
chr1:10178:CCTAA:C	1	10178	0.5	0.3333	0.2955	0.4375
chr1:10198:TAACCC:T	1	10198	0.0	0.0	0.0	0.0
chr1:10231:C:A	1	10231	0.2	0.0366	0.0	0.05
chr1:10236:AACCCT:A	1	10236	0.0	0.0	0.0	0.0
chr1:10247:TAAACCCTA:T	1	10247	0.2222	0.2089	0.1429	0.4211

The index must match with the ID in vcf file.

  • --pop, The file of family population information
%%writefile testdata/test_fam_pop.txt
1033 AF_amr
1036 AF_amr
10J_103 AF_afr
10J_109 AF_nfe
10J_112 AF_nfe
10J_119 AF_nfe
Writing ../testdata/test_fam_pop.txt
  • --included-vars, The file with one column of variants For example:
chr1:10140:ACCCTAAC:A
chr1:10172:CCCTAA:C
chr1:10198:TAACCC:T
chr1:10236:AACCCT:A
chr1:10261:T:TA
chr1:10262:AACCCT:A
  • --blueprint, The blueprint file that defines regional marker (format: "chr startpos endpos name avg.distance male.distance female.distance"). The first four columns are required.
%%writefile testdata/test_blueprint_ext.txt
1       11868   14362   LOC102725121@1  9.177127474362311e-07   1.1657192989882668e-06  6.814189157634088e-07
1       11873   14409   DDX11L1 9.195320788455595e-07   1.1680302941673515e-06  6.82769803434766e-07
1       14361   29370   WASH7P  1.5299877409602128e-06  1.94345806118021e-06    1.136044574393209e-06
1       17368   17436   MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1 1.217692507120495e-06   1.5467668502473368e-06  9.041595098829462e-07
1       30365   30503   MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1       2.1295973889038703e-06  2.705108741548526e-06   1.5812659765416382e-06
1       34610   36081   FAM138A@1,FAM138C@1,FAM138F@1   2.4732411024120156e-06  3.1416201771056266e-06  1.8364278747737466e-06
1       69090   70008   OR4F5   4.866641545668504e-06   6.181823219621424e-06   3.6135725636621673e-06
1       134772  140566  LOC729737       9.633289838108921e-06   1.2236630588823159e-05  7.152898262617822e-06
1       490755  495445  LOC100132062@1,LOC100132287@1   2.2828130832833112e-05  2.8997300893994373e-05  1.6950315013571593e-05
1       450739  451678  OR4F16@1,OR4F29@1,OR4F3@1       2.575942360468604e-05   3.2720758549649544e-05  1.912685483821856e-05
1       627379  629009  LOC101928626    3.943568768003252e-05   5.009295373297854e-05   2.9281737249900675e-05
1       632614  632685  MIR12136        3.974742311959244e-05   5.048893386847169e-05   2.9513206656665908e-05
Overwriting testdata/test_blueprint_ext.txt

Or

%%writefile testdata/test_blueprint.txt
1       11868   14362   LOC102725121@1
1       11873   14409   DDX11L1
1       14361   29370   WASH7P
1       17368   17436   MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
1       30365   30503   MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1
1       34610   36081   FAM138A@1,FAM138C@1,FAM138F@1
1       69090   70008   OR4F5
1       134772  140566  LOC729737
1       490755  495445  LOC100132062@1,LOC100132287@1
1       450739  451678  OR4F16@1,OR4F29@1,OR4F3@1
1       627379  629009  LOC101928626
1       632614  632685  MIR12136
Overwriting testdata/test_blueprint.txt

Output format

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SEQLinkage-2.0.1.tar.gz (4.4 MB view hashes)

Uploaded Source

Built Distribution

SEQLinkage-2.0.1-py3-none-any.whl (42.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page